KR102568145B1 - Method and TTS system for generating speech data using unvoice mel-spectrogram

Info

Publication number: KR102568145B1
Application number: KR1020200160380A
Authority: KR
Inventors: 강진범; 주동원; 남용욱
Original assignee: 주식회사 자이냅스
Priority date: 2020-11-25
Filing date: 2020-11-25
Publication date: 2023-08-18
Anticipated expiration: 2040-11-25
Also published as: KR20220072593A

Abstract

A method of generating voice data using speaker information and text may include generating a plurality of sub-mel-spectrograms based on the speaker information and the text, and generating voice data from the plurality of sub-mel-spectrograms.
The generating of the voice data may include generating a final mel-spectrogram by adding a silent mel-spectrogram between the plurality of sub-mel-spectrograms, and generating the voice data from the final mel-spectrogram.

Description

METHOD AND TTS SYSTEM FOR GENERATING SPEECH DATA USING UNVOICE MEL-SPECTROGRAM

The present disclosure relates to a method and a speech synthesis system for generating voice data using a silent mel-spectrogram.

With recent advances in artificial intelligence technology, interfaces that use voice signals are becoming commonplace. Accordingly, speech synthesis technology, which enables synthesized speech to be uttered according to a given situation, is being actively researched.

Speech synthesis technology is combined with artificial intelligence-based speech recognition technology and applied in many fields, such as virtual assistants, audiobooks, automatic interpretation and translation, and virtual voice actors.

Conventional speech synthesis methods include concatenative synthesis (Unit Selection Synthesis, USS) and statistical parametric synthesis (HMM-based Speech Synthesis, HTS). The USS method cuts voice data into phoneme units, stores them, and, at synthesis time, finds and concatenates the units suitable for the utterance. The HTS method extracts parameters corresponding to speech features to build a statistical model and reconstructs text into speech based on that model. However, these conventional methods have many limitations in synthesizing natural speech that reflects a speaker's speaking style or emotional expression.

Accordingly, speech synthesis methods that synthesize speech from text based on an artificial neural network have recently been attracting attention.

An object of the present disclosure is to provide an artificial intelligence-based speech synthesis technology capable of producing natural speech that sounds as if a real speaker were talking. Another object is to provide a highly efficient artificial intelligence-based speech synthesis technology that uses a small amount of training data.

The technical problems to be solved are not limited to those described above, and other technical problems may be inferred.

As a technical means for achieving the above-described technical problem, a first aspect of the present disclosure may provide a method of generating voice data using speaker information and text, the method including: generating a plurality of sub-mel-spectrograms based on the speaker information and the text; and generating voice data from the plurality of sub-mel-spectrograms, wherein the generating of the voice data includes: generating a final mel-spectrogram by adding a silent mel-spectrogram between the plurality of sub-mel-spectrograms; and generating the voice data from the final mel-spectrogram.

In addition, the generating of the final mel-spectrogram may include: identifying the last character of the text corresponding to a sub-mel-spectrogram; and generating the final mel-spectrogram by adding a silent mel-spectrogram having a first time to the sub-mel-spectrogram when the last character is a first-group character, and adding a silent mel-spectrogram having a second time to the sub-mel-spectrogram when the last character is a second-group character.
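As a non-limiting illustration, the insertion of silent mel-spectrograms between sub-mel-spectrograms may be sketched as follows. The sketch assumes each sub-mel-spectrogram is a NumPy array of shape (mel bins, frames). The character groups, the durations in frames, the silence level, and the names `build_final_mel` and `silent_mel` are hypothetical; this passage does not specify them.

```python
import numpy as np

# Hypothetical settings: the character groups and durations are assumptions.
FIRST_GROUP = {",", ";"}         # assumed "first group" characters
SECOND_GROUP = {".", "!", "?"}   # assumed "second group" characters
FIRST_TIME_FRAMES = 10           # assumed "first time", in frames
SECOND_TIME_FRAMES = 25          # assumed "second time", in frames
SILENCE_VALUE = -4.0             # assumed value of a silent mel bin


def silent_mel(n_mels, n_frames, value=SILENCE_VALUE):
    """Return a silent mel-spectrogram of shape (n_mels, n_frames)."""
    return np.full((n_mels, n_frames), value, dtype=np.float32)


def build_final_mel(sub_mels, sub_texts):
    """Concatenate sub-mel-spectrograms, inserting silence between them.

    The silence length depends on the last character of the text that
    corresponds to the preceding sub-mel-spectrogram.
    """
    pieces = []
    for i, (mel, text) in enumerate(zip(sub_mels, sub_texts)):
        pieces.append(mel)
        if i == len(sub_mels) - 1:
            break  # silence is added between sub-mels, not after the last one
        stripped = text.strip()
        last_char = stripped[-1] if stripped else ""
        if last_char in FIRST_GROUP:
            n_frames = FIRST_TIME_FRAMES
        elif last_char in SECOND_GROUP:
            n_frames = SECOND_TIME_FRAMES
        else:
            n_frames = 0
        if n_frames:
            pieces.append(silent_mel(mel.shape[0], n_frames))
    return np.concatenate(pieces, axis=1)
```

Under these assumptions, a sub-mel-spectrogram whose text ends in a comma is followed by a shorter silence than one whose text ends in a period, and the result is a single array that can be passed to a vocoder.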

In addition, the method may further include obtaining breath sound data as the speaker information, and the generating of the voice data may include: generating a final mel-spectrogram by adding, between the plurality of sub-mel-spectrograms, a breath sound mel-spectrogram generated from the breath sound data; and generating the voice data from the final mel-spectrogram.

According to the above-described means for solving the problems of the present disclosure, a mel-spectrogram is divided into sub-mel-spectrograms based on its silent portions, and voice data is generated from the sub-mel-spectrograms, so that more accurate voice data can be generated.

According to another of the means for solving the problems of the present disclosure, voice data is generated using a silent mel-spectrogram, so that more accurate voice data can be generated.

FIG. 1 is a diagram schematically illustrating the operation of a speech synthesis system.
FIG. 2 is a diagram illustrating an embodiment of a speech synthesis system.
FIG. 3 is a diagram illustrating an embodiment of a synthesizer of a speech synthesis system.
FIG. 4 is a diagram illustrating an embodiment of outputting a mel-spectrogram through a synthesizer.
FIG. 5 is a diagram illustrating a volume graph corresponding to a mel-spectrogram according to an embodiment.
FIGS. 6A and 6B are diagrams for explaining a process of dividing a mel-spectrogram into a plurality of sub-mel-spectrograms according to an embodiment.
FIG. 7 is a diagram for explaining a process of generating voice data from a plurality of sub-mel-spectrograms according to an embodiment.
FIG. 8 is a diagram for explaining an example of dividing a text sequence according to an embodiment.
FIG. 9 is a diagram for explaining a process of generating a final mel-spectrogram by adding a silent mel-spectrogram between sub-mel-spectrograms according to an embodiment.
FIG. 10 is a flowchart of a method of determining a silent portion of a mel-spectrogram according to an embodiment.

The terms used in the present embodiments have been selected, as far as possible, from general terms that are currently in wide use, in consideration of their functions in the present embodiments; however, they may vary depending on the intention of those skilled in the art, precedents, the emergence of new technologies, and the like. In certain cases, terms have been arbitrarily selected by the applicant, and in such cases their meanings are described in detail in the relevant part. Therefore, the terms used in the present embodiments should be defined based on their meanings and the overall content of the present embodiments, rather than on their names alone.

Since the present embodiments may be variously modified and may take various forms, some embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the present embodiments to a specific form of disclosure, and the embodiments should be understood to include all modifications, equivalents, and substitutes falling within their spirit and technical scope. The terms used in this specification are used only to describe the embodiments and are not intended to limit them.

Unless otherwise defined, the terms used in the present embodiments have the same meanings as commonly understood by a person of ordinary skill in the art to which the present embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in the present embodiments, should not be interpreted in an idealized or excessively formal sense.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein may be changed from one embodiment to another without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each embodiment may be changed without departing from the spirit and scope of the present invention. Therefore, the detailed description that follows is not to be taken in a limiting sense, and the scope of the present invention should be taken to encompass the scope of the appended claims and all equivalents thereof. In the drawings, like reference numerals denote the same or similar components throughout the various aspects.

Meanwhile, technical features that are described individually within a single drawing in this specification may be implemented individually or simultaneously.

Hereinafter, various embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention belongs can readily practice the present invention.

FIG. 1 is a diagram schematically illustrating the operation of a speech synthesis system.

A speech synthesis device is a device that converts text into human speech.

For example, the speech synthesis system 100 of FIG. 1 may be an artificial neural network-based speech synthesis system. An artificial neural network refers to any model in which artificial neurons, forming a network through synaptic connections, acquire problem-solving ability by changing the strength of those connections through learning.

The speech synthesis system 100 may be implemented in various types of devices, such as a personal computer (PC), a server device, a mobile device, and an embedded device. Specific examples include, but are not limited to, a smartphone, a tablet device, an augmented reality (AR) device, an Internet of Things (IoT) device, an autonomous vehicle, a robotics device, a medical device, an e-book reader, and a navigation device that perform speech synthesis using an artificial neural network.

Furthermore, the speech synthesis system 100 may correspond to a dedicated hardware accelerator (HW accelerator) mounted in such a device. Alternatively, the speech synthesis system 100 may be a hardware accelerator that is a dedicated module for running an artificial neural network, such as a neural processing unit (NPU), a tensor processing unit (TPU), or a Neural Engine, but is not limited thereto.

Referring to FIG. 1, the speech synthesis system 100 may receive a text input and specific speaker information. For example, the speech synthesis system 100 may receive "Have a good day!" as the text input, as shown in FIG. 1, and may receive "Speaker 1" as the speaker information input.

"Speaker 1" may correspond to a voice signal or voice sample representing preset speech characteristics of Speaker 1. For example, the speaker information may be received from an external device through a communication unit included in the speech synthesis system 100. Alternatively, the speaker information may be input by a user through a user interface of the speech synthesis system 100, or may be selected from among various pieces of speaker information stored in advance in a database of the speech synthesis system 100, but is not limited thereto.

The speech synthesis system 100 may output speech based on the received text input and specific speaker information. For example, the speech synthesis system 100 may receive "Have a good day!" and "Speaker 1" as inputs and output speech for "Have a good day!" that reflects the speech characteristics of Speaker 1. The speech characteristics of Speaker 1 may include at least one of various factors such as the voice, prosody, pitch, and emotion of Speaker 1. That is, the output speech may sound as if Speaker 1 were naturally pronouncing "Have a good day!" The detailed operation of the speech synthesis system 100 is described below with reference to FIGS. 2 to 4.

FIG. 2 is a diagram illustrating an embodiment of a speech synthesis system. The speech synthesis system 200 of FIG. 2 may be the same as the speech synthesis system 100 of FIG. 1.

Referring to FIG. 2, the speech synthesis system 200 may include a speaker encoder 210, a synthesizer 220, and a vocoder 230. Only the components related to one embodiment are shown in the speech synthesis system 200 of FIG. 2. Accordingly, it will be apparent to those skilled in the art that the speech synthesis system 200 may further include other general-purpose components in addition to those shown in FIG. 2.

The speech synthesis system 200 of FIG. 2 may receive speaker information and text as inputs and output speech.

For example, the speaker encoder 210 of the speech synthesis system 200 may receive speaker information as an input and generate a speaker embedding vector. The speaker information may correspond to a voice signal or voice sample of the speaker. The speaker encoder 210 may receive the voice signal or voice sample of the speaker, extract the speech characteristics of the speaker, and represent them as an embedding vector.

The speech characteristics of the speaker may include at least one of various factors such as speech rate, pause intervals, pitch, timbre, prosody, intonation, and emotion. That is, the speaker encoder 210 may represent the discrete data values included in the speaker information as a vector of continuous numbers. For example, the speaker encoder 210 may generate the speaker embedding vector based on at least one, or a combination of two or more, of various artificial neural network models such as a pre-net, a CBHG module, a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), and a bidirectional recurrent deep neural network (BRDNN).

For example, the synthesizer 220 of the speech synthesis system 200 may receive text and an embedding vector representing the speech characteristics of the speaker as inputs and output a spectrogram.

FIG. 3 is a diagram illustrating an embodiment of a synthesizer of a speech synthesis system. The synthesizer 300 of FIG. 3 may be the same as the synthesizer 220 of FIG. 2.

Referring to FIG. 3, the synthesizer 300 of the speech synthesis system 200 may include an encoder and a decoder. It will be apparent to those skilled in the art that the synthesizer 300 may further include other general-purpose components.

The embedding vector representing the speech characteristics of the speaker may be generated by the speaker encoder 210 as described above, and the encoder or the decoder of the synthesizer 300 may receive the embedding vector representing the speech characteristics of the speaker from the speaker encoder 210.

The encoder of the synthesizer 300 may receive text as an input and generate a text embedding vector. The text may include a sequence of characters in a specific natural language. For example, the sequence of characters may include alphabetic characters, numbers, punctuation marks, or other special characters.

The encoder of the synthesizer 300 may split the input text into jamo (Korean letter) units, character units, or phoneme units, and may input the split text into an artificial neural network model. For example, the encoder of the synthesizer 300 may generate the text embedding vector based on at least one, or a combination of two or more, of various artificial neural network models such as a pre-net, a CBHG module, a DNN, a CNN, an RNN, an LSTM, and a BRDNN.

Alternatively, the encoder of the synthesizer 300 may split the input text into a plurality of short texts and generate a plurality of text embedding vectors for each of the short texts.

The decoder of the synthesizer 300 may receive a speaker embedding vector and a text embedding vector from the speaker encoder 210 as inputs. Alternatively, the decoder of the synthesizer 300 may receive the speaker embedding vector from the speaker encoder 210 as an input and receive the text embedding vector from the encoder of the synthesizer 300 as an input.

The decoder of the synthesizer 300 may input the speaker embedding vector and the text embedding vector into an artificial neural network model to generate a spectrogram corresponding to the input text. That is, the decoder of the synthesizer 300 may generate a spectrogram for the input text in which the speech characteristics of the speaker are reflected. For example, the spectrogram may correspond to a mel-spectrogram, but is not limited thereto.

A spectrogram is a graph that visualizes the spectrum of a voice signal. The x-axis of the spectrogram represents time and the y-axis represents frequency, and the value at each frequency at each point in time may be expressed as a color according to its magnitude. A spectrogram may be the result of performing a short-time Fourier transform (STFT) on a continuously given voice signal.

The STFT is a method of dividing a voice signal into sections of a fixed length and applying a Fourier transform to each section. Because the result of performing the STFT on the voice signal is complex-valued, the absolute value of the complex values may be taken, which discards the phase information and yields a spectrogram containing only magnitude information.
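By way of illustration only, the framing, per-frame Fourier transform, and absolute-value steps described above may be sketched with NumPy as follows. The frame length, hop size, Hann window, and the name `stft_magnitude` are assumptions for this sketch and are not taken from the present disclosure.

```python
import numpy as np


def stft_magnitude(signal, frame_len=1024, hop=256):
    """Return a magnitude spectrogram of shape (frequency bins, frames).

    frame_len, hop, and the Hann window are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Split the signal into overlapping, windowed sections of fixed length.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Fourier transform of each section; the result is complex-valued.
    spectrum = np.fft.rfft(frames, axis=1)
    # The absolute value keeps the magnitude and discards the phase.
    return np.abs(spectrum).T
```

For a one-second 1 kHz sine sampled at 16 kHz, this sketch produces a real-valued array whose energy is concentrated around a single frequency bin in every frame.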

Meanwhile, a mel-spectrogram is a spectrogram whose frequency spacing has been rescaled to the mel scale. The human auditory system is more sensitive in the low-frequency band than in the high-frequency band, and the mel scale reflects this characteristic by expressing the relationship between physical frequency and the frequency actually perceived by a person. A mel-spectrogram may be generated by applying a filter bank based on the mel scale to a spectrogram.
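As a minimal sketch, a mel-scale filter bank may be built and applied to a magnitude spectrogram as shown below. The conversion formula used here, mel = 2595 * log10(1 + f / 700), is one commonly used definition of the mel scale and is an assumption; the present disclosure does not state which definition or filter shape is used. The triangular filters and all function names are likewise illustrative.

```python
import numpy as np


def hz_to_mel(f):
    """Convert frequency in Hz to mel (assumed formula, see lead-in)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def to_mel_spectrogram(magnitude, bank):
    """Apply the filter bank to a (frequency bins, frames) spectrogram."""
    return bank @ magnitude
```

Because the filters are evenly spaced in mel rather than in Hz, they are narrow at low frequencies and wide at high frequencies, which mirrors the sensitivity described above.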

Meanwhile, although not shown in FIG. 3, the synthesizer 300 may further include an attention module for generating an attention alignment. The attention module learns which of the encoder's outputs, across all of its time steps, is most related to the decoder's output at a specific time step. Using the attention module, a higher-quality spectrogram or mel-spectrogram may be output.

FIG. 4 is a diagram illustrating an embodiment of outputting a mel-spectrogram through a synthesizer. The synthesizer 400 of FIG. 4 may be the same as the synthesizer 300 of FIG. 3.

Referring to FIG. 4, the synthesizer 400 may receive a list including input texts and the speaker embedding vectors corresponding to them. For example, the synthesizer 400 may receive as an input a list 410 that includes the input text 'first sentence' and its corresponding speaker embedding vector embed_voice1, the input text 'second sentence' and its corresponding speaker embedding vector embed_voice2, and the input text 'third sentence' and its corresponding speaker embedding vector embed_voice3.

The synthesizer 400 may generate as many mel-spectrograms 420 as the number of input texts included in the received list 410. Referring to FIG. 4, it can be seen that mel-spectrograms corresponding to each of the input texts 'first sentence', 'second sentence', and 'third sentence' have been generated.

Alternatively, the synthesizer 400 may generate attention alignments together with as many mel-spectrograms 420 as the number of input texts. Although not shown in FIG. 4, attention alignments corresponding to each of the input texts 'first sentence', 'second sentence', and 'third sentence' may additionally be generated, for example. Alternatively, the synthesizer 400 may generate a plurality of mel-spectrograms and a plurality of attention alignments for each of the input texts.

Returning to FIG. 2, the vocoder 230 of the speech synthesis system 200 may generate actual speech from the spectrogram output by the synthesizer 220. As described above, the output spectrogram may be a mel-spectrogram.

In one embodiment, the vocoder 230 may generate an actual voice signal from the spectrogram output by the synthesizer 220 using an inverse short-time Fourier transform (ISFT). Because a spectrogram or mel-spectrogram does not contain phase information, phase information is not taken into account when the voice signal is generated using the ISFT.

In another embodiment, the vocoder 230 may generate an actual voice signal from the spectrogram output by the synthesizer 220 using the Griffin-Lim algorithm. The Griffin-Lim algorithm estimates phase information from the magnitude information of a spectrogram or mel-spectrogram.

Alternatively, the vocoder 230 may generate an actual voice signal from the spectrogram output by the synthesizer 220 based on, for example, a neural vocoder.

A neural vocoder is an artificial neural network model that receives a spectrogram or mel-spectrogram as an input and generates a voice signal. A neural vocoder can learn the relationship between a spectrogram or mel-spectrogram and a voice signal from a large amount of data, and can thereby generate a high-quality actual voice signal.

The neural vocoder may correspond to a vocoder based on an artificial neural network model such as WaveNet, Parallel WaveNet, WaveRNN, WaveGlow, or MelGAN, but is not limited thereto.

예를 들어, WaveNet 보코더는 여러 층의 dilated causal convolution layer들로 구성되며, 음성 샘플들 간의 순차적 특징을 이용하는 자기회귀(Autoregressive) 모델이다. WaveRNN 보코더는 WaveNet의 여러 층의 dilated causal convolution layer를 GRU(Gated Recurrent Unit)로 대체한 자기회귀 모델이다. WaveGlow 보코더는 가역성(invertible)을 지닌 변환 함수를 이용하여 스펙트로그램 데이터셋(x)으로부터 가우시안 분포와 같이 단순한 분포가 나오도록 학습할 수 있다. WaveGlow 보코더는 학습이 끝난 후 변환 함수의 역함수를 이용하여 가우시안 분포의 샘플로부터 음성 신호를 출력할 수 있다. For example, a WaveNet vocoder is composed of several dilated causal convolution layers and is an autoregressive model using sequential features between voice samples. WaveRNN vocoder is an autoregressive model that replaces multiple dilated causal convolution layers of WaveNet with GRU (Gated Recurrent Unit). The WaveGlow vocoder can learn to produce a simple distribution such as a Gaussian distribution from the spectrogram dataset (x) using an invertible transform function. After learning, the WaveGlow vocoder can output a voice signal from Gaussian distribution samples using the inverse function of the conversion function.
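The training and inference behavior described for the WaveGlow vocoder can be written compactly with the change-of-variables formula for an invertible transform. The symbols f_θ and z do not appear in the source and are introduced here only for illustration; x denotes the training data as in the paragraph above, and any conditioning input is omitted for brevity.

```latex
\begin{aligned}
z &= f_\theta(x), \qquad z \sim \mathcal{N}(0, I) \\
\log p_\theta(x) &= \log p_{\mathcal{N}}(z) + \log\left|\det \frac{\partial f_\theta(x)}{\partial x}\right| \\
\hat{x} &= f_\theta^{-1}(z)
\end{aligned}
```

The first two lines correspond to training (mapping data to a simple distribution and maximizing the likelihood), and the last line corresponds to generating output from a Gaussian sample with the inverse function.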

FIG. 5 is a diagram illustrating a volume graph corresponding to a mel-spectrogram according to an embodiment.

The mel-spectrogram 520 may include a plurality of frames. Referring to FIG. 5, the mel-spectrogram 520 may include 400 frames. The processor may generate the volume graph 510 by calculating the average energy of each frame. Among the frames of the mel-spectrogram 520, the strongly colored portions (for example, the yellow portions) have high volume values. The volume graph 510 generated from the mel-spectrogram 520 has a maximum value of 4.0 and a minimum value of -4.0.

The larger the average energy of a frame, the larger its volume value; the smaller the average energy, the smaller its volume value. That is, a frame with low average energy may correspond to a silent part.
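The per-frame volume calculation described above can be sketched as follows. This is only an illustration: the source does not define how the "average energy" is computed, so taking the mean over a frame's mel bins is an assumption, and the function name frame_volumes and the sample values are hypothetical.

```python
def frame_volumes(mel):
    """Return one volume value per frame: the mean over that frame's mel bins.

    Assumption: "average energy" is taken to be the arithmetic mean of the
    (already normalized) mel-bin values of the frame.
    """
    return [sum(frame) / len(frame) for frame in mel]


# Hypothetical mel-spectrogram with 3 frames and 4 mel bins per frame.
mel = [
    [-4.0, -4.0, -4.0, -4.0],  # quiet frame
    [1.0, 2.0, 3.0, 2.0],      # loud frame
    [-3.5, -3.0, -4.0, -3.5],  # quiet frame
]
volumes = frame_volumes(mel)
```

Plotting the resulting list against the frame index would give a curve analogous to the volume graph 510.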

The processor may determine the silent part of the mel-spectrogram 520. The processor may generate the volume graph 510 by calculating a volume value for each of the plurality of frames constituting the mel-spectrogram 520.

The processor may select, from among the plurality of frames, at least one frame whose volume value is less than or equal to a first threshold 511 as first sections 521a to 521f.

In one embodiment, the processor may determine the first sections 521a to 521f to be the silent part of the mel-spectrogram 520. For example, the first threshold 511 may be -3.0, -3.25, -3.5, or -3.75, but is not limited thereto. The first threshold 511 may be set differently depending on how much noise the mel-spectrogram 520 contains. For a noisy mel-spectrogram 520, the first threshold 511 may be set to a larger value.
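The selection of first sections can be illustrated with a minimal sketch that groups consecutive frames whose volume is at or below the first threshold. The function name low_volume_runs, the (start, end) representation with an inclusive end, and the sample volumes are assumptions made for illustration.

```python
def low_volume_runs(volumes, first_threshold):
    """Group consecutive frames with volume <= first_threshold.

    Returns a list of (start, end) frame indices, end inclusive.
    """
    runs, start = [], None
    for i, v in enumerate(volumes):
        if v <= first_threshold:
            if start is None:
                start = i          # a new low-volume run begins here
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:          # run extends to the last frame
        runs.append((start, len(volumes) - 1))
    return runs


# Hypothetical volume values for 9 frames.
volumes = [-4.0, -3.8, 1.2, 0.5, -3.6, -3.9, -3.7, 2.0, -3.5]
runs = low_volume_runs(volumes, first_threshold=-3.5)
```

Raising first_threshold, as suggested for noisy input, makes more frames qualify and therefore lengthens or merges the runs.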

In another embodiment, the processor may select, from among the first sections 521a to 521f, the sections whose number of frames is greater than or equal to a second threshold as second sections 521c and 521e. The processor may determine the second sections 521c and 521e of the mel-spectrogram 520 to be the silent part. For example, the second threshold may be 3, 4, 5, 6, or 7, but is not limited thereto. The second threshold may be determined based on the overlap value and the hop size set in WaveRNN, one of the vocoders, for generating voice from the mel-spectrogram 520. The overlap is the length over which WaveRNN crossfades between batches when generating voice data. For example, when the overlap value is 1200 and the hop size is 300, the overlap spans 1200 / 300 = 4 frames, so it is desirable for the volume values of four consecutive frames to be less than or equal to the first threshold 511, and the second threshold may therefore be set to 4 or 5.
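The relationship between the overlap value, the hop size, and the second threshold can be sketched as below. Rounding up with a ceiling division and the helper names min_silent_frames and long_runs are assumptions; the source only gives the example of overlap 1200 and hop size 300.

```python
def min_silent_frames(overlap, hop_size):
    """Number of frames needed to cover the crossfade: ceil(overlap / hop_size)."""
    return -(-overlap // hop_size)


def long_runs(runs, second_threshold):
    """Keep only (start, end) runs, end inclusive, with at least second_threshold frames."""
    return [(s, e) for s, e in runs if e - s + 1 >= second_threshold]


second_threshold = min_silent_frames(overlap=1200, hop_size=300)

# Hypothetical low-volume runs; only the sufficiently long ones are kept.
kept = long_runs([(0, 1), (123, 141), (200, 202), (280, 286)], second_threshold)
```

With these inputs the threshold evaluates to 4, matching the lower of the two values mentioned in the example above.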

Referring to FIG. 5, the processor may determine the sections [[123, 132, 141], [280, 283, 286]] of the mel-spectrogram 520 to be the silent part. Each entry in this list has the form [start point of the silent part, middle value, end point of the silent part]. Meanwhile, even if the volume values of the frames in the first of the first sections, 521a, are less than or equal to the first threshold 511 and its number of frames is greater than or equal to the second threshold, section 521a may be excluded from the silent part because it is the section in which the voice begins. However, when the processor later generates voice data from the mel-spectrogram 520, it may set the starting point of the voice to after section 521a.
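The [start, middle, end] representation and the exclusion of the leading section can be illustrated as follows. Treating "the section in which the voice begins" as any run that starts at frame 0, and computing the middle value as the integer midpoint, are assumptions; the function name silent_sections is hypothetical.

```python
def silent_sections(runs):
    """Convert (start, end) runs into [start, middle, end] entries.

    Assumption: a run that starts at frame 0 is the leading section where
    the voice begins, so it is skipped.
    """
    return [[s, (s + e) // 2, e] for s, e in runs if s != 0]


# Hypothetical runs: a leading run plus the two sections from the example.
sections = silent_sections([(0, 10), (123, 141), (280, 286)])
```

The integer midpoint reproduces the middle values 132 and 283 given in the example above.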

FIGS. 6A and 6B are diagrams for explaining a process of dividing a mel-spectrogram into a plurality of sub mel-spectrograms according to an embodiment.

The mel-spectrogram 620 of FIG. 6A may correspond to the mel-spectrogram 520 of FIG. 5.

The processor may divide the mel-spectrogram 620 into a plurality of sub mel-spectrograms 631, 632, and 633 based on the second sections 521c and 521e determined to be the silent part in FIG. 5. When the first of the second sections, 521c, is [123, 132, 141] and the second of the second sections, 521e, is [280, 283, 286], the processor may determine the middle value of section 521c as a first split point 641 and the middle value of section 521e as a second split point 642.

The processor may generate the plurality of sub mel-spectrograms 631, 632, and 633 by dividing the mel-spectrogram 620 at the first split point 641 and the second split point 642.

The processor may calculate the length of each of the plurality of sub mel-spectrograms 631, 632, and 633. Because the processor divides the mel-spectrogram 620 into the sub mel-spectrograms 631, 632, and 633 based on the silent part of the mel-spectrogram 620, the sub mel-spectrograms 631, 632, and 633 may have different lengths.

Referring to FIG. 6A, the first sub mel-spectrogram 631 has a length of 132, corresponding to the section [0, 132]; the second sub mel-spectrogram 632 has a length of 150, corresponding to the section [133, 283]; and the third sub mel-spectrogram 633 has a length of 114, corresponding to the section [284, 398].
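The sections and lengths in the example above can be reproduced with a short sketch. It follows the notation of the example, in which each segment ends at a split point, the next segment starts one frame later, and the length is computed as end minus start; counting frames inclusively would instead give one more frame per segment. The function name split_intervals and the last_frame argument are hypothetical.

```python
def split_intervals(last_frame, split_points):
    """Return [start, end] sections of a mel-spectrogram split at split_points.

    Each section ends at a split point; the next one starts one frame later.
    """
    intervals, start = [], 0
    for point in split_points:
        intervals.append([start, point])
        start = point + 1
    intervals.append([start, last_frame])
    return intervals


intervals = split_intervals(last_frame=398, split_points=[132, 283])

# Length as used in the example: end index minus start index.
lengths = [end - start for start, end in intervals]
```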

Referring to FIG. 6B, the processor may post-process the plurality of sub mel-spectrograms 631, 632, and 633 so that their lengths become equal to a reference batch length. In one embodiment, the reference batch length may be a preset value.

In another embodiment, the reference batch length may be set to the length of the longest of the sub mel-spectrograms 631, 632, and 633. For example, when the first sub mel-spectrogram 631 has a length of 132, the second sub mel-spectrogram 632 has a length of 150, and the third sub mel-spectrogram 633 has a length of 114, the reference batch length may be set to 150.

To make the lengths of the sub mel-spectrograms 631, 632, and 633 equal to the reference batch length, the processor may apply zero-padding to each sub mel-spectrogram that is shorter than the reference batch length. For example, when the reference batch length is set to 150, the processor may apply zero-padding to the first sub mel-spectrogram 631 and the third sub mel-spectrogram 633.

Referring to FIG. 6B, the post-processed sub mel-spectrograms 651, 652, and 653 all have the same length (for example, 150).
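The zero-padding step can be sketched as follows. Representing a sub mel-spectrogram as a list of frames, each frame a list of mel-bin values, is an assumption, as are the function name zero_pad and the placeholder data; when no batch length is given, the longest sub mel-spectrogram sets the reference batch length, as in the second embodiment above.

```python
def zero_pad(sub_mels, batch_length=None):
    """Append all-zero frames so every sub mel-spectrogram has batch_length frames."""
    if batch_length is None:
        # Use the longest sub mel-spectrogram as the reference batch length.
        batch_length = max(len(m) for m in sub_mels)
    padded = []
    for m in sub_mels:
        n_bins = len(m[0])
        zeros = [[0.0] * n_bins for _ in range(batch_length - len(m))]
        padded.append(m + zeros)
    return padded


# Hypothetical sub mel-spectrograms with 2 mel bins and 132, 150, and 114 frames.
sub_mels = [[[1.0, 1.0]] * n for n in (132, 150, 114)]
padded = zero_pad(sub_mels)
```

The original lengths are kept in sub_mels, which matters later because only the unpadded portion of each result carries valid data.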

FIG. 7 is a diagram for explaining a process of generating voice data from a plurality of sub mel-spectrograms according to an embodiment.

Referring to FIG. 7, the post-processed sub mel-spectrograms 751, 752, and 753 all have the same length (for example, 150). Each of the post-processed sub mel-spectrograms 751, 752, and 753 of FIG. 7 may correspond to the respective post-processed sub mel-spectrograms 651, 652, and 653 of FIG. 6B.

The processor may generate a plurality of sub voice data 761, 762, and 763 from the respective post-processed sub mel-spectrograms 751, 752, and 753. For example, the processor may generate the sub voice data 761, 762, and 763 from the respective post-processed sub mel-spectrograms 751, 752, and 753 using the inverse short-time Fourier transform (ISTFT) or the Griffin-Lim algorithm.

In one embodiment, the processor may determine reference sections 771, 772, and 773 for the respective sub voice data 761, 762, and 763 based on the lengths of the sub mel-spectrograms 631, 632, and 633 before post-processing.

For example, the post-processed sub mel-spectrograms 751, 752, and 753 all have a length of 150, but the first sub mel-spectrogram 631 corresponding to the first post-processed sub mel-spectrogram 751 has a length of 132, so the first post-processed sub mel-spectrogram 751 contains valid data only up to a length of 132. For the same reason, the third post-processed sub mel-spectrogram 753 contains valid data only up to a length of 114, whereas the second post-processed sub mel-spectrogram 752 may contain valid data over its entire length of 150.

The processor may determine the length of the first reference section 771 of the first sub voice data 761, generated from the first post-processed sub mel-spectrogram 751, to be 132; the length of the second reference section 772 of the second sub voice data 762, generated from the second post-processed sub mel-spectrogram 752, to be 150; and the length of the third reference section 773 of the third sub voice data 763, generated from the third post-processed sub mel-spectrogram 753, to be 114.

The processor may generate the voice data 780 by concatenating the first reference section 771, the second reference section 772, and the third reference section 773.

The processor may generate voice data from the plurality of sub mel-spectrograms based on the length of each sub mel-spectrogram and a preset hop size. Specifically, the processor may determine the reference sections 771, 772, and 773 for the respective sub voice data 761, 762, and 763 by multiplying the length of each sub mel-spectrogram by the hop size (for example, 300), which corresponds to the length of voice data covered by one frame of the mel-spectrogram.
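The reference-section calculation and the concatenation can be illustrated with the sketch below. It assumes that each reference section starts at the beginning of its sub voice data and that a vocoder produces hop_size samples per frame; the function name join_reference_sections and the placeholder audio are hypothetical.

```python
def join_reference_sections(sub_audio, original_lengths, hop_size):
    """Keep the first (length * hop_size) samples of each sub voice data and concatenate them."""
    voice = []
    for samples, n_frames in zip(sub_audio, original_lengths):
        voice.extend(samples[: n_frames * hop_size])
    return voice


hop_size = 300
original_lengths = [132, 150, 114]  # frame counts before zero-padding

# Placeholder audio: each padded 150-frame input yields 150 * 300 samples.
sub_audio = [[0.0] * (150 * hop_size) for _ in original_lengths]
voice = join_reference_sections(sub_audio, original_lengths, hop_size)
```

With the example values, the reference sections are 39,600, 45,000, and 34,200 samples long, so the samples generated from the zero-padded frames are discarded.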

Meanwhile, the processor of FIGS. 5, 6A, 6B, and 7 may be hardware included in the vocoder of the speech synthesis system and/or separate, independent hardware.

FIG. 8 is a diagram for explaining an example of dividing a text sequence according to an embodiment.

FIG. 8 shows an example of a text sequence written in a specific natural language. The text sequence also includes punctuation marks. For example, the punctuation marks may be '.', ',', '?', '!', and the like.

The speech synthesis system 100, 200 identifies the characters and punctuation marks included in the text sequence. The speech synthesis system 100, 200 then checks whether the punctuation marks included in the text sequence are predetermined punctuation marks.

If they are, the speech synthesis system 100, 200 may divide the text sequence at those punctuation marks. For example, when the punctuation marks '.', ',', '?', and '!' are the predetermined punctuation marks, the speech synthesis system 100, 200 may generate subsequences by dividing the text sequence at the predetermined punctuation marks.
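The division of a text sequence at predetermined punctuation marks can be sketched with a regular expression. Keeping each punctuation mark at the end of its subsequence and trimming surrounding whitespace are assumptions made for illustration; the function name split_on_punctuation and the sample sentence are hypothetical.

```python
import re


def split_on_punctuation(text, marks=".,?!"):
    """Split text after each predetermined punctuation mark.

    Each subsequence keeps its trailing punctuation mark, if any.
    """
    cls = re.escape(marks)
    # One or more non-mark characters, optionally followed by a single mark.
    pieces = re.findall(rf"[^{cls}]+[{cls}]?", text)
    return [p.strip() for p in pieces if p.strip()]


subsequences = split_on_punctuation("Hello, world. How are you?")
```

Keeping the punctuation mark with its subsequence makes the last character of each subsequence available for later processing.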

FIG. 9 is a diagram for explaining a process of generating a final mel-spectrogram by adding silent mel-spectrograms between sub mel-spectrograms according to an embodiment.

Referring to FIG. 9, the speech synthesis system 100, 200 may divide a text sequence 910 into a plurality of subsequences 911, 912, and 913. The method of dividing the text sequence 910 into the subsequences 911, 912, and 913 has been described above with reference to FIG. 8 and is not repeated here.

The speech synthesis system 100, 200 may generate a plurality of sub mel-spectrograms 921, 922, and 923 using the subsequences 911, 912, and 913. Specifically, the speech synthesis system 100, 200 may generate the sub mel-spectrograms 921, 922, and 923 based on the subsequences 911, 912, and 913, which correspond to the text, and on speaker information. The speech synthesis system 100, 200 may also generate voice data from the sub mel-spectrograms 921, 922, and 923. The details have been described above with reference to FIGS. 1 to 4 and are not repeated here.

In one embodiment, the speech synthesis system 100, 200 may generate a final mel-spectrogram 940 by adding silent mel-spectrograms 931 and 932 between the sub mel-spectrograms 921, 922, and 923. The speech synthesis system 100, 200 may generate voice data from the final mel-spectrogram 940.

Specifically, the speech synthesis system 100, 200 may identify the last character of each of the subsequences 911, 912, and 913 (that is, the text) corresponding to the sub mel-spectrograms 921, 922, and 923. When the last character is a first-group character, the speech synthesis system 100, 200 may generate the final mel-spectrogram by adding a silent mel-spectrogram of a first duration to the sub mel-spectrogram. When the last character is a second-group character, the speech synthesis system 100, 200 may generate the final mel-spectrogram by adding a silent mel-spectrogram of a second duration to the sub mel-spectrogram.

For example, the first-group characters correspond to a short pause and may include ',' and ' '. The second-group characters correspond to a long pause and may include '.', '?', and '!'. In this case, the first duration may be set to a reference duration, and the second duration may be set to three times the reference duration.
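The insertion of silent mel-spectrograms according to the last character of each subsequence can be sketched as follows. Expressing the reference duration as a frame count (base_frames), filling silent frames with a fixed low value, and inserting silence only between sub mel-spectrograms (not after the last one) are assumptions; the names build_final_mel and silence_frames and the sample data are hypothetical.

```python
SHORT_PAUSE = {",", " "}       # first-group characters
LONG_PAUSE = {".", "?", "!"}   # second-group characters


def silence_frames(last_char, base_frames):
    """Number of silent frames to add after a subsequence ending in last_char."""
    if last_char in LONG_PAUSE:
        return 3 * base_frames  # second duration: three times the reference
    if last_char in SHORT_PAUSE:
        return base_frames      # first duration: the reference duration
    return 0


def build_final_mel(sub_mels, subsequences, base_frames, silent_frame):
    """Concatenate sub mel-spectrograms with silent frames between them."""
    final = []
    for i, (mel, text) in enumerate(zip(sub_mels, subsequences)):
        final.extend(mel)
        if i < len(sub_mels) - 1:
            n = silence_frames(text[-1], base_frames)
            final.extend([list(silent_frame) for _ in range(n)])
    return final


# Hypothetical sub mel-spectrograms with 2 mel bins per frame.
sub_mels = [[[1.0, 1.0]] * 5, [[2.0, 2.0]] * 4, [[3.0, 3.0]] * 6]
subsequences = ["Hello,", "how are you?", "Fine."]
final = build_final_mel(sub_mels, subsequences, base_frames=2, silent_frame=[-4.0, -4.0])
```

In this example a comma contributes 2 silent frames and a question mark contributes 6, so the final mel-spectrogram has 5 + 2 + 4 + 6 + 6 frames.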

Meanwhile, the speech synthesis system 100, 200 may classify the characters of the subsequences 911, 912, and 913 into two or more groups, and the duration of the silent mel-spectrogram corresponding to each group is likewise not limited to the example above.

In another embodiment, the speech synthesis system 100, 200 may generate the final mel-spectrogram by adding a breath-sound mel-spectrogram between the sub mel-spectrograms 921, 922, and 923. To this end, the speech synthesis system 100, 200 may obtain breath-sound data as speaker information.

In another embodiment, the speech synthesis system 100, 200 may identify the last character of each of the subsequences 911, 912, and 913 (that is, the text) corresponding to the sub mel-spectrograms 921, 922, and 923. When the last character is a first-group character, the speech synthesis system 100, 200 may generate the final mel-spectrogram by adding a silent mel-spectrogram of a predetermined duration to the sub mel-spectrogram. When the last character is a second-group character, the speech synthesis system 100, 200 may generate the final mel-spectrogram by adding a breath-sound mel-spectrogram to the sub mel-spectrogram. For example, the first-group characters correspond to a short pause and may include ',' and ' '. The second-group characters correspond to a long pause and may include '.', '?', and '!'.

In another embodiment, the speech synthesis system 100, 200 may generate the final mel-spectrogram by adding silent mel-spectrograms of arbitrary duration between the sub mel-spectrograms 921, 922, and 923.

FIG. 10 is a flowchart of a method for determining a silent part of a mel-spectrogram according to an embodiment.

Referring to FIG. 10, in step 1010 the processor may receive speaker information and generate a speaker embedding vector based on the speaker information.

The speaker information may correspond to a voice signal or voice sample of the speaker. The processor may receive the voice signal or voice sample of the speaker, extract the speaker's speech characteristics, and represent them as an embedding vector.

The speaker's speech characteristics may include at least one of various factors such as speech rate, pauses, pitch, timbre, prosody, intonation, and emotion. That is, the processor may represent the discrete data values included in the speaker information as a vector of continuous numbers. For example, the processor may generate the speaker embedding vector based on at least one, or a combination of two or more, of various artificial neural network models such as a pre-net, a CBHG module, a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), and a bidirectional recurrent deep neural network (BRDNN). In one embodiment, step 1010 may be performed by the speaker encoder 210 of FIG. 2.

In step 1020, the processor may receive text and generate a text embedding vector based on the text.

The text may include a sequence of characters in a specific natural language. For example, the sequence of characters may include alphabetic characters, digits, punctuation marks, or other special characters.

The processor may split the input text into jamo (consonant and vowel letter) units, character units, or phoneme units, and may input the split text into an artificial neural network model. For example, the processor may generate the text embedding vector based on at least one, or a combination of two or more, of various artificial neural network models such as a pre-net, a CBHG module, a DNN, a CNN, an RNN, an LSTM, and a BRDNN.

Alternatively, the processor may split the input text into a plurality of short texts and generate a plurality of text embedding vectors for the respective short texts. In one embodiment, step 1020 may be performed by the synthesizer 220 of FIG. 2.

In step 1030, the processor may generate a mel-spectrogram based on the speaker embedding vector and the text embedding vector.

The processor may receive the speaker embedding vector and the text embedding vector as inputs. The processor may input the speaker embedding vector and the text embedding vector into an artificial neural network model to generate a spectrogram corresponding to the input text. That is, the processor may generate a spectrogram for the input text that reflects the speaker's speech characteristics. For example, the spectrogram may be a mel-spectrogram, but is not limited thereto.

A spectrogram is a graph that visualizes the spectrum of a voice signal. The x-axis of the spectrogram represents time and the y-axis represents frequency, and the value at each time and frequency may be expressed as a color according to its magnitude. The spectrogram may be the result of applying a short-time Fourier transform (STFT) to a continuous voice signal. A mel-spectrogram is a spectrogram whose frequency spacing has been rescaled to the mel scale. In one embodiment, step 1030 may be performed by the synthesizer 220 of FIG. 2.
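For reference, one widely used definition of the mel scale maps a frequency f in hertz to a mel value m as shown below. The source does not state which definition it uses, so this formula is given only as an illustrative assumption.

```latex
m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)
```

Under this mapping, equal steps in m correspond to progressively wider steps in f, so higher frequencies are represented more coarsely than lower ones.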

In step 1040, the processor may determine a silent part of the mel-spectrogram.

The processor may generate a volume graph by calculating a volume value for each of the plurality of frames constituting the mel-spectrogram. The processor may select, from among the plurality of frames, at least one frame whose volume value is less than or equal to a first threshold as a first section. In one embodiment, the processor may determine the first section to be the silent part of the mel-spectrogram.

In another embodiment, the processor may select, from among the first sections, a section whose number of frames is greater than or equal to a second threshold as a second section. The processor may determine the second section of the mel-spectrogram to be the silent part. For example, the second threshold may be determined based on the overlap value and the hop size set in WaveRNN, one of the vocoders, for generating voice from the mel-spectrogram 520.

In step 1050, the processor may divide the mel-spectrogram into a plurality of sub mel-spectrograms based on the silent part.

The processor may calculate the length of each of the plurality of sub mel-spectrograms. The processor may post-process the sub mel-spectrograms so that their lengths become equal to a reference batch length. In one embodiment, the reference batch length may be a preset value. In another embodiment, the reference batch length may be set to the length of the longest of the sub mel-spectrograms.

The processor may generate the voice data from the post-processed sub mel-spectrograms. To make the length of each sub mel-spectrogram equal to the preset batch length, the processor may post-process the sub mel-spectrograms by applying zero-padding to each sub mel-spectrogram that is shorter than the reference batch length.

단계 1060에서 프로세서는 복수의 서브 멜-스펙트로그램으로부터 음성 데이터를 생성할 수 있다.In step 1060, the processor may generate voice data from a plurality of sub-MEL-spectrograms.

프로세서는 후처리된 복수의 서브 멜-스펙트로그램 각각으로부터 복수의 서브 음성 데이터를 생성할 수 있다. 프로세서는 복수의 서브 멜-스펙트로그램 각각의 길이에 기초하여, 복수의 서브 음성 데이터 각각에 대한 참조 구간을 결정할 수 있다.The processor may generate a plurality of sub voice data from each of the plurality of post-processed sub mel-spectrograms. The processor may determine a reference interval for each of the plurality of sub-speech data based on the length of each of the plurality of sub-Mel-spectrograms.

프로세서는 복수의 서브 멜-스펙트로그램 각각의 길이와 기설정된 홉 사이즈(hop size)에 기초하여, 복수의 서브 멜-스펙트로그램으로부터 음성 데이터를 생성할 수 있다. 구체적으로, 프로세서는 복수의 서브 멜-스펙트로그램 각각의 길이와, 멜-스펙트로그램의 하나의 프레임이 커버하는 음성 데이터의 길이에 해당하는 홉 사이즈를 곱함으로써, 복수의 서브 음성 데이터 각각에 대한 참조 구간을 결정할 수 있다.The processor may generate voice data from a plurality of sub-mel-spectrograms based on a length of each of the plurality of sub-mel-spectrograms and a preset hop size. Specifically, the processor multiplies the length of each of the plurality of sub-mel-spectrograms by the hop size corresponding to the length of speech data covered by one frame of the mel-spectrogram, thereby obtaining a reference for each of the plurality of sub-speech data section can be determined.

The processor may generate the voice data by connecting the reference intervals.
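The determination of the reference intervals and their connection can be sketched as follows. This is a hedged illustration, not the disclosed implementation. The function name `connect_reference_intervals` is hypothetical, and the sketch assumes that padding was appended at the end of each sub mel-spectrogram, so that each reference interval starts at the first sample of the corresponding sub voice data.

```python
import numpy as np

def connect_reference_intervals(sub_audio, lengths, hop_size):
    """Keep the valid part of each sub voice data and concatenate the parts.

    sub_audio: list of 1-D waveforms generated from padded sub mel-spectrograms.
    lengths: frame count of each sub mel-spectrogram before padding.
    hop_size: number of audio samples covered by one mel-spectrogram frame.
    """
    pieces = []
    for wav, n_frames in zip(sub_audio, lengths):
        # Reference interval length = (frames before padding) x (hop size).
        n_samples = n_frames * hop_size
        pieces.append(wav[:n_samples])
    return np.concatenate(pieces)

# Two waveforms generated from a batch padded to 5 frames, with hop size 256.
hop_size = 256
sub_audio = [np.zeros(5 * hop_size), np.zeros(5 * hop_size)]
voice = connect_reference_intervals(sub_audio, lengths=[3, 5], hop_size=hop_size)
print(voice.shape)  # (3 + 5) * 256 = 2048 samples
```

With a hop size of 256 samples, a sub mel-spectrogram of 3 frames yields a reference interval of 768 samples, so the audio generated from the 2 padded frames is discarded before concatenation.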

In this specification, a "unit" may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

The foregoing description is provided for illustration, and those of ordinary skill in the art to which this specification pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present embodiment is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within that scope.

Claims

A method for generating voice data using speaker information and text, the method comprising:
generating a mel-spectrogram based on a speaker embedding vector and a text embedding vector;
determining a silent part in the mel-spectrogram;
dividing the mel-spectrogram into a plurality of sub mel-spectrograms based on the silent part; and
generating voice data from the plurality of sub mel-spectrograms,
wherein the generating of the voice data comprises:
identifying the last character of the text corresponding to the plurality of sub mel-spectrograms;
adding a silent mel-spectrogram to each of the plurality of sub mel-spectrograms based on the last character of the text;
post-processing the plurality of sub mel-spectrograms to which the silent mel-spectrograms are added, such that the length of each of the plurality of sub mel-spectrograms to which the silent mel-spectrograms are added equals a reference batch length;
generating a plurality of sub voice data from each of the plurality of post-processed sub mel-spectrograms;
determining a reference interval for each of the plurality of sub voice data based on the length of each of the plurality of sub mel-spectrograms before the post-processing, the reference interval being an interval that contains valid data in each of the plurality of post-processed sub mel-spectrograms; and
generating the voice data by connecting the reference intervals,
and wherein the determining of the silent part comprises:
calculating a volume value for each of a plurality of frames constituting the mel-spectrogram;
selecting, from among the plurality of frames, at least one frame having a volume value equal to or less than a first threshold value as a first interval; and
determining the first interval as the silent part.

The method of claim 1,
wherein the adding of the silent mel-spectrogram comprises:
generating a final mel-spectrogram by adding, to the sub mel-spectrogram, a silent mel-spectrogram having a first duration when the last character is a first group character, and a silent mel-spectrogram having a second duration when the last character is a second group character.

(Deleted)