EP1526508B1

EP1526508B1 - Method for the selection of synthesis units

Info

Publication number: EP1526508B1
Application number: EP04105204A
Authority: EP
Inventors: François THALES Intellectual Property CAPMAN; Marc THALES Intellectual Property PADELLINI
Original assignee: Thales SA
Current assignee: Thales SA
Priority date: 2003-10-24
Filing date: 2004-10-21
Publication date: 2009-05-27
Anticipated expiration: 2024-10-21
Also published as: US8195463B2; US20050137871A1; ES2326646T3; EP1526508A1; DE602004021221D1; ATE432525T1; FR2861491B1; FR2861491A1

Abstract

The method involves determining a value of a mean fundamental frequency for information segment e.g. speech segment. A sub-set of synthesis units whose mean fundamental frequency values are closer to the fundamental frequency value, is selected. Multiple proximity criteria are applied to the selected synthesis units for determining a synthesis unit representative of the information segment. The proximity criteria includes the fundamental frequency or pitch, spectral distortion, and/or energy profile. An independent claim is also included for the utilization of a synthesis unit selecting method for the selection and/or coding of the synthesis units for a very low bandwidth voice coder.

Description

L'invention concerne un procédé de sélection d'unités de synthèse.The invention relates to a method for selecting synthesis units.

Elle concerne par exemple un procédé de sélection et de codage d'unités de synthèse pour un codeur de parole très bas débit, par exemple inférieur à 600 bits/sec.It relates for example to a method for selecting and coding synthesis units for a very low bit rate speech coder, for example less than 600 bits / sec.

Les techniques d'indexation d'unités de parole naturelle ont récemment permis le développement de systèmes de synthèse à partir du texte particulièrement performants. Ces techniques sont dorénavant étudiées dans le cadre du codage à très bas débit de la parole, conjointement avec des algorithmes empruntés au domaine de la reconnaissance vocale, Ref [1-5]. L'idée principale consiste à identifier dans le signal de parole à coder, une segmentation quasi optimale en unités élémentaires. Ces unités peuvent être des unités obtenues à partir d'une transcription phonétique, qui a l'inconvénient de devoir être corrigée manuellement pour un résultat optimal, ou de façon automatique selon des critères de stabilité spectrale. A partir de ce type de segmentation, et pour chacun des segments, on cherche l'unité de synthèse la plus proche dans un dictionnaire obtenu lors d'une phase d'apprentissage préalable, et contenant des unités de synthèse de référence.The techniques of indexing natural speech units have recently led to the development of highly efficient text synthesis systems. These techniques are now studied in the context of very low speech rate coding, together with algorithms borrowed from the field of speech recognition, Ref [1-5]. The main idea consists in identifying in the speech signal to be coded, an almost optimal segmentation into elementary units. These units can be units obtained from a phonetic transcription, which has the disadvantage of having to be corrected manually for an optimal result, or automatically according to spectral stability criteria. From this type of segmentation, and for each segment, we look for the closest synthesis unit in a dictionary obtained during a preliminary learning phase, and containing reference synthesis units.

Le schéma de codage utilisé consiste à modéliser l'espace acoustique du locuteur (ou des locuteurs) par des modèles de Markov cachés (HMM ou Hidden Markov Models). Ces modèles dépendants ou indépendants du locuteur sont obtenus lors d'une phase d'apprentissage préalable à partir d'algorithmes identiques à ceux mis en oeuvre dans les systèmes de reconnaissance de la parole. La différence essentielle réside dans le fait que les modèles sont appris sur des vecteurs regroupés par classes de façon automatique et non de manière supervisée à partir d'une transcription phonétique. La procédure d'apprentissage consiste alors à obtenir de façon automatique la segmentation des signaux d'apprentissage (par exemple en utilisant la méthode dite de décomposition temporelle), à regrouper les segments obtenus dans un nombre fini de classes correspondant au nombre de modèles HMM que l'on souhaite construire. Le nombre de modèles est directement lié à la résolution recherchée pour représenter l'espace acoustique du ou des locuteurs. Une fois obtenus, ces modèles permettent de segmenter le signal à coder en utilisant un algorithme de Viterbi. La segmentation permet d'associer à chaque segment, l'indice de classe et sa longueur. Cette information n'étant pas suffisante pour modéliser l'information spectrale, pour chacune des classes une réalisation de trajectoire spectrale est sélectionnée parmi plusieurs unités dites de synthèse. Ces unités sont extraites de la base d'apprentissage lors de sa segmentation utilisant les modèles HMM. Il est possible de tenir compte du contexte par exemple en utilisant plusieurs sous-classes permettant de prendre en compte les transitions d'une classe vers l'autre. Un premier indice indique la classe à laquelle appartient le segment considéré, un deuxième indice précise la sous-classe à laquelle il appartient comme étant l'indice de classe du segment précédent. L'indice de sous-classe n'est donc pas à transmettre, et l'indice de classe doit être mémorisé pour le segment suivant. Les sous-classes ainsi définies permettent de tenir compte des différentes transitions vers la classe associée au segment considéré. A l'information spectrale on ajoute l'information de prosodie, c'est-à-dire la valeur des paramètres de pitch et d'énergie et leurs évolutions.The coding scheme used consists of modeling the acoustic space of the speaker (or speakers) by Hidden Markov Models (HMM). These dependent or speaker-independent models are obtained during a prior learning phase from algorithms identical to those used in speech recognition systems. The essential difference lies in the fact that the models are learned on vectors grouped by classes in an automatic way and not in a supervised way from a phonetic transcription. The learning procedure then consists of automatically obtaining the segmentation of the learning signals (for example by using the so-called temporal decomposition method), group the segments obtained in a finite number of classes corresponding to the number of HMM models that one wishes to build. The number of models is directly related to the resolution sought to represent the acoustic space of the speaker or speakers. Once obtained, these models make it possible to segment the signal to be encoded by using a Viterbi algorithm. Segmentation allows to associate to each segment, the class index and its length. This information is not sufficient to model the spectral information, for each of the classes a spectral trajectory embodiment is selected from several so-called synthesis units. These units are extracted from the learning base during its segmentation using the HMM models. It is possible to take into account the context for example by using several subclasses to take into account the transitions from one class to another. A first index indicates the class to which belongs the segment considered, a second index specifies the subclass to which it belongs as being the class index of the preceding segment. The subclass index is therefore not to be transmitted, and the class index must be memorized for the next segment. The subclasses thus defined make it possible to take into account the different transitions to the class associated with the segment in question. To the spectral information is added the information of prosody, that is to say the value of the parameters of pitch and energy and their evolutions.

Dans l'optique de réaliser un codeur très bas débit, il est nécessaire d'optimiser l'allocation des bits et donc du débit entre les paramètres associés à l'enveloppe spectrale et à l'information de prosodie. Une méthode classique consiste dans un premier temps à sélectionner l'unité la plus proche d'un point de vue spectral puis, une fois l'unité sélectionnée, à coder l'information de prosodie, soit de façon indépendante de l'unité sélectionnée.In order to achieve a very low bit rate encoder, it is necessary to optimize the bit allocation and therefore the bit rate between the parameters associated with the spectral envelope and the prosody information. A conventional method consists first of all in selecting the closest unit from a spectral point of view and, once the unit has been selected, in coding the prosody information, either independently of the selected unit.

Le document intitulé « codage de la parole à très bas débit par indexation d'unités de taille variable » septembre 2003, rencontre des jeunes chercheurs en parole décrit un procédé dans lequel l'idée principale consiste à segmenter le signal de parole aux frontières de ce segment pour constituer des unités de synthèse. Une fois classées et regroupées dans une base de données, le signal de parole peut être caractérisé par une suite d'index faisant référence aux unités de la base. La base de données correspond aux données d'apprentissage. Afin de caractériser les variations d'intonation, la prosodie est extraite du signal à coder et est transmise avec les indexThe document entitled "Very Low Rate Speech Coding by Variable Unit Indexing", September 2003, Young Spoken Word Interview describes a process in which the main idea is to segment the speech signal at the boundaries of this segment to constitute synthesis units. Once classified and grouped in a database, the speech signal can be characterized by a series of indexes referring to the units of the database. The database corresponds to the training data. In order to characterize variations of intonation, the prosody is extracted from the signal to be coded and is transmitted with the indexes

Le document « Codage de la parole à très bas débit par indexation d'unités de taille variable" C.Baverel, P.Gournay, F.Capman, G.Chollet IST OTAN, 2001 concerne une démonstration de concept de la concaténation d'unités.The document "Coding of very low bit rate speech by indexing units of variable size" C.Baverel, P.Gournay, F.Capman, G.Chollet IST NATO, 2001 concerns a proof of concept of the concatenation of units .

Le procédé selon la présente invention propose une nouvelle méthode de sélection de l'unité de synthèse la plus proche conjointement à la modélisation et à la quantification des informations supplémentaires nécessaires au niveau du décodeur pour la restitution du signal de parole.The method according to the present invention proposes a new method of selecting the closest synthesis unit together with the modeling and quantization of the additional information required at the decoder for the reproduction of the speech signal.

L'invention concerne un procédé de sélection d'unités de synthèse d'une information d'une information se présentant sous la forme d'un segment de parole à coder et pouvant être décomposée en unités de synthèse. II comporte au moins les étapes suivantes :

pour un segment d'information considéré :
- déterminer la valeur F0 de la fréquence fondamentale moyenne pour le segment d'information considéré,
- sélectionner un sous-ensemble d'unités de synthèse défini comme étant celui dont les valeurs moyennes de pitch sont les plus proches de la valeur de pitch F0,
- appliquer un ou plusieurs critères de proximité aux unités de synthèse sélectionnées pour déterminer une unité de synthèse représentative du segment d'information.

The invention relates to a method for selecting synthesis units of information information in the form of a speech segment to be encoded and can be decomposed into synthesis units. It comprises at least the following steps:

for a segment of information considered:
- determine the value F0 of the average fundamental frequency for the segment of information considered,
- selecting a subset of synthesis units defined as being the one whose average pitch values are closest to the pitch value F0,
- applying one or more proximity criteria to the selected synthesis units to determine a summary unit representative of the information segment.

Selon une variante on utilise comme critères de proximité la fréquence fondamentale ou pitch, ou la distorsion spectrale, et/ou le profil d'énergie et on exécute une étape de fusion des critères utilisés afin de déterminer l'unité de synthèse représentative.According to one variant, the fundamental frequency or pitch, or the spectral distortion, and / or the profile are used as proximity criteria. of energy and performs a step of merging the criteria used to determine the representative synthesis unit.

Le procédé comporte par exemple une étape de codage et/ou une étape de correction du pitch par modification du profil de synthèse.The method comprises for example a coding step and / or a pitch correction step by modifying the synthesis profile.

L'étape de codage et/ou correction du pitch peut être une transformation linéaire du profil du pitch d'origine.The step of encoding and / or correcting the pitch may be a linear transformation of the pitch profile of origin.

Le procédé est par exemple utilisé pour la sélection et/ou le codage d'unités de synthèse pour un codeur de parole très bas débit.The method is for example used for the selection and / or coding of synthesis units for a very low bit rate speech coder.

L'invention présente notamment les avantages suivants :

le procédé permet d'optimiser le débit alloué à l'information de prosodie dans le domaine de la parole.
il permet de conserver, lors de la phase de codage, l'intégralité des unités de synthèse déterminées lors de la phase d'apprentissage avec cependant un nombre de bits constant pour coder l'unité de synthèse.
Dans un schéma de codage indépendant du locuteur, ce procédé offre la possibilité de couvrir l'ensemble des valeurs de pitch possibles (ou fréquences fondamentales) et de sélectionner l'unité de synthèse en tenant compte en partie des caractéristiques du locuteur.
La sélection peut s'appliquer à tout système basé sur une sélection d'unités et donc aussi à un système de synthèse à partir du texte.

The invention particularly has the following advantages:

the method optimizes the bit rate allocated to prosody information in the speech domain.
it makes it possible to retain, during the coding phase, the entirety of the synthesis units determined during the learning phase, with however a constant number of bits for coding the synthesis unit.
In a speaker-independent coding scheme, this method offers the possibility of covering all of the possible pitch values (or fundamental frequencies) and of selecting the synthesis unit taking into account in part the characteristics of the speaker.
The selection can be applied to any system based on a selection of units and therefore also to a system of synthesis from the text.

D'autres caractéristiques et avantages de l'invention apparaîtront mieux à la lecture de la description qui suit d'un exemple de réalisation non limitatif annexé des figures qui représentent :

La figure 1 un schéma de principe de sélection de l'unité de synthèse associée au segment d'information à coder,
La figure 2 un schéma de principe d'estimation des critères de similarité pour le profil du pitch,
La figure 3 un schéma de principe d'estimation des critères de similarité pour le profil énergétique,
La figure 4 un schéma de principe d'estimation des critères de similarité pour l'enveloppe spectrale,
La figure 5 un schéma de principe du codage du pitch par correction du profil de pitch de synthèse.

Other features and advantages of the invention will appear better on reading the following description of an exemplary non-limiting embodiment of the appended figures which represent:

The figure 1 a block diagram for selecting the synthesis unit associated with the information segment to be coded,
The figure 2 a schematic diagram of estimation of the similarity criteria for the pitch profile,
The figure 3 a scheme of principle for estimating similarity criteria for the energy profile,
The figure 4 a scheme of principle for estimating the similarity criteria for the spectral envelope,
The figure 5 a basic diagram of pitch coding by correction of the synthesis pitch profile.

Afin de mieux faire comprendre l'idée mise en oeuvre dans la présente l'invention, l'exemple qui suit est donné à titre illustratif et nullement limitatif pour un procédé mis en oeuvre dans un vocodeur, en particulier la sélection et le codage d'unités de synthèse pour un codeur de parole très bas débit.In order to better understand the idea implemented in the present invention, the following example is given by way of non-limiting illustration for a method implemented in a vocoder, in particular the selection and coding of synthesis units for a very low bit rate speech coder.

Pour rappel, au niveau d'un vocodeur, le signal de parole est analysé trame à trame afin d'extraire les paramètres caractéristiques (paramètres spectraux, pitch, énergie). Cette analyse se fait classiquement à l'aide d'une fenêtre glissante définie sur l'horizon de la trame. Cette trame a une durée de l'ordre de 20 ms, et la mise à jour se fait avec un décalage de la fenêtre d'analyse de l'ordre de 10ms à 20 ms.As a reminder, at the level of a vocoder, the speech signal is analyzed frame to frame in order to extract the characteristic parameters (spectral parameters, pitch, energy). This analysis is done conventionally using a sliding window defined on the horizon of the frame. This frame has a duration of the order of 20 ms, and the update is done with an offset of the analysis window of the order of 10 ms to 20 ms.

Lors d'une phase d'apprentissage, un ensemble de modèles de Markov cachés (HMM, Hidden Markov Model) sont appris. Ils permettent de modéliser des segments de parole (ensemble de trames successives) pouvant être associés à des phonèmes si la phase d'apprentissage est supervisée (segmentation et transcription phonétique disponibles) ou à des sons spectralement stables dans le cas d'une segmentation obtenue de façon automatique. On utilise ici 64 modèles HMM, qui permettent lors de la phase de reconnaissance d'associer à chaque segment l'indice du modèle HMM identifié, et donc la classe à laquelle il appartient. Les modèles HMM servent aussi à l'aide d'un algorithme de type Viterbi à réaliser lors de la phase de codage la segmentation et la classification de chacun des segments (appartenance à une classe). Chaque segment est donc identifié par un indice compris entre 1 et 64 qui est transmis au décodeur.During a learning phase, a set of Hidden Markov Models (HMM) are learned. They make it possible to model speech segments (set of successive frames) that can be associated with phonemes if the learning phase is supervised (available phonetic segmentation and transcription) or spectrally stable sounds in the case of a segmentation obtained from Automatic way. We use here 64 HMM models, which allow during the recognition phase to associate to each segment the index of the identified HMM model, and therefore the class to which it belongs. The HMM models are also used with the help of a Viterbi algorithm to perform the segmentation and classification of each segment (class membership) during the coding phase. Each segment is therefore identified by an index between 1 and 64 which is transmitted to the decoder.

Le décodeur utilise cet indice pour retrouver l'unité de synthèse dans le dictionnaire construit lors de la phase d'apprentissage. Les unités de synthèse qui constituent le dictionnaire sont simplement les séquences de paramètres associés aux segments obtenus sur le corpus d'apprentissage.The decoder uses this index to find the synthesis unit in the dictionary built during the learning phase. The synthesis units that make up the dictionary are simply the sequences of parameters associated with the segments obtained on the training corpus.

Une classe du dictionnaire contient l'ensemble des unités associées à un même modèle HMM. Chaque unité de synthèse est donc caractérisée par une séquence de paramètres spectraux, une séquence de valeur de pitch (profil de pitch), une séquence de gains (profil énergétique).A dictionary class contains all the units associated with the same HMM model. Each synthesis unit is therefore characterized by a sequence of spectral parameters, a pitch value sequence (pitch profile), a gain sequence (energy profile).

Afin d'améliorer la qualité de la synthèse, chaque classe (de 1 à 64) du dictionnaire est subdivisée en 64 sous-classes, où chaque sous-classe contient les unités de synthèse qui sont précédées temporellement par un segment appartenant à une même classe. Cette approche permet de tenir compte du contexte passé, et donc d'améliorer la restitution des zones transitoires d'une unité vers l'autre.In order to improve the quality of the synthesis, each class (from 1 to 64) of the dictionary is subdivided into 64 subclasses, where each subclass contains the synthesis units that are preceded temporally by a segment belonging to the same class . This approach allows take into account the past context, and thus improve the return of transitional zones from one unit to another.

La présente invention concerne notamment un procédé de sélection d'une unité de synthèse multicritères. Le procédé permet par exemple de tenir compte simultanément du pitch, de la distorsion spectrale, et des profils d'évolution du pitch et de l'énergie.The present invention relates in particular to a method for selecting a multicriterion synthesis unit. The method makes it possible, for example, to simultaneously take into account the pitch, the spectral distortion, and the pitch and energy evolution profiles.

Le procédé de sélection pour un segment de parole à coder comporte par exemple les étapes de sélection schématisées à la figure 1 :

1) Extraire le pitch moyen F₀ (fréquence fondamentale moyenne) sur le segment à coder composé de plusieurs trames. Le pitch est par exemple calculé pour chaque trame T, les erreurs de pitch sont corrigées en tenant compte de l'ensemble du segment afin d'éliminer les erreurs de détection voisé/non voisé, et le pitch moyen est calculé sur l'ensemble des trames voisées du segment.
Il est possible de représenter le pitch sur 5 bits, en utilisant par exemple un quantificateur non uniforme (compression logarithmique) appliqué à la période de pitch.
La valeur du pitch de référence est par exemple obtenue à partir d'un générateur de prosodie dans le cas d'une application en synthèse.
2) la valeur de pitch moyen F₀ étant ainsi quantifiée, sélectionner un sous-ensemble d'unités de synthèse SE dans la sous-classe considérée. Le sous-ensemble est défini comme étant celui dont les valeurs moyennes de pitch sont les plus proches de la valeur de pitch F₀.
Dans la configuration précédente cela conduit à retenir de façon systématique les 32 unités les plus proches selon le critère du pitch moyen. Il est donc possible de retrouver ces unités au niveau du décodeur à partir du pitch moyen transmis.
3) Parmi les unités de synthèse ainsi sélectionnées, appliquer un ou plusieurs critères de proximité ou de similarité, par exemple le critère de distorsion spectrale, et/ou le critère de profil d'énergie et/ou le critère de pitch pour déterminer l'unité de synthèse.
Lorsque l'on utilise plusieurs critères, une étape de fusion 3b) est réalisée pour prendre la décision. L'étape de fusion des différents critères est réalisée par combinaison linéaire ou non-linéaire. Les paramètres utilisés pour réaliser cette combinaison peuvent être obtenus par exemple sur un corpus d'apprentissage en minimisant un critère de distorsion spectrale sur le signal re-synthétisé. Ce critère de distorsion peut avantageusement inclure une pondération perceptuelle soit au niveau des paramètres spectraux utilisés, soit au niveau de la mesure de distorsion. Dans le cas d'une loi de pondération non linéaire il est possible d'utiliser un réseau connexionniste (MLP, Multi Layer Perceptron par exemple), de la logique floue, ou une autre technique.
4) Etape de codage du pitch

The selection method for a speech segment to be encoded comprises, for example, the selection steps schematized in FIG. figure 1 :

1) Extract the average pitch F ₀ (average fundamental frequency) on the segment to be encoded composed of several frames. The pitch is for example calculated for each frame T, the pitch errors are corrected taking into account the entire segment in order to eliminate the voiced / unvoiced detection errors, and the average pitch is calculated on all voiced frames of the segment.
It is possible to represent the pitch on 5 bits, using for example a non-uniform quantizer (logarithmic compression) applied to the pitch period.
The value of the reference pitch is for example obtained from a prosody generator in the case of an application in synthesis.
2) the average pitch value F ₀ is thus quantized, select a subset of synthesis units SE in the subclass considered. The subset is defined as the one whose average pitch values are closest to the pitch value F ₀ .
In the previous configuration this leads to systematically retain the 32 closest units according to the average pitch criterion. It is therefore possible to find these units at the decoder from the average pitch transmitted.
3) Among the synthesis units thus selected, apply one or more criteria of proximity or similarity, for example the criterion of spectral distortion, and / or the energy profile criterion and / or the pitch criterion for determining the synthesis unit.
When using several criteria, a merger step 3b) is performed to make the decision. The melting step of the different criteria is performed by linear or non-linear combination. The parameters used to achieve this combination can be obtained for example on a training corpus by minimizing a spectral distortion criterion on the re-synthesized signal. This distortion criterion may advantageously include a perceptual weighting either at the level of the spectral parameters used or at the level of the distortion measurement. In the case of a nonlinear weighting law it is possible to use a connectionist network (MLP, Multi Layer Perceptron for example), fuzzy logic, or another technique.
4) Step coding the pitch

Le procédé peut comporter dans une variante de réalisation une étape de codage de pitch par correction du profil de pitch de synthèse exposée en détail ci-après.The method may comprise, in a variant embodiment, a pitch coding step by correcting the synthesis pitch profile explained in detail below.

Le critère relatif au profil d'évolution du pitch permet en partie de tenir compte de l'information de voisement. Il est cependant possible de le désactiver lorsque le segment est totalement non voisé, ou que la sous-classe sélectionnée est aussi non voisée. En effet, on peut remarquer principalement trois types de sous-classes : les sous-classes contenant majoritairement des unités voisées, celles contenant majoritairement des unités non voisées, et les sous-classes contenant majoritairement des unités mixtes.The criterion relating to the pitch evolution profile partly makes it possible to take account of the voicing information. However, it can be disabled when the segment is completely unvoiced, or the selected subclass is unvoiced. Indeed, we can notice mainly three types of subclasses: the subclasses containing mostly voiced units, those containing mainly unvoiced units, and the subclasses containing mainly mixed units.

Le procédé selon l'invention ne se limite pas à optimiser le débit alloué à l'information de prosodie mais permet aussi de conserver pour la phase de codage l'intégralité des unités de synthèse obtenues lors de la phase d'apprentissage avec un nombre de bits constant pour coder l'unité de synthèse. En effet l'unité de synthèse est caractérisée à la fois par la valeur de pitch et par son indice. Cette approche permet dans un schéma de codage indépendant du locuteur de couvrir l'ensemble des valeurs de pitch possibles et de sélectionner l'unité de synthèse en tenant compte en partie des caractéristiques du locuteur, il existe en effet pour un même locuteur une corrélation entre la plage de variation du pitch et les caractéristiques du conduit vocal (en particulier la longueur).The method according to the invention is not limited to optimizing the bit rate allocated to the prosody information but also makes it possible to keep for the coding phase the entirety of the synthesis units obtained during the learning phase with a number of constant bits to encode the unit of synthesis. Indeed, the synthesis unit is characterized by both the pitch value and its index. This approach makes it possible, in a speaker-independent coding scheme, to cover all the possible pitch values and to select the synthesis unit, taking into account, in part, the characteristics of the speaker. In fact, for the same speaker there is a correlation between the range of variation of the pitch and the characteristics of the vocal tract (in particular the length).

On peut remarquer que le principe de sélection d'unités décrit peut s'appliquer à tout système dont le fonctionnement est basé sur une sélection d'unités et donc aussi à un système de synthèse à partir du texte.It may be noted that the described unit selection principle can be applied to any system whose operation is based on a selection of units and therefore also to a system of synthesis from the text.

La figure 2 schématise un principe d'estimation des critères de similarité pour le profil du pitch.
Le procédé comporte par exemple les étapes suivantes :

A1) sélectionner dans la sous-classe identifiée du dictionnaire des unités de synthèse et à partir de la valeur moyenne du pitch, les N unités les plus proches au sens du critère du pitch moyen. La suite du traitement se fait alors sur les profils de pitch associés à ces N unités. Le pitch est extrait lors de la phase d'apprentissage sur les unités de synthèse, et lors de la phase de codage sur le signal à coder. Les méthodes possibles pour l'extraction du pitch sont nombreuses, cependant les méthodes hybrides, combinant un critère temporel (AMDF, Average Magnitude Difference Function, ou autocorrélation normalisée) et un critère fréquentiel (HPS, Harmonic Power Sum, structure en peigne, ...) sont potentiellement plus robustes.
A2) aligner temporellement les N profils avce celui du segment à coder, par exemple par interpolation linéaire des N profils. Il est possible d'utiliser une technique d'alignement plus optimale basée sur un algorithme de programmation dynamique (DTW ou Dynamic Time Warping). L'algorithme s'applique sur les paramètres spectraux, les autres paramètres pitch, énergie, etc sont alignés de manière synchrone aux paramètres spectraux. Dans ce cas il faut transmettre les informations relatives au chemin d'alignement.
A3) calculer N mesures de similarités, entre les N profils de pitch alignés et le profil de pitch du segment de parole à coder pour obtenir les N coefficients de similarité {rp(1), rp(2), ....rp(N)}. Cette étape peut être réalisée au moyen d'une intercorrélation normalisée.
L'alignement temporel peut être un alignement par ajustement simple des longueurs (interpolation linéaire des paramètres). L'utilisation d'une simple correction des longueurs des unités de synthèse permet notamment de ne pas transmettre d'information relative au chemin d'alignement, le chemin d'alignement étant partiellement pris en compte par les corrélations des profils de pitch et d'énergie.
Dans le cas de segments mixtes (co-existence au sein d'un même segment de trames voisées et non voisées), l'utilisation des trames non voisées pour lesquelles le pitch est arbitrairement positionné à zéro permet de tenir compte dans une certaine mesure de l'évolution du voisement.
La figure 3 schématise le principe d'estimation des critères de similarité pour le profil énergétique.
Le procédé comporte par exemple les étapes suivantes :
A4) extraire les profils d'évolution de l'énergie pour les N unités sélectionnées comme indiqué précédemment, c'est-à-dire selon un critère de proximité du pitch moyen. Selon la technique de synthèse utilisée, le paramètre d'énergie utilisé peut soit correspondre à un gain (associé à un filtre de type LPC par exemple) ou une énergie (l'énergie calculée sur la structure harmonique dans le cas d'une modélisation harmonique/stochastique du signal). Enfin, l'estimation de l'énergie peut avantageusement se faire de manière synchrone du pitch (1 valeur d'énergie par période de pitch). Les profils énergétiques sont pré-calculés pour les unités de synthèse lors de la phase d'apprentissage.
A5) aligner temporellement les N profils avec celui des segments à coder, par exemple par interpolation linéaire, ou par programmation dynamique (alignement non-linéaire) de façon similaire à la méthode mise en oeuvre pour corriger le pitch.
A6) calculer N mesures de similarités, entre les profils des N valeurs d'énergie alignées et le profil d'énergie du segment de parole à coder pour obtenir les N coefficients de similarité {re(1), re(2), ...., re(N)}. Cette étape peut aussi être réalisée au moyen d'une intercorrélation normalisée.
La figure 4 schématise le principe d'estimation des critères de similarité pour l'enveloppe spectrale.
Le procédé comporte les étapes suivantes :
A7) aligner temporellement les N profils,
A8) déterminer les profils d'évolution des paramètres spectraux pour les N unités sélectionnées comme indiqué précédemment, c'est-à-dire selon un critère de proximité du pitch moyen. Il s'agit ici tout simplement de calculer le pitch moyen du segment à coder, et de considérer les unités de synthèse de la sous-classe associée (indice HMM courant pour définir la classe, indice HMM précédent pour définir la sous-classe) qui ont un pitch moyen proche.
A9) calculer N mesures de similarités, entre la séquence spectrale du segment à coder et les N séquences spectrales extraites des unités de synthèse sélectionnées pour obtenir les N coefficients de similarité {rs(1), rs(2), ...., rs(N)}. Cette étape peut être réalisée au moyen d'une intercorrélation normalisée.

The figure 2 schematizes a principle of estimation of similarity criteria for the pitch profile.
The method comprises for example the following steps:

A1) select in the identified subclass of the dictionary of the synthesis units and from the average value of the pitch, the N units closest to the meaning of the average pitch criterion. Further processing is then done on the pitch profiles associated with these N units. The pitch is extracted during the learning phase on the synthesis units, and during the coding phase on the signal to be coded. The possible methods for the extraction of the pitch are numerous, however the hybrid methods, combining a temporal criterion (AMDF, Average Magnitude Difference Function, or standard autocorrelation) and a frequency criterion (HPS, Harmonic Power Sum, comb structure, .. .) are potentially more robust.
A2) align the N profiles temporally with that of the segment to be coded, for example by linear interpolation of the N profiles. It is possible to use a more optimal alignment technique based on a dynamic programming algorithm (DTW or Dynamic Time Warping). The algorithm applies to the spectral parameters, the other parameters pitch, energy, etc. are aligned synchronously with the spectral parameters. In this case it is necessary to transmit the information relating to the alignment path.
A3) calculating N similarity measures between the N aligned pitch profiles and the pitch profile of the speech segment to be encoded to obtain the N similarity coefficients {rp (1), rp (2), .... rp ( NOT)}. This step can be performed by means of standardized intercorrelation.
The time alignment can be a simple adjustment of the lengths (linear interpolation of the parameters). The use of a simple correction of the lengths of the synthesis units notably makes it possible not to transmit information relating to the alignment path, the alignment path being partially taken into account by the correlations of the pitch profiles and energy.
In the case of mixed segments (coexistence within the same segment of voiced and unvoiced frames), the use of unvoiced frames for which the pitch is arbitrarily set to zero makes it possible to take account to a certain extent of the evolution of voicing.
The figure 3 schematizes the principle of estimating similarity criteria for the energy profile.
The method comprises for example the following steps:
A4) extract the evolution profiles of the energy for the N units selected as indicated above, that is to say according to a criterion of proximity of the average pitch. According to the synthesis technique used, the energy parameter used can either correspond to a gain (associated with an LPC type filter for example) or an energy (the energy calculated on the harmonic structure in the case of harmonic modeling / stochastic signal). Finally, the energy estimate can advantageously be synchronously pitch (1 energy value per pitch period). The energy profiles are pre-calculated for the synthesis units during the learning phase.
A5) temporally align the N profiles with that of the segments to be encoded, for example by linear interpolation, or by dynamic programming (nonlinear alignment) similarly to the method implemented to correct the pitch.
A6) calculating N similarity measures between the profiles of the N aligned energy values and the energy profile of the speech segment to be coded to obtain the N similarity coefficients {re (1), re (2), .. .., re (N)}. This step can also be performed by means of standardized cross correlation.
The figure 4 schematizes the principle of estimating similarity criteria for the spectral envelope.
The method comprises the following steps:
A7) temporally align the N profiles,
A8) determine the evolution profiles of the spectral parameters for the N units selected as indicated above, that is to say according to a criterion of proximity of the average pitch. It is simply a matter of calculating the average pitch of the segment to be coded, and of considering the synthesis units of the associated subclass (current HMM index to define the class, previous HMM index to define the subclass) which have a near average pitch.
A9) calculating N similarity measures, between the spectral sequence of the segment to be coded and the N spectral sequences extracted from the selected synthesis units to obtain the N coefficients of similarity {rs (1), rs (2), ...., rs (N)}. This step can be performed by means of standardized intercorrelation.

La mesure de similarité peut être une distance spectrale.The similarity measure may be a spectral distance.

L'étape A9) comprend par exemple une étape où l'on moyenne l'ensemble des spectres d'un même segment et la mesure de similarité est une mesure d'intercorrélation.Step A9) comprises, for example, a step where all the spectra of the same segment are averaged and the similarity measure is a cross-correlation measurement.

Le critère de distorsion spectrale est par exemple calculé sur des structures harmoniques ré-échantillonnées à pitch constant ou ré-échantillonnées au pitch du segment à coder, après interpolation des structures harmoniques initiales.The spectral distortion criterion is for example calculated on harmonic structures resampled to constant pitch or resampled to the pitch of the segment to be coded, after interpolation of the initial harmonic structures.

Le critère de similarité va dépendre des paramètres spectraux utilisés (par exemple du type de paramètres utilisés pour la représentation de l'enveloppe). Plusieurs types de paramètres spectraux peuvent être utilisés, dans la mesure où ils permettent de définir une mesure de distorsion spectrale. Dans le domaine du codage de la parole, il est courant d'utiliser les paramètres LSP ou LSF (LSP, Line Spectral Pair, LSF, Line Spectral Frequencies) dérivés d'une analyse par prédiction linéaire. Dans le domaine de la reconnaissance vocale, les paramètres cepstraux sont généralement utilisés, et ils peuvent soit être dérivés d'une analyse par prédiction linéaire (LPCC, Linear Prediction Cepstrum Coefficients) ou estimés à partir d'un banc de filtres souvent sur une échelle perceptuelle de type Mel ou Bark (MFCC, Mel Frequency Cepstrum Coefficients). Il est aussi possible dans la mesure où on utilise une modélisation sinusoïdale de la composante harmonique du signal de parole, d'utiliser directement les amplitudes des fréquences harmoniques. Ces derniers paramètres étant estimés en fonction du pitch ne peuvent être utilisés directement pour calculer une distance. Le nombre de coefficients obtenus est en effet variable en fonction du pitch, contrairement aux paramètres LPCC, MFCC ou LSF. Un pré-traitement consiste alors à estimer une enveloppe spectrale à partir des amplitudes harmoniques (interpolation linéaire ou polynomiale de type spline) et à réechantillonner l'enveloppe ainsi obtenue, soit en utilisant la fréquence fondamentale du segment à coder, soit en utilisant une fréquence fondamentale constante (100 Hz par exemple). Une fréquence fondamentale constante permet de pré-calculer l'ensemble des structures harmoniques des unités de synthèse lors de la phase d'apprentissage. Le re-échantillonnage se fait alors uniquement sur le segment à coder. D'autre part, si on se limite à un alignement temporel par interpolation linéaire, il est possible de moyenner les structures harmoniques sur l'ensemble des segments considérés. La mesure de similarité peut alors être estimée simplement à partir de la structure harmonique moyenne du segment à coder, et celle de l'unité de synthèse considérée. Cette mesure de similarité peut aussi être une mesure d'intercorrélation normalisée. On peut aussi noter que la procédure de ré-échantillonnage peut s'effectuer sur une échelle perceptuelle des fréquences (Mel ou Bark).The similarity criterion will depend on the spectral parameters used (for example the type of parameters used for the representation of the envelope). Several types of spectral parameters can be used, as they allow to define a distortion measure spectral. In the field of speech coding, it is common to use the LSP or LSF parameters (LSP, Line Spectral Pair, LSF, Line Spectral Frequencies) derived from a linear prediction analysis. In the field of speech recognition, cepstral parameters are generally used, and they can either be derived from a Linear Prediction Cepstrum Coefficients (LPCC) or estimated from a filter bank often on a scale Mel or Bark perceptual (MFCC, Mel Frequency Cepstrum Coefficients). It is also possible, since we use a sinusoidal modeling of the harmonic component of the speech signal, to directly use the amplitudes of the harmonic frequencies. These last parameters being estimated according to the pitch can not be used directly to calculate a distance. The number of coefficients obtained is indeed variable according to the pitch, unlike the parameters LPCC, MFCC or LSF. A pretreatment then consists in estimating a spectral envelope from the harmonic amplitudes (linear interpolation or polynomial spline type) and resampling the envelope thus obtained, either by using the fundamental frequency of the segment to be coded, or by using a frequency fundamental constant (100 Hz for example). A constant fundamental frequency makes it possible to pre-calculate all the harmonic structures of the synthesis units during the learning phase. The re-sampling is then done only on the segment to be coded. On the other hand, if one limits oneself to a temporal alignment by linear interpolation, it is possible to average the harmonic structures on all the segments considered. The similarity measure can then be estimated simply from the average harmonic structure of the segment to be encoded, and that of the synthesis unit considered. This similarity measure can also be a standardized cross-correlation measure. It can also be noted that the resampling procedure can be performed on a perceptual scale of frequencies (Mel or Bark).

Pour la procédure d'alignement temporel il est possible d'utiliser soit un algorithme de programmation dynamique (DTW, Dynamic Time Warping), soit d'effectuer une interpolation linéaire simple (ajustement linéaire des longueurs). Dans l'hypothèse où l'on ne souhaite pas transmettre d'information supplémentaire relative au chemin d'alignement, il est préférable d'utiliser une simple interpolation linéaire des paramètres. La prise en compte du meilleur alignement est alors en partie réaliser par la procédure de sélection.
Codage du pitch par modification du profil de synthèseFor the time alignment procedure it is possible to use either a Dynamic Time Warping (DTW) algorithm or to perform a simple linear interpolation (linear length adjustment). In the event that it is not desired to transmit additional information relating to the alignment path, it is preferable to use a simple linear interpolation of the parameters. Taking into account the best alignment is then partly achieved by the selection procedure.
Pitch coding by modifying the synthesis profile

Selon un mode de réalisation, le procédé comporte une étape de codage du pitch par modification du profil de synthèse. Cela consiste à resynthétiser un profil de pitch à partir de celui de l'unité de synthèse sélectionnée et un gain linéairement variable sur la durée du segment à coder. Il suffit alors de transmettre une valeur supplémentaire pour caractériser le gain correcteur sur l'ensemble du segment.According to one embodiment, the method comprises a pitch coding step by modifying the synthesis profile. This consists of resynthesizing a pitch profile from that of the selected synthesis unit and a linearly variable gain over the duration of the segment to be encoded. It is then sufficient to transmit an additional value to characterize the correction gain over the entire segment.

Le pitch reconstruit au niveau du décodeur est donné par l'équation suivante : ${\hat{f}}_{0} (n) = g (n) . f_{OS} (n) = (a . n + b) . f_{OS} (n)$

où f_0S (n) est le pitch à la trame d'indice n de l'unité de synthèse.
Cela correspond à une transformation linéaire du profil du pitch.
Les valeurs optimales de a et b sont estimées au niveau du codeur en minimisant l'erreur quadratique moyenne :

\sum_{n} e_{0}^{2} (n) = \sum_{n} {[f_{0} (n) - {\hat{f}}_{0} (n)]}^{2}

ce qui conduit aux relations suivantes :

a = \frac{(S_{4} . S_{2} - S_{5} . S_{1})}{(S_{2} . S_{2} - S_{3} . S_{1})}

et

b = \frac{(S_{5} . S_{2} - S_{4} . S_{3})}{(S_{2} . S_{2} - S_{3} . S_{1})}

où

S_{1} = \sum_{n} f_{0} (n) . f_{0 S} (n)

S_{2} = \sum_{n} n . f_{0 S} (n) . f_{0 S} (n)

S_{3} = \sum_{n} n^{2} . f_{OS} (n) . f_{OS} (n)

S_{4} = \sum_{n} f_{0} (n) . f_{0 S} (n)

S_{5} = \sum_{n} n . f_{0 S} (n) . f_{0 S} (n)

Le coefficient a, ainsi que la valeur moyenne du pitch modélisé sont quantifiés et transmis :

a_{q} = Q [a]

f_{0 q} = Q [\frac{\sum_{n} (a . n + b) . f_{0 S} (n)}{N}]

La valeur du coefficient b est obtenu au niveau du décodeur à partir de la relation suivante :

b_{q} = \frac{f_{0_{q}} - \frac{Σ a_{q} . n . f_{0 S} (n)}{N}}{〈 f_{0 S} 〉}

où 〈f_0S〉 est le pitch moyen de l'unité de synthèse.
Remarque : cette méthode de correction peut bien entendu s'appliquer au profil énergétique.The pitch reconstructed at the level of the decoder is given by the following equation:

{\hat{f}}_{0} (not) = boy Wut (not) . f_{BONE} (not) = (at . not + b) . f_{BONE} (not)

where f _0S ( n ) is the pitch at the frame of index n of the synthesis unit.
This corresponds to a linear transformation of the pitch profile.
The optimal values of a and b are estimated at the encoder level by minimizing the mean squared error:

\underset{not}{Σ} e_{0}^{2} (not) = \underset{not}{Σ} {[f_{0} (not) - {\hat{f}}_{0} (not)]}^{2}

which leads to the following relationships:

at = \frac{(S_{4} . S_{2} - S_{5} . S_{1})}{(S_{2} . S_{2} - S_{3} . S_{1})}

and

b = \frac{(S_{5} . S_{2} - S_{4} . S_{3})}{(S_{2} . S_{2} - S_{3} . S_{1})}

or

S_{1} = \underset{not}{Σ} f_{0} (not) . f_{0 S} (not)

S_{2} = \underset{not}{Σ} not . f_{0 S} (not) . f_{0 S} (not)

S_{3} = \underset{not}{Σ} {not}^{2} . f_{BONE} (not) . f_{BONE} (not)

S_{4} = \underset{not}{Σ} f_{0} (not) . f_{0 S} (not)

S_{5} = \underset{not}{Σ} not . f_{0 S} (not) . f_{0 S} (not)

The coefficient a, as well as the average value of the modeled pitch are quantized and transmitted:

{at}_{q} = Q [at]

f_{0 q} = Q [\frac{\underset{not}{Σ} (at . not + b) . f_{0 S} (not)}{NOT}]

The value of the coefficient b is obtained at the level of the decoder from the following relation:

b_{q} = \frac{f_{0_{q}} - \frac{Σ {at}_{q} . not . f_{0 S} (not)}{NOT}}{< f_{0 S} >}

where <f _0S > is the average pitch of the synthesis unit.
Note: This correction method can of course be applied to the energy profile.

Flow example associated with the encoding scheme

Les informations relatives au débit associé au schéma de codage décrit précédemment sont les suivantes :

Indice de classe sur 6 bits (64 classes)
Indice de l'unité sélectionnée sur 5 bits (32 unités par sous-classe)
Longueur du segment sur 4 bits (de 3 à 18 trames)

The information relating to the bit rate associated with the coding scheme previously described is as follows:

6-bit class index (64 classes)
Index of the unit selected on 5 bits (32 units per subclass)
Segment length on 4 bits (from 3 to 18 frames)

Le nombre moyen de segments par seconde se situe entre 15 et 20; ce qui conduit à un débit de base situé entre 225 et 300 bits/sec pour la configuration précédente. A ce débit de base vient s'ajouter le débit nécessaire pour représenter l'information de pitch et d'énergie.

FO moyen sur 5 bits
Coefficient correcteur du profil de pitch sur 5bits
Gain correcteur sur 5 bits

The average number of segments per second is between 15 and 20; which leads to a base rate of between 225 and 300 bits / sec for the previous configuration. At this base rate is added the flow required to represent the information of pitch and energy.

Mean FO on 5 bits
Corrective coefficient of the 5bits pitch profile
5-bit correction gain

Le débit associé à la prosodie se situe alors entre 225 et 300 bits/sec, ce qui conduit à un débit global entre 450 et 600 bits/sec.The rate associated with the prosody is then between 225 and 300 bits / sec, which leads to an overall rate between 450 and 600 bits / sec.

References

[1] G. Baudoin, F. El Chami, "Corpus based very low bit rate speech coder", Proc. Conf. IEEE ICASSP 2003, Hong Kong, 2003 .
[2] G. Baudoin, J. Cernocky, P. Gournay, G. Chollet, "Coding of Low and Very Low Flow Speech", Annals of Telecommunications, Vol. 55, No. 9-10 Pages 421-456, Nov. 2000 .
[3] G. Baudoin, F. Capman, J. Cernocky, F. El-chami, M. Charbit, G. Chollet, D. Petrovska-Delacretaz. "Advances in Very Low Rate Speech Coding using Recognition and Synthesis Techniques ", TSD '2002, pp. 269-276, Brno, Czech Republic, Sept 2002 .
[4] K.Lee, R.Cox, 'A segmental coder based on concatenative TTS', in Speech Communications, Vol 38, pp 89-100, 2002 .
[5] K.Lee, R.Cox, "A very low bit rate speech coder based on a recognition / synthesis paradigm", in IEEE on ASSP, Vol; 9, pp 482-491, July 2001 .

Claims

Method for selecting synthesis units of an item of information taking the form of a speech segment to be coded and able to be decomposed into synthesis units, characterized in that it comprises at least the following steps:
for an information segment considered:
• determining the value F0 of the mean fundamental frequency for the information segment considered,

• selecting a subset of synthesis units defined as being the subset whose mean pitch values are closest to the pitch value F0,

• applying one or more proximity criteria to the selected synthesis units so as to determine a synthesis unit representative of the information segment.
Method for selecting synthesis units according to Claim 1, characterized in that the fundamental frequency or pitch, the spectral distortion, and/or the energy profile are used as proximity criteria and a step of merging the criteria is executed so as to determine the representative synthesis unit.
Method for selecting units according to Claim 1, characterized in that for a speech segment to be coded the reference pitch is obtained on the basis of a prosody generator.
Method according to Claim 2, characterized in that the estimation of the similarity criterion for the profile of the pitch comprises at least the following steps:
A1) selecting from the identified sub-class of the dictionary of the synthesis units and on the basis of the mean value of the pitch, the N closest units within the sense of the mean pitch criterion,

A2) temporally aligning the N profiles with that of the segment to be coded,

A3) calculating N measures of similarity, between the N aligned pitch profiles and the profile of the pitch of the speech segment to be coded so as to obtain the N similarity coefficients {rp(1), rp(2),....rp(N)}.
Method according to Claim 2, characterized in that the similarity estimation for the energy profile comprises at least the following steps:
A4) determining the evolution profiles of the energy for the N units selected according to a proximity of the mean pitch criterion,

A5) temporally aligning the N profiles with that of the segment to be coded,

A6) calculating N measures of similarity, between the N aligned energy profiles and the energy profile of the speech segment to be coded so as to obtain the N similarity coefficients {re(1), re(2), ...., re(N)}.
Method according to Claim 2, characterized in that the estimation of the similarity criteria for the spectral envelope comprises at least the following steps:
A7) temporally aligning the N profiles with that of the segment to be coded,

A8) determining the evolution profiles of the spectral parameters for the N units selected according to a proximity of the mean pitch criterion,

A9) calculating N measures of the similarities, between the spectral sequence of the segment to be coded and the corresponding N spectral sequences extracted from the speech segment to be coded so as to obtain the N similarity coefficients {rs(1), rs(2), ...., rs(N)}.
Method according to one of Claims 4, 5 and 6, characterized in that the temporal alignment is a temporal alignment obtained by dynamic programming (DTW) or an alignment by linear adjustment of the lengths.
Method according to one of Claims 4, 5 and 6, characterized in that the similarity measure is a normalized intercorrelation measure.
Method according to Claim 6, characterized in that the similarity measure is a spectral distance measure.
Method according to Claim 6, characterized in that step A9) comprises a step where the set of the spectra of one and the same segment is averaged and in that the similarity measure is an intercorrelation measure.
Method according to Claim 6, characterized in that the spectral distortion criterion is calculated on harmonic structures re-sampled at constant pitch or re-sampled at the pitch of the segment to be coded, after interpolation of the initial harmonic structures.
Method according to one of Claims 1 to 11, characterized in that it comprises a step of coding and/or a step of correcting the pitch by modification of the synthesis profile.
Method according to Claim 12, characterized in that the step of coding and/or correcting the pitch is a linear transformation of the profile of the original pitch.
Use of the method according to one of Claims 1 to 12 for the selection and/or the coding of synthesis units for a very low bit rate speech coder.