JP2006333224A

JP2006333224A - Voice signal packet communication method, multipoint mixing method, voice signal packet receiving method, and system and apparatus using them

Info

Publication number: JP2006333224A
Application number: JP2005155892A
Authority: JP
Inventors: Naka Omuro; 仲大室; Yuusuke Hiwazaki; 祐介日和▲崎▼; Takeshi Mori; 岳至森; Akitoshi Kataoka; 章俊片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-05-27
Filing date: 2005-05-27
Publication date: 2006-12-07
Anticipated expiration: 2025-05-27
Also published as: JP4403103B2

Abstract

[Problem] When packet loss concealment is performed in the multipoint mixing unit, the multipoint mixing unit may be subjected to an excessive load.
[Solution] In the present invention, when packet loss occurs on the upstream path to the multipoint mixing unit, the multipoint mixing unit does not perform packet loss concealment. Instead, it embeds a flag indicating that "packet loss occurred upstream" in the voice packet transmitted after mixing and sends the packet to the receiving side. The receiving side performs packet loss concealment both when packet loss occurs on the downstream path from the multipoint mixing unit and when it detects this flag in a received voice packet.
In addition, the packet loss flag is embedded and sent to the receiving side only when the lost packet comes from the transmission source that was the main speaker immediately before the loss occurred.
[Selected drawing] FIG. 6

Description

The present invention relates to a communication method, a multipoint mixing method, and a receiving method that incorporate countermeasures against packet loss when digitized acoustic signals such as voice and music (hereinafter collectively referred to as "voice signals") are transmitted over a packet communication network such as the Internet, and to systems and apparatuses that use these methods.

Services that transmit voice signals using Voice over IP (Internet Protocol) technology are becoming widespread. As shown in FIG. 1, a voice signal from an input terminal 11 is converted into voice packets by a voice signal transmission unit 12 and sent over a packet communication network 13, such as an IP network, to a voice signal receiving unit 14, which decodes the voice packets and outputs the reproduced voice to an output terminal 15. In real-time communication, packets may be lost within the network depending on the state of the communication network 13, and the resulting interruptions in the reproduced voice degrade quality. The problem is particularly pronounced during network congestion on so-called best-effort services such as the Internet, which tolerate packet loss.

For this reason, a technique called packet loss concealment is used when voice signals are carried over a packet communication network. When a packet is lost partway along the communication path, or fails to reach the receiving side within the time limit because of delay on the path (such a packet is hereinafter called a "lost packet"), the receiving side estimates and compensates for the voice signal in the section corresponding to that packet. FIG. 2 shows a typical configuration of the voice signal transmission unit 12 in FIG. 1. Input voice is stored in an input buffer 21, divided into fixed-length segments called frames, and sent to a voice packetization unit 22. The voice packetization unit 22 converts each frame into a voice code using a voice coding method, generates a voice packet, and passes it to a packet sending unit 23, which sends the voice packet to the packet communication network. The length of one frame is typically about 10 to 20 milliseconds. Any coding method, such as ITU-T G.711, may be used for voice coding, and the signal may also be in PCM (pulse code modulation) format.
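The framing and packetization steps of FIG. 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 8 kHz sample rate, the zero-padding of the final frame, the raw PCM payload, and the names `split_into_frames` and `packetize` are all assumptions made for the example.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Cut a sample sequence into fixed-length frames (assumed 20 ms at 8 kHz)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        # Assumption: pad the last, partial frame with silence.
        frame += [0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def packetize(frames):
    """Wrap each frame in a packet carrying its frame number.

    The payload here is the raw PCM frame; a real sender would apply
    a voice codec such as G.711 before packetizing.
    """
    return [{"seq": seq, "payload": frame} for seq, frame in enumerate(frames)]
```

With these defaults each frame holds 160 samples, so 400 input samples produce three packets, the last one half filled with padding.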

FIG. 3 shows a typical configuration of the voice signal receiving unit 14 in FIG. 1. Voice packets received from the packet communication network by a packet receiving unit 31 are stored in a reception buffer 32, also called a jitter buffer. For a frame whose packet was received correctly, the voice packet is taken from the reception buffer and decoded into a voice signal by a voice packet decoding unit 33. For a frame whose packet was lost, a lost signal generation unit 34 performs packet loss concealment to generate a voice signal, and the generated signal is output. When packet loss concealment uses the pitch period (the length on the time axis corresponding to the fundamental frequency of the voice), the output voice signal is stored in an output voice buffer 35, a pitch extraction unit 36 analyzes the pitch, and the resulting pitch period is supplied to the lost signal generation unit 34. The signal generated by the lost signal generation unit 34 is output to the output terminal 15 through a changeover switch 37; when there is no packet loss, the decoded voice signal from the voice packet decoding unit 33 is output to the output terminal 15 through the changeover switch 37. The lost signal generation unit 34, the output voice buffer 35, and the pitch extraction unit 36 together constitute the packet loss concealment.
A communication terminal that performs two-way voice communication has both a transmission unit and a receiving unit.
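As a rough illustration of pitch-based concealment in the arrangement of FIG. 3, the sketch below estimates a pitch period from the output voice buffer by plain autocorrelation and fills the lost frame by repeating the last pitch cycle. It is a simplified stand-in, not the G.711 Appendix I algorithm, and the function names and the lag range of 40 to 120 samples are assumptions.

```python
def estimate_pitch_period(history, min_lag=40, max_lag=120):
    """Return the lag (in samples) whose autocorrelation over `history` is largest."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(history[i] * history[i - lag] for i in range(lag, len(history)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def conceal_frame(history, frame_len, pitch_period):
    """Generate a substitute frame by repeating the most recent pitch cycle."""
    last_cycle = history[-pitch_period:]
    return [last_cycle[i % pitch_period] for i in range(frame_len)]
```

A practical concealer would also smooth the joins between real and generated audio and attenuate the output during long bursts of loss; those refinements are omitted here.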

Typical packet loss concealment methods include those described in Non-Patent Document 1 and Non-Patent Document 2. The method of Non-Patent Document 1 is well known and uses the pitch period of the voice for packet loss concealment.
Packet loss concealment is an effective way to make degradation of the reproduced voice less perceptible on the receiving side when packet loss occurs. In the arrangement of FIG. 1, for example, the function may be implemented in the voice signal receiving unit alone, or the voice signal transmission unit may attach redundant information in advance so that the voice signal receiving unit can apply processing that degrades the reproduced voice less. Implementing packet loss concealment in the arrangement of FIG. 1 generally poses no problem in terms of processing load.

Next, multipoint communication among a plurality of voice communication devices is described. In multipoint communication, a multipoint mixing unit 100 is typically provided as shown in FIG. 4; it combines the signals transmitted from the voice communication devices 200A to 200D and sends the result to each of them. FIG. 5 shows the flow when the voice signal receiving unit 400D of voice communication device 200D receives voice packets sent from the voice signal transmission units 300A to 300C of voice communication devices 200A to 200C. The voice signal transmission units 300A, 300B, and 300C each send voice packets over the IP communication network 13 to the multipoint mixing unit 100 (also called a multipoint mixing server). The multipoint mixing unit 100 decodes voice packet A, voice packet B, and voice packet C, mixes them (adds the decoded voice), re-encodes the result, and sends it as voice packet M over the IP communication network 13 to the voice signal receiving unit 400D. In this arrangement, packet loss can occur between the voice signal transmission unit 300A, 300B, or 300C and the multipoint mixing unit 100 (hereinafter "packet loss on the upstream path to the multipoint mixing unit"). Packet loss can also occur between the multipoint mixing unit 100 and the voice signal receiving unit 400D (hereinafter "packet loss on the downstream path from the multipoint mixing unit"). Conventionally, a packet loss concealment function was implemented in the multipoint mixing unit 100 to deal with upstream loss, and in the voice signal receiving unit 400 to deal with downstream loss. FIG. 5 shows an example with three transmitting points, 300A to 300C, but multipoint voice communication may involve four or more points.
Non-Patent Document 1: ITU-T Recommendation G.711 Appendix I, "A high quality low-complexity algorithm for packet loss concealment with G.711", pp. 1-18, 1999.
Non-Patent Document 2: N. Omuro et al., "Improvement of burst packet loss tolerance by parallel transmission of voice features", IEICE Technical Report (Institute of Electronics, Information and Communication Engineers), SP2004-77, pp. 35-40, 2004.

In the conventional method, where the multipoint mixing unit (server) performs packet loss concealment when packet loss occurs on the upstream path, the concealment load on the multipoint mixing server increases when, for example, packets from several points are lost at the same time. Taking the packet loss concealment load for the one-to-one communication of FIG. 1 as 1, if packets from N points are lost simultaneously on the upstream path, the load on the server is N. The more points there are in a multipoint connection, and the more multipoint connection groups (multipoint conferences) a single server handles, the more likely the multipoint mixing unit (server) is to be excessively loaded.

In the present invention, when packet loss occurs on the upstream path to the multipoint mixing unit, the multipoint mixing unit (server) does not perform packet loss concealment. Instead, it embeds a flag indicating that "packet loss occurred upstream" (hereinafter the "packet loss flag") in the voice packet transmitted after mixing and sends it to the receiving side. The receiving side performs packet loss concealment when packet loss occurs on the downstream path from the multipoint mixing unit, and also when it detects the packet loss flag in a received voice packet, in which case it treats the packet as lost even though it was actually received.

In addition, the packet loss flag is embedded and sent to the receiving side only when the lost packet comes from the transmission source that was the main speaker immediately before the loss occurred.

According to the present invention, even when packet loss occurs on the upstream path to the multipoint mixing unit, the receiving side can reproduce voice with little perceptible degradation without increasing the load on the multipoint mixing unit (server). Moreover, when a packet from a source that is not the main speaker is lost, the receiving side does not perform packet loss concealment, which eliminates processing that has little effect on call quality.

The present invention can be implemented as a computer program running on a computer, or implemented in a digital signal processor or a dedicated LSI.
[First Embodiment]
FIG. 6 shows a configuration example of the multipoint mixing unit of the present invention, and FIG. 7 shows its processing flow. The multipoint mixing unit 100 comprises a packet receiving unit 31 that receives voice packets, a reception buffer 32 that temporarily stores received packets, a packet loss detection unit 110 that detects the occurrence of packet loss, a main speaker determination unit 120 that determines which of the received voice packets belongs to the speaker who is mainly speaking, a packet loss flag generation unit 140 that generates the packet loss flag, a voice waveform decoding unit 130 that decodes the voice code contained in a voice packet into a voice waveform, an addition unit 150 that adds the voice signals, a voice waveform encoding unit 160 that encodes the voice waveform, a packetization unit 170 that puts the voice code and the packet loss flag into a packet, and a packet sending unit 23 that transmits the packet. As in FIG. 5, FIG. 6 is an example in which voice packets from three points are mixed and the resulting voice packet is sent to one point, but there may be four or more source points, and there may be two or more destinations, which may include the source points themselves. In a multipoint voice conference, the mixing result is generally returned to the sources.

Voice packets A to C are received by their respective packet receiving units 31 (S31) and stored in the reception buffers 32 (S32). The stored voice packets are decoded into voice signals in packet order by voice waveform decoding units 130A to 130C (S130). The addition unit 150 adds the decoded voice signals (S150), and the voice waveform encoding unit 160 re-encodes the sum. When packet loss occurs on the upstream path to the multipoint mixing unit, the corresponding voice waveform decoding unit outputs a zero signal (silence). This is equivalent to decoding and adding only the voice from points where no packet loss occurred.
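The addition step with silence substituted for lost packets can be sketched as below. The dictionary keyed by sending point, the use of `None` to mark a lost frame, and the clipping to the 16-bit PCM range are assumptions for this example and are not specified in the text.

```python
def mix_frames(decoded_frames, frame_len):
    """Add the decoded frames of all sending points sample by sample.

    `decoded_frames` maps a point name to its decoded samples, or to None
    when that point's packet was lost upstream. A lost frame contributes
    silence, so no concealment is run at the mixer.
    """
    mixed = [0] * frame_len
    for frame in decoded_frames.values():
        if frame is None:
            continue  # lost upstream: treated as a zero signal
        for i, sample in enumerate(frame):
            mixed[i] += sample
    # Assumption: clip to the 16-bit PCM range before re-encoding.
    return [max(-32768, min(32767, s)) for s in mixed]
```

For example, `mix_frames({"A": [1, 2], "B": None, "C": [10, 20]}, 2)` adds only the frames from A and C.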

The packet loss detection unit 110 monitors the reception state of voice packets from each point and determines whether packet loss has occurred (S111). The main speaker determination unit 120 uses information contained in voice packets A to C to determine, from moment to moment, which of the speakers at sending points A to C is mainly speaking at that time (S121). For example, if only the speaker at point B is speaking and the speakers at the other two points are silent, the main speaking point is B and the main voice packet is voice packet B. When two or more people speak at the same time, the main speaker is determined from the loudness of the voice and how long each has been speaking.

The determination "using information contained in voice packets A to C" may be made by analyzing the voice code contained in the packets. Alternatively, if the sending side has included in voice packets A to C additional information for determining the main speaker, that information may be used. The latter is preferable from the standpoint of reducing server load. Such additional information, embedded in the packet by the sending side, may be, for example, an identifier indicating whether the section is speech or non-speech, or power information.
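One way the main speaker determination could work from sender-supplied side information is sketched below. The field names `is_speech` and `power`, the hysteresis factor `margin`, and the rule of keeping the previous main speaker when nobody is currently speaking are all assumptions chosen for illustration; the text only says that loudness and the continuation of speech are taken into account.

```python
def determine_main_speaker(side_info, previous_main=None, margin=2.0):
    """Pick the main speaking point for the current frame.

    `side_info` maps a point name to {"is_speech": bool, "power": float},
    or to None when that point's packet was lost.
    """
    active = {
        point: info["power"]
        for point, info in side_info.items()
        if info is not None and info["is_speech"]
    }
    if not active:
        # Assumption: with no speech observed, keep the previous decision.
        return previous_main
    loudest = max(active, key=active.get)
    # Assumption: a speaker who already holds the floor keeps it unless
    # another point is louder by at least `margin` times.
    if previous_main in active and active[loudest] < margin * active[previous_main]:
        return previous_main
    return loudest
```

Because the decision uses only two small fields per packet, the mixer does not need to analyze the voice code itself, which matches the stated preference for the lower-load option.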

The packet loss flag generation unit 140 sets the packet loss flag to, for example, 1 (S141) when the packet loss detection unit 110 detects loss of one or more of voice packets A to C (S112) and the point whose packet was lost was the main speaker at the immediately preceding time (S122), as when point B was the main speaker at the preceding time and voice packet B is determined to be lost at the next time. Otherwise it sets the flag to 0 (S142). The "otherwise" cases include, for example, the main speaker being at point A while voice packet C is lost, and no packet from any point being lost. In voice packet communication, voice is generally divided into sections of about 10 to 20 milliseconds called frames, encoded, and sent as packets, so "time" here means the time order in frame units. The time order is generally recorded as a timestamp in the packet header.
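The flag rule described above (steps S112, S122, S141, and S142) reduces to a single condition, sketched here. The function name and the representation of lost points as a set are assumptions for the example.

```python
def packet_loss_flag(lost_points, previous_main):
    """Return 1 if the point that was main speaker in the previous frame
    is among the points whose packet is lost in this frame, else 0."""
    if previous_main is not None and previous_main in lost_points:
        return 1
    return 0
```

For instance, `packet_loss_flag({"B"}, "B")` yields 1, while `packet_loss_flag({"C"}, "A")` and `packet_loss_flag(set(), "A")` both yield 0, mirroring the examples in the text.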

The packetization unit 170 combines the voice code output by the voice waveform encoding unit 160 with the packet loss flag output by the packet loss flag generation unit 140 to generate a single packet (S170), and the packet sending unit 23 transmits voice packet M (S23).
FIG. 8 shows a configuration example of the voice signal receiving unit of the present invention, and FIG. 9 shows its processing flow. The voice signal receiving unit 400 comprises a packet receiving unit 31 that receives voice packets, a reception buffer 32 that temporarily stores received packets, a packet loss detection unit 410 that detects the occurrence of packet loss, a packet loss flag detection unit 420 that detects the packet loss flag indicating that packet loss occurred, a packet loss concealment 430, a voice waveform decoding unit 440, a switch unit 460 that selects the output voice, and an OR unit 450 that controls the switch. Voice packet M is received by the packet receiving unit 31 (S31) and stored in the reception buffer 32 (S32). The voice code in each stored voice packet is decoded into a voice waveform by the voice waveform decoding unit 440 in timestamp order (S440). The switch unit 460 is normally set to the voice waveform decoding unit 440 side (S461), so the output of the voice waveform decoding unit 440 is output as the output of the voice signal receiving unit.
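Voice packet M has to carry both the voice code and the packet loss flag, so that the mixer can write the flag and the receiver can read it back. The text does not define a wire format; the layout below (a 32-bit frame timestamp, an 8-bit flag, then the voice code) and the names `build_packet` and `parse_packet` are purely hypothetical.

```python
import struct

# Hypothetical header: 32-bit frame timestamp, 8-bit packet loss flag,
# both in network byte order, with no padding.
HEADER = struct.Struct("!IB")


def build_packet(timestamp, loss_flag, voice_code):
    """Mixer side: prepend the header to the encoded voice bytes."""
    return HEADER.pack(timestamp, loss_flag) + voice_code


def parse_packet(data):
    """Receiver side: recover (timestamp, loss_flag, voice_code)."""
    timestamp, loss_flag = HEADER.unpack_from(data)
    return timestamp, loss_flag, data[HEADER.size:]
```

In an RTP-based system the flag might instead travel in a header extension or a payload-specific field; that choice is outside what the text specifies.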

The packet loss detection unit 410 monitors the reception state of voice packets (S411) and determines whether packet loss has occurred (S412). The packet loss flag detection unit 420 monitors the packet loss flag embedded in each voice packet (S421) and determines whether it indicates a packet loss state (S422). The OR unit 450 switches the switch unit 460 to the packet loss concealment side (S462) when either the packet loss detection unit 410 detects packet loss or the packet loss flag detection unit 420 detects a packet loss flag indicating a packet loss state in the packet.
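The receiver's switching logic can be sketched as follows. The dictionary used as a reception buffer, the packet fields `loss_flag` and `payload`, and the `decode` and `conceal` callables are assumptions standing in for units 32, 440, and 430.

```python
def receive_frame(reception_buffer, frame_no, decode, conceal):
    """Produce the output for one frame, choosing decode or concealment.

    `reception_buffer` maps frame numbers to packets of the form
    {"loss_flag": 0 or 1, "payload": ...}; a missing key means the packet
    was lost on the downstream path or arrived too late.
    """
    packet = reception_buffer.get(frame_no)
    lost_downstream = packet is None
    flagged_upstream = packet is not None and packet["loss_flag"] == 1
    # OR of the two conditions selects the concealment side of the switch.
    if lost_downstream or flagged_upstream:
        return conceal()
    return decode(packet["payload"])
```

Note that a flagged packet is discarded even though it arrived intact, which is the "treated as lost" behavior the description calls for.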

The packet loss concealment 430 may use the same method as the conventional arrangement of FIG. 3 (S430); that is, it may consist of the lost signal generation unit 34, the output voice buffer 35, and the pitch extraction unit 36. Conventional packet loss concealment techniques may store the output voice in a buffer and use it, or extract and use redundant information embedded in the incoming voice packet M.

With the multipoint mixing unit and voice signal receiving unit configured in this way, even when packet loss occurs on the upstream path, the multipoint mixing unit only adds the information that "packet loss occurred upstream". Packet loss concealment is carried out in the voice signal receiving unit (receiving terminal) whether the loss occurred upstream or downstream. Therefore, voice with little perceptible degradation can be reproduced even when upstream packet loss occurs, without increasing the load on the multipoint mixing unit.

In the present invention, packet loss concealment is not performed when the lost upstream packet comes from a point other than the one the multipoint mixing unit has determined to be the main speaking point. This is because losing a packet from a point where no one is speaking has little effect on perceived sound quality.

Voice communication over IP networks is becoming widespread, and applying the present invention makes it possible to realize inexpensive, highly reliable multipoint voice communication (multipoint voice conferencing).

FIG. 1 shows the configuration of one-to-one voice communication over an IP communication network.
FIG. 2 shows a typical configuration example of the voice signal transmission unit.
FIG. 3 shows a typical configuration example of the voice signal receiving unit.
FIG. 4 shows the configuration of multipoint voice communication over an IP communication network.
FIG. 5 shows the flow of voice signals in multipoint voice communication.
FIG. 6 shows a configuration example of the multipoint mixing unit of the present invention.
FIG. 7 shows the processing flow of the multipoint mixing unit of the present invention.
FIG. 8 shows a configuration example of the voice signal receiving unit of the present invention.
FIG. 9 shows the processing flow of the voice signal receiving unit of the present invention.

Claims

When performing multipoint voice packet communication between two or more communication devices including at least a transmission unit and one or more communication devices including at least a reception unit,
In the transmitter,
A process of generating a frame audio signal by dividing the audio signal at regular intervals called frames,
Converting the frame audio signal into an audio code, storing it in a packet and transmitting it;
In the multi-point mixing section,
Storing received packets in the receive buffer;
The process of specifying the frame number to be extracted;
A loss detection process for determining whether or not a packet including a voice code corresponding to the frame number to be taken out is stored in the reception buffer;
Decoding speech code into speech waveform;
Adding the decoded speech waveform; and
Encoding the added speech waveform to generate an added speech code;
In the receiver,
Storing received packets in the receive buffer;
The process of specifying the frame number to be extracted;
A loss detection process for determining whether or not a packet including a voice code corresponding to the frame number to be taken out is stored in the reception buffer;
in a voice signal packet communication method comprising the above processes,
In the multipoint mixing section,
A packet loss flag generation process for generating a flag indicating packet loss (hereinafter referred to as a “packet loss flag”) when it is determined that a packet loss has occurred in the loss detection process;
A packetizing process for storing the packet loss flag and the added voice code in a packet and transmitting the packet;
In the receiving unit,
A flag determination process for determining whether the packet loss flag is included in the received packet;
A voice packet decoding process that, when the loss detection process determines that a packet including the voice code corresponding to the frame number to be extracted is stored in the reception buffer and the flag determination process determines that the packet loss flag is not included in that packet, extracts the voice code from the packet stored in the reception buffer and decodes it into a voice waveform to produce a frame output voice signal;
A loss handling process that performs packet loss concealment when the loss detection process determines that a packet including a frame audio signal corresponding to the frame number to be extracted is not stored in the reception buffer, or when the flag determination process determines that the packet loss flag is included in the packet; and
A process of concatenating and outputting the frame output voice signals produced by the voice packet decoding process or the loss handling process.

The voice signal packet communication method according to claim 1,
In the multipoint mixing section,
A main speaker determination process for determining a main speaker from the extracted voice codes from the plurality of transmitters,
wherein the packet loss flag generation process generates the packet loss flag only when the loss detection process determines that a packet loss has occurred and the main speaker determination process determines that the lost packet is from the transmitter that was the main speaker in the frame immediately before the packet loss occurred.

When performing multipoint voice packet communication between two or more communication devices including at least a transmission unit and one or more communication devices including at least a reception unit,
In the multipoint mixing section,
A process of storing received packets in a reception buffer;
A process of specifying the frame number to be extracted;
A loss detection process for determining whether or not a packet including a voice code corresponding to the frame number to be extracted is stored in the reception buffer;
A process of decoding the voice codes into voice waveforms;
A process of adding the decoded voice waveforms; and
A process of encoding the added voice waveform to generate an added voice code,
In a multipoint mixing method for voice signal packets having the above processes,
In the multipoint mixing section,
A packet loss flag generation process for generating a packet loss flag when it is determined that a packet loss has occurred in the loss detection process;
A multipoint mixing method, comprising: a packetization process in which the packet loss flag and the added voice code are stored in a packet and transmitted.
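
For illustration only, one mixing cycle of this method can be sketched as follows. The header layout (`HEADER`), the per-transmitter dictionary buffers, and the toy 16-bit PCM codec (`pcm_decode`, `pcm_encode`) are assumptions made for this sketch; the claim does not specify a packet format or a codec.

```python
import struct
from typing import Dict, List

# Hypothetical header: 32-bit frame number followed by a 1-byte loss flag.
HEADER = struct.Struct("!IB")


def pcm_decode(code: bytes) -> List[int]:
    """Toy stand-in for a speech decoder: big-endian 16-bit PCM."""
    return list(struct.unpack(f"!{len(code) // 2}h", code))


def pcm_encode(wave: List[int]) -> bytes:
    """Toy stand-in for a speech encoder, clipping to the 16-bit range."""
    clipped = [max(-32768, min(32767, s)) for s in wave]
    return struct.pack(f"!{len(clipped)}h", *clipped)


def mix_frame(frame_number: int,
              buffers: Dict[str, Dict[int, bytes]],
              frame_len: int) -> bytes:
    """Mix one frame; mark the output packet if any input packet was lost."""
    mixed = [0] * frame_len
    lost = False
    for buf in buffers.values():
        code = buf.pop(frame_number, None)
        if code is None:
            lost = True      # loss detected; no concealment is run here
            continue
        wave = pcm_decode(code)
        mixed = [m + w for m, w in zip(mixed, wave)]
    return HEADER.pack(frame_number, int(lost)) + pcm_encode(mixed)


# Transmitter "A" delivered frame 7; transmitter "B" did not.
buffers = {"A": {7: pcm_encode([100, -200])}, "B": {}}
packet = mix_frame(7, buffers, frame_len=2)
header = HEADER.unpack(packet[:HEADER.size])
payload = pcm_decode(packet[HEADER.size:])
```

The mixing section only records that a loss occurred; the cost of concealment is left to the receivers, which is the load reduction the abstract describes.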

A multipoint mixing method for voice signal packets according to claim 3,
In the multipoint mixing section,
A main speaker determination process for determining a main speaker from the voice codes extracted from the plurality of transmitters,
The multipoint mixing method, wherein the packet loss flag generation process generates the packet loss flag only when it is determined in the loss detection process that a packet loss has occurred and it is determined in the main speaker determination process that the lost packet is from the transmitter that was the main speaker in the frame immediately before the packet loss occurred.

When performing multipoint voice packet communication between two or more communication devices including at least a transmission unit and one or more communication devices including at least a reception unit,
In the receiver,
A process of storing received packets in a reception buffer;
A process of specifying the frame number to be extracted;
A loss detection process for determining whether or not a packet including a voice code corresponding to the frame number to be extracted is stored in the reception buffer,
In a voice signal packet receiving method having the above processes,
In the receiver,
A flag determination process for determining whether a packet loss flag is included in the received packet;
A voice packet decoding process for extracting the voice code from the packet stored in the reception buffer and decoding it into a voice waveform to form a frame output voice signal, when it is determined in the loss detection process that a packet including the voice code corresponding to the frame number to be extracted is stored in the reception buffer and it is determined in the flag determination process that the packet loss flag is not included in the packet;
A loss handling process for performing packet loss concealment, when it is determined in the loss detection process that a packet including the frame audio signal corresponding to the frame number to be extracted is not stored in the reception buffer, or when it is determined in the flag determination process that the packet loss flag is included in the packet; and
A voice signal packet receiving method comprising: a process of concatenating and outputting the frame output voice signals output from the voice packet decoding process or the loss handling process.
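
For illustration only, the concatenation of decoded and concealed frames into one output signal can be sketched as below. The concealment rule used here (repeat the previous output frame at half amplitude, or silence if there is none) is a deliberately simple assumption and is not the concealment method of the invention; the buffer layout and the names `conceal` and `receive` are likewise hypothetical, and decoding is replaced by copying already-decoded samples.

```python
from typing import Dict, List, Tuple

Frame = List[float]


def conceal(previous: Frame, frame_len: int) -> Frame:
    """Assumed toy concealment: attenuated repeat of the previous frame."""
    if not previous:
        return [0.0] * frame_len
    return [0.5 * s for s in previous]


def receive(buffer: Dict[int, Tuple[bool, Frame]],
            first: int, count: int, frame_len: int) -> Frame:
    """Concatenate per-frame outputs for frame numbers first .. first+count-1.

    Each buffer entry is (loss_flag, samples); the samples stand in for the
    result of decoding the packet's voice code.
    """
    output: Frame = []
    previous: Frame = []
    for n in range(first, first + count):
        entry = buffer.get(n)
        if entry is None or entry[0]:
            frame = conceal(previous, frame_len)   # loss handling branch
        else:
            frame = list(entry[1])                 # decoding branch
        output.extend(frame)                       # concatenate frame outputs
        previous = frame
    return output


# Frame 1 carries the loss flag and frame 2 is missing from the buffer.
buffer = {0: (False, [1.0, 2.0]), 1: (True, [9.0, 9.0]), 3: (False, [4.0, 4.0])}
signal = receive(buffer, first=0, count=4, frame_len=2)
```

Note that the payload of the flagged frame 1 is discarded in this sketch, because the flag indicates that the mixed signal it carries is missing an upstream contribution.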

When performing multipoint voice packet communication between two or more communication devices including at least a transmission unit and one or more communication devices including at least a reception unit,
In the transmitter,
Means for generating a frame audio signal by dividing the audio signal at regular intervals called frames;
Means for converting the frame audio signal into an audio code, storing it in a packet and transmitting it;
In the multipoint mixing section,
Means for storing received packets in a reception buffer;
Means for specifying the frame number to be extracted;
Loss detection means for determining whether or not a packet including a voice code corresponding to the frame number to be extracted is stored in the reception buffer;
Means for decoding the voice codes into voice waveforms;
Means for adding the decoded voice waveforms;
Means for encoding the added voice waveform to generate an added voice code;
In the receiver,
Means for storing received packets in a reception buffer;
Means for specifying the frame number to be extracted;
Loss detection means for determining whether or not a packet including a voice code corresponding to the frame number to be extracted is stored in the reception buffer,
In a voice signal packet communication system comprising the above means,
In the multipoint mixing section,
A packet loss flag generating means for generating a packet loss flag when it is determined by the loss detecting means that a packet loss has occurred;
Packetizing means for storing the packet loss flag and the added voice code in a packet and transmitting the packet;
In the receiver,
Flag determination means for determining whether or not the packet loss flag is included in the received packet;
Voice packet decoding means for extracting the voice code from the packet stored in the reception buffer and decoding it into a voice waveform to form a frame output voice signal, when the loss detection means determines that a packet including the voice code corresponding to the frame number to be extracted is stored in the reception buffer and the flag determination means determines that the packet loss flag is not included in the packet;
Loss processing means for performing packet loss concealment, when the loss detection means determines that a packet including the frame audio signal corresponding to the frame number to be extracted is not stored in the reception buffer, or when the flag determination means determines that the packet loss flag is included in the packet; and
A voice signal packet communication system comprising: means for concatenating and outputting the frame output voice signals output from the voice packet decoding means or the loss processing means.

The voice signal packet communication system according to claim 6,
In the multipoint mixing section,
Main speaker determination means for determining a main speaker from the voice codes extracted from the plurality of transmitters,
The voice signal packet communication system, wherein the packet loss flag generation means generates the packet loss flag only when the loss detection means determines that a packet loss has occurred and the main speaker determination means determines that the lost packet is from the transmitter that was the main speaker in the frame immediately before the packet loss occurred.

Means for storing received packets in a reception buffer;
Means for specifying the frame number to be extracted;
Loss detection means for determining whether or not a packet including a voice code corresponding to the frame number to be extracted is stored in the reception buffer;
Means for decoding the voice codes into voice waveforms;
Means for adding the decoded voice waveforms;
Means for encoding the added voice waveform to generate an added voice code,
In a multipoint mixing apparatus for voice signal packets comprising the above means,
A packet loss flag generating means for generating a packet loss flag when it is determined by the loss detecting means that a packet loss has occurred;
A multipoint mixing apparatus comprising: packetizing means for storing the packet loss flag and the added voice code in a packet and transmitting the packet.

The multipoint mixing apparatus according to claim 8, wherein
Main speaker determination means for determining a main speaker from the voice codes extracted from the plurality of transmitters,
The multipoint mixing apparatus, wherein the packet loss flag generation means generates the packet loss flag only when the loss detection means determines that a packet loss has occurred and the main speaker determination means determines that the lost packet is from the transmitter that was the main speaker in the frame immediately before the packet loss occurred.

Means for storing received packets in a reception buffer;
Means for specifying the frame number to be extracted;
Loss detection means for determining whether or not a packet including a voice code corresponding to the frame number to be extracted is stored in the reception buffer,
In a voice signal packet receiving apparatus comprising the above means,
Flag determining means for determining whether or not a packet loss flag is included in the received packet;
Voice packet decoding means for extracting the voice code from the packet stored in the reception buffer and decoding it into a voice waveform to form a frame output voice signal, when the loss detection means determines that a packet including the voice code corresponding to the frame number to be extracted is stored in the reception buffer and the flag determination means determines that the packet loss flag is not included in the packet;
Loss processing means for performing packet loss concealment, when the loss detection means determines that a packet including the frame audio signal corresponding to the frame number to be extracted is not stored in the reception buffer, or when the flag determination means determines that the packet loss flag is included in the packet; and
A voice signal packet receiving apparatus comprising: a means for concatenating and outputting frame output voice signals output from the voice packet decoding means or the loss processing means.