TW202325008A - Method for automatically switching on/off sound receiving channel of video conference and electronic device

Info

Publication number: TW202325008A
Application number: TW110146730A
Authority: TW
Inventors: 阮鈺珊; 曹淩帆
Original assignee: 宏碁股份有限公司
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2023-06-16

Abstract

A method for automatically switching on/off a sound receiving channel of a video conference and an electronic device are provided. The method includes the following steps. First, volume information of an electronic device is detected. If the volume information is greater than a threshold condition, a video image of a participant in the video conference of the electronic device is received. Then, whether the participant is in a speaking state is determined according to the video image. If the participant is not in the speaking state, the sound receiving channel of the video conference is switched off.

Description

Method for automatically switching sound reception of a video conference and electronic device

The present invention relates to an automatic switching method and an electronic device, and in particular to a method for automatically switching sound reception of a video conference and an electronic device.

In response to the demand for remote work, online conferencing software is being used more and more. Company employees can hold work discussions through online meetings to reduce face-to-face contact.

Because a participant's surroundings affect the quality of a meeting, participants often need to switch the conference microphone between a mute mode and a sound receiving mode themselves. However, participants frequently forget to switch the microphone to the sound receiving mode when they want to speak, so the other participants cannot hear them; or they forget to switch the microphone to the mute mode when they are not speaking, so sounds from their surroundings are carried through the conference and heard by the other participants.

The present invention relates to a method for automatically switching sound reception of a video conference and an electronic device, which can determine from a participant's video image whether the participant is currently speaking, and then automatically switch the sound receiving channel of the video conference on or off. In addition, the present invention uses a two-stage determination to help confirm whether it is the participant who is currently speaking.

According to one aspect of the present invention, a method for automatically switching sound reception of a video conference is provided. The method includes the following steps. First, volume information of an electronic device is detected. If the volume information is greater than a threshold condition, a video image of a participant of the video conference of the electronic device is received. Then, whether the participant is in a speaking state is determined according to the video image. If the participant is not in the speaking state, the sound receiving channel of the video conference is switched off.

According to another aspect of the present invention, an electronic device is provided. The electronic device includes a sound receiving unit, a processing unit and a camera unit. The sound receiving unit is used for detecting volume information. The processing unit is used for executing a video conference and receiving the volume information. The camera unit is used for obtaining a video image of a participant of the video conference. If the processing unit determines that the volume information is greater than a threshold condition, it receives the video image and determines whether the participant is in a speaking state according to the video image. If the processing unit determines that the participant is not in the speaking state, it switches off the sound receiving channel of the video conference.

To provide a better understanding of the above and other aspects of the present invention, embodiments are described in detail below with reference to the accompanying drawings:

Various embodiments of the present invention are described in detail below and illustrated with the accompanying drawings. Beyond these detailed descriptions, the present invention can also be widely implemented in other embodiments; any easy replacement, modification or equivalent variation of the described embodiments falls within the scope of the present invention, which is defined by the claims that follow. Many specific details and implementation examples are provided in the specification to give readers a more complete understanding of the present invention; however, these specific details and implementation examples should not be regarded as limitations of the present invention. In addition, well-known steps or elements are not described in detail, to avoid unnecessarily limiting the present invention.

Please refer to FIG. 1, which is a block diagram of an electronic device 100 according to an embodiment of the present invention. The electronic device 100 is, for example, a notebook computer, a desktop computer, a tablet computer or a smartphone. The electronic device 100 includes a sound receiving unit 110, a camera unit 120, a display unit 130 and a processing unit 140. The sound receiving unit 110 is a sound receiving element installed on the electronic device 100, such as a microphone, and is used for receiving various sounds to detect volume information. The camera unit 120 is used for capturing various images and is, for example, a camera. The display unit 130 is used for displaying various information and is, for example, a display panel. The processing unit 140 is used for executing various processing procedures and is, for example, a circuit board, a chip, a circuit, a computer program product, or a computer-readable recording medium. The processing unit 140 includes a judgment module 141, an image processing module 142, a recognition module 143 and a sound receiving channel switching module 144. The judgment module 141 is used for receiving the volume information of the sound receiving unit 110 and performing a judgment procedure on it. The image processing module 142 is used for performing image processing on the image from the camera unit 120. The recognition module 143 is used for performing image recognition on the processed image by using an artificial intelligence algorithm. The sound receiving channel switching module 144 is used for switching the sound receiving channel of the video conference on or off according to the recognition result of the recognition module 143.

FIG. 2 is a flow chart of a method for automatically switching sound reception of a video conference according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 2, in step S110, the electronic device 100 executes a video conference. The user can, for example, click a program on the electronic device 100 so that the processing unit 140 executes the video conference and displays the video conference screen on the display unit 130.

In step S120, the sound receiving unit 110 detects the volume information of the electronic device 100. Then, in step S130, the judgment module 141 determines whether the volume information is greater than the threshold condition. If the judgment module 141 determines that the volume information is greater than the threshold condition, the method proceeds to step S140; if the judgment module 141 determines that the volume information is not greater than the threshold condition, the method returns to step S120. Here, the judgment module 141 detects whether there is an input signal according to the volume information of the sound receiving unit 110. For example, when the input signal exceeds a certain volume level, such as a threshold condition of 5% volume, it can be determined either that someone is currently speaking (although it is not certain whether it is the participant or another participant who is speaking) or that the ambient sound is too loud.
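
The threshold check in steps S120 and S130 can be sketched as follows. This is a minimal illustration, not the patented implementation: the description does not say how the volume information is measured, so the RMS level of a 16-bit PCM frame is an assumption, and the names `volume_percent`, `exceeds_threshold` and `FULL_SCALE` are hypothetical. Only the 5% figure comes from the description.

```python
import math
from typing import Sequence

FULL_SCALE = 32768          # assumption: 16-bit signed PCM samples
THRESHOLD_PERCENT = 5.0     # the 5% example given in the description


def volume_percent(samples: Sequence[int]) -> float:
    """Return the RMS level of one audio frame as a percentage of full scale."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 100.0 * rms / FULL_SCALE


def exceeds_threshold(samples: Sequence[int],
                      threshold_percent: float = THRESHOLD_PERCENT) -> bool:
    """Step S130: True means go on to step S140, False means return to step S120."""
    return volume_percent(samples) > threshold_percent
```

A frame that passes this check only indicates that some sound is present; the video-based check in the following steps is what decides whether the participant is the source.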

In step S140, the processing unit 140 receives the video image of the participant of the video conference. Next, in step S150, the processing unit 140 determines whether the participant is in a speaking state according to the video image. In one embodiment, the processing unit 140 can determine whether the participant is in the speaking state by detecting a mouth shape change or a gesture of the participant in the video image. Specifically, the processing unit 140 can use an artificial intelligence algorithm to perform image recognition on the mouth shape change or the gesture of the participant in the video image to determine whether the participant is in the speaking state. For example, in step S140, the image processing module 142 receives the video image; then, in step S150, it first performs image processing on the video image, such as background removal, to extract a person-only image of the participant. Next, the recognition module 143 uses the artificial intelligence algorithm to perform image recognition of the mouth shape change or the gesture on this person-only image to determine whether the participant is in the speaking state.

The manner in which the recognition module 143 is trained with the artificial intelligence algorithm is further described below.

FIG. 3 is a flow chart of training with an artificial intelligence algorithm according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 3, in step S210, the camera unit 120 first captures an image containing only the background. Next, in step S220, the camera unit 120 captures an image containing both the background and a person. Then, in step S230, the image processing module 142 extracts a person-only image, for example through background removal. Afterwards, steps S210~S230 can be repeated under different backgrounds and/or different person conditions to obtain multiple person-only images, which can serve as the image training set for the recognition module 143. In step S240, the recognition module 143 is trained with an artificial intelligence algorithm, such as a convolutional neural network (CNN), a neural-network-type structure (such as a Softmax function) and/or other suitable artificial intelligence algorithms.
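
Two pieces of this training flow can be sketched in a few lines. The description says only that the person-only image is obtained through background removal, so the per-pixel differencing against the background-only image from step S210 shown here is an assumption, as are the tolerance value and the names `remove_background` and `softmax`. The Softmax function itself is named in the description; the version below is the standard numerically stable form.

```python
import math
from typing import List

Image = List[List[int]]  # grayscale image: rows of pixel values in 0..255


def remove_background(background: Image, frame: Image, tol: int = 20) -> Image:
    """Keep pixels that differ from the background-only image by more than tol.

    Pixels close to the background are set to 0, leaving a person-only image.
    """
    return [
        [p if abs(p - b) > tol else 0 for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def softmax(logits: List[float]) -> List[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

In a classifier with two output classes, for instance speaking and not speaking, `softmax` would be applied to the two final scores and the larger probability taken as the recognition result.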

The different person conditions mentioned above may include the same person or different people speaking with the mouth open, keeping the mouth closed without speaking, making an "OK" gesture while wearing a mask (indicating that the person is currently speaking), or making a "cross" gesture while wearing a mask (indicating that the person is not currently speaking). For example, FIG. 4A to FIG. 4G show embodiments of different captured images IMGa~IMGg, where FIG. 4A to FIG. 4D and FIG. 4E to FIG. 4G are images IMGa~IMGg of the same person (such as participant H) captured against two different backgrounds. FIG. 4A and FIG. 4E show the person speaking with the mouth open; after the image processing module 142 extracts the person-only image from these images IMGa and IMGe, the user labels them "start speaking" and feeds them to the recognition module 143 for training. FIG. 4B and FIG. 4F show the person with the mouth closed and not speaking; after the image processing module 142 extracts the person-only image from these images IMGb and IMGf, the user labels them "stop speaking" and feeds them to the recognition module 143 for training. FIG. 4C shows the person making an "OK" gesture while wearing a mask; after the image processing module 142 extracts the person-only image from this image IMGc, the user labels it "start speaking" and feeds it to the recognition module 143 for training. FIG. 4D and FIG. 4G show the person making a "cross" gesture while wearing a mask; after the image processing module 142 extracts the person-only image from these images IMGd and IMGg, the user labels them "stop speaking" and feeds them to the recognition module 143 for training. In this way, the trained recognition module 143 can determine whether a person (such as participant H) is speaking according to the mouth shape change M (as in FIGS. 4A, 4B, 4E and 4F) or the gesture G (as in FIGS. 4C, 4D and 4G).
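
The labelling described for images IMGa~IMGg can be written down as a small data structure. The image names, cues and labels below come from the description; the `TrainingSample` and `Label` types and the `cue` field are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Label(Enum):
    START_SPEAKING = "start speaking"
    STOP_SPEAKING = "stop speaking"


@dataclass(frozen=True)
class TrainingSample:
    image_id: str   # name of the captured image, after background removal
    cue: str        # "mouth" (mouth shape change M) or "gesture" (gesture G)
    label: Label    # annotation supplied by the user


TRAINING_SET: List[TrainingSample] = [
    TrainingSample("IMGa", "mouth", Label.START_SPEAKING),    # mouth open
    TrainingSample("IMGb", "mouth", Label.STOP_SPEAKING),     # mouth closed
    TrainingSample("IMGc", "gesture", Label.START_SPEAKING),  # "OK" gesture, mask on
    TrainingSample("IMGd", "gesture", Label.STOP_SPEAKING),   # "cross" gesture, mask on
    TrainingSample("IMGe", "mouth", Label.START_SPEAKING),    # mouth open, second background
    TrainingSample("IMGf", "mouth", Label.STOP_SPEAKING),     # mouth closed, second background
    TrainingSample("IMGg", "gesture", Label.STOP_SPEAKING),   # "cross" gesture, second background
]
```

Keeping the cue alongside the label makes it easy to check that both mouth-based and gesture-based examples are present for each class before training.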

Referring to FIG. 1 and FIG. 2, when it is determined in step S150 that the participant is not in the speaking state, the method proceeds to step S160, in which the sound receiving channel switching module 144 automatically switches off the sound receiving channel of the video conference, so that sounds from the surroundings are not carried through the video conference and heard by the other participants. When it is determined that the participant is in the speaking state, the method proceeds to step S170, in which the sound receiving channel switching module 144 automatically switches on the sound receiving channel of the video conference, without the participant having to adjust it.
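
The branch structure of steps S130 to S170 can be summarised as a single decision function. This is a sketch under stated assumptions: the names `Channel` and `next_channel_state` are hypothetical, the volume is assumed to be expressed as a percentage, and the speaking check is passed in as a callable so that image recognition runs only after the volume threshold is exceeded, as in the flow chart.

```python
from enum import Enum
from typing import Callable


class Channel(Enum):
    ON = "on"
    OFF = "off"


def next_channel_state(current: Channel,
                       volume_percent: float,
                       is_speaking: Callable[[], bool],
                       threshold_percent: float = 5.0) -> Channel:
    """One pass through steps S130 to S170.

    current: present state of the sound receiving channel.
    is_speaking: callable standing in for the video-based check of step S150.
    """
    if volume_percent <= threshold_percent:
        # S130 "no": return to S120 and leave the channel as it is.
        return current
    # S150: consult the video image only now that sound has been detected.
    return Channel.ON if is_speaking() else Channel.OFF  # S170 / S160
```

For example, `next_channel_state(Channel.ON, 30.0, lambda: False)` yields `Channel.OFF`, which corresponds to loud ambient sound while the participant is silent.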

FIG. 5 shows an embodiment of a video conference VC. Referring to FIG. 1, FIG. 2 and FIG. 5, when the judgment module 141 determines that the volume information is greater than the threshold condition, the camera unit 120 sends the video image of the participant H to the image processing module 142 for image processing. The recognition module 143 performs image recognition on the processed image and determines, according to the mouth shape change M of the participant H, that the participant H is not currently in the speaking state, so the volume information probably comes from sounds in the participant's surroundings. The sound receiving channel switching module 144 then automatically switches off the sound receiving channel MIC of the video conference VC, setting it to mute, to avoid disturbing the video conference VC.

In addition, in another embodiment, if the participant H is wearing a mask and also forgets to make a gesture, so that the recognition module 143 can detect neither the mouth shape change M nor the gesture G of the participant H, a prompt message can be displayed on the display unit 130 to remind the participant H to make an appropriate gesture G.
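
The fallback described here adds a third outcome to the recognition result: neither cue is detectable. A minimal sketch follows, assuming the recognition stage reports the mouth state as `True`, `False` or `None` and the gesture as `"ok"`, `"cross"` or `None`. These encodings, the function name `resolve_speaking_state`, the prompt wording, and the choice to consult the mouth before the gesture are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical wording; the description only says that a prompt message is displayed.
PROMPT = "Please make an OK or cross gesture to indicate whether you are speaking."


def resolve_speaking_state(mouth_moving: Optional[bool],
                           gesture: Optional[str]) -> Tuple[Optional[bool], Optional[str]]:
    """Return (speaking, prompt).

    mouth_moving: True or False when the mouth is visible, None when it cannot
        be detected (for example, behind a mask).
    gesture: "ok", "cross", or None when no gesture is detected.
    """
    if mouth_moving is not None:
        return mouth_moving, None      # mouth shape change M decides
    if gesture == "ok":
        return True, None              # gesture G: currently speaking
    if gesture == "cross":
        return False, None             # gesture G: not speaking
    return None, PROMPT                # neither cue detected: show the reminder
```

When the first element of the result is `None`, the caller would leave the channel unchanged and show the returned prompt on the display unit.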

The method for automatically switching sound reception of a video conference and the electronic device proposed by the present invention first detect the volume information of the electronic device; only when the volume information is greater than a preset threshold condition is the participant's video image then used to determine whether the source of the volume information is the participant. If so, the sound receiving channel of the video conference is automatically switched on; if not, the sound receiving channel of the video conference is automatically switched off, thereby improving the user's video conference experience.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Those with ordinary knowledge in the technical field of the present invention can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be as defined by the appended claims.

100: electronic device
110: sound receiving unit
120: camera unit
130: display unit
140: processing unit
141: judgment module
142: image processing module
143: recognition module
144: sound receiving channel switching module
VC: video conference
G: gesture
H: participant
IMGa~IMGg: images
M: mouth shape change
MIC: sound receiving channel
S110~S170, S210~S240: steps

FIG. 1 is a block diagram of an electronic device according to an embodiment of the present invention; FIG. 2 is a flow chart of a method for automatically switching sound reception of a video conference according to an embodiment of the present invention; FIG. 3 is a flow chart of training with an artificial intelligence algorithm according to an embodiment of the present invention; FIG. 4A to FIG. 4G show embodiments of different captured images; and FIG. 5 shows an embodiment of a video conference.

S110~S170: steps

Claims

1. A method for automatically switching sound reception of a video conference, comprising: detecting volume information of an electronic device; if the volume information is greater than a threshold condition, receiving a video image of a participant of a video conference of the electronic device; determining whether the participant is in a speaking state according to the video image; and if the participant is not in the speaking state, switching off a sound receiving channel of the video conference.

2. The method for automatically switching sound reception of a video conference as described in claim 1, further comprising: if the participant is in the speaking state, switching on the sound receiving channel of the video conference.

3. The method for automatically switching sound reception of a video conference as described in claim 1, wherein in the step of determining whether the participant is in the speaking state according to the video image, whether the participant is in the speaking state is determined by detecting a mouth shape change or a gesture of the participant in the video image.

4. The method for automatically switching sound reception of a video conference as described in claim 3, wherein if the mouth shape change and the gesture of the participant cannot be detected from the video image, a prompt message is displayed on the electronic device to remind the participant to make the gesture.

5. The method for automatically switching sound reception of a video conference as described in claim 3, wherein whether the participant is in the speaking state is determined by using an artificial intelligence algorithm to perform image recognition on the mouth shape change or the gesture of the participant in the video image.

6. An electronic device, comprising: a sound receiving unit for detecting volume information; a processing unit for executing a video conference and receiving the volume information; and a camera unit for obtaining a video image of a participant of the video conference; wherein if the processing unit determines that the volume information is greater than a threshold condition, the processing unit receives the video image and determines whether the participant is in a speaking state according to the video image; and if the processing unit determines that the participant is not in the speaking state, the processing unit switches off a sound receiving channel of the video conference.

7. The electronic device as described in claim 6, wherein if the processing unit determines that the participant is in the speaking state, the processing unit switches on the sound receiving channel of the video conference.

8. The electronic device as described in claim 6, wherein the processing unit determines whether the participant is in the speaking state by detecting a mouth shape change or a gesture of the participant in the video image.

9. The electronic device as described in claim 8, wherein if the processing unit cannot detect the mouth shape change and the gesture of the participant from the video image, the processing unit causes a display unit to display a prompt message to remind the participant to make the gesture.

10. The electronic device as described in claim 8, wherein the processing unit uses an artificial intelligence algorithm to perform image recognition on the mouth shape change or the gesture of the participant in the video image to determine whether the participant is in the speaking state.