US20230017401A1 - Speech Activity Detection Using Dual Sensory Based Learning - Google Patents
Info
- Publication number: US20230017401A1 (U.S. application Ser. No. 17/946,890)
- Authority: US (United States)
- Prior art keywords: audio, user, video, processor, adjusting
- Prior art date: 2020-12-04
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/751—Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/165—Detection; Localisation; Normalisation using facial parts and geometric relationships
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/568—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/568—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
- H04M3/569—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants using the instant speaker's algorithm
Definitions
- Video conferencing enables participants to conduct a meeting from several physical locations at once. Participants are not required to be in the same room to hold a meeting. Participants may join a video conference using readily available communication devices. For example, participants may join the video conference using a laptop or desktop computer, a mobile device, or other smart device. The video conference functions similarly to a traditional telephone conference call with added video of each of the participants.
- For a more complete understanding of this disclosure, reference is made to the following figures, taken in connection with the detailed description, wherein like reference numerals represent like parts.
- FIGS. 1A, 1B, 1C, and 1D are diagrams of a user's head and lip positions relative to a communication device.
- FIG. 2 is a flow diagram of an embodiment of a method for dual sensory input speech detection.
- FIG. 3 is a flow diagram of an embodiment of a method for training a dual sensory input speech detection system.
- FIG. 4 is a diagram of an embodiment of a video conferencing system.
- FIG. 5 is a diagram of an embodiment of a communications device.
- FIG. 6 is a diagram of an embodiment of a system suitable for implementing one or more embodiments disclosed herein.
- Video conference calls with many participants typically produce a large amount of unwanted background noise while participants are not speaking. In some approaches, echo cancellation techniques are employed to remove double-talk echo effects when participants are unmuted.
- In some approaches, noise suppression techniques are applied to clean the speech signal by suppressing noise. Many noise suppression techniques reduce the quality of the audio signal to a 16 kHz (kilohertz) band, thus reducing sound fidelity. Further, background noise suppression is usually a server-centric process, resulting in less privacy and security for the participants in the video conference.
- To reduce the effects of users leaving their microphones unmuted, a dual sensory input speech detection system is disclosed herein that is applied to the video conferencing system.
- Background noise suppression techniques may be applied to detect and extract speech captured by an audio capture device, such as a microphone, and spatial image recognition may be used to detect lip movement of a participant using an image capture device, such as a camera.
- Correlating the speech detected from video with the speech detected from audio may be used to apply a dynamic microphone attenuation filter that mutes or unmutes a participant's microphone, avoiding unwanted background noise when the participant fails to mute their microphone.
- As used herein, muting or unmuting a user's microphone or audio capture device means muting or unmuting the user's audio input to the video conference. The user's microphone or audio capture device may still be listening to the user without transmitting to the video conference, so that the user's speech can be detected before the audio is transmitted to the video conference.
- The dual sensory input speech detection system combines visual cues of lip movement and audio cues of human speech to control a microphone attenuation gate for a duration that is learned through training. Based on a model developed through training, the speech detection algorithm can be enhanced, and the microphone mute/unmute/attenuation states can thus be controlled automatically. The approach may be used to control the microphone based on the participant's speech. Further, by learning the participant's speech, the model can be used to avoid triggering the participant's microphone in response to other people talking near the participant.
- The dual sensory input speech detection system relies on audio and video monitoring.
- The audio may be processed to extract only the speech of the participant, separating it from babble noise or any other non-speech sounds.
- The video may be used to extract the spatial lip geometry of the participant.
- A machine learning model that takes lip-geometry change over time mapped to the extracted speech signal as input may be used to accurately predict that the participant is talking. By combining these two signals (audio and video), the machine learning model may accurately predict the intent of the participant and thus adjust the microphone output gain.
- The gain function of the microphone is tuned using the machine learning model to create the dual sensory input speech detection system.
- In one embodiment, the microphone state is not merely mute or unmute but a ramp function from mute to unmute or unmute to mute, so that a continuous, smooth adjustment of the microphone state (e.g., participant input volume) can be handled dynamically, as in the sketch below.
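- As a minimal illustrative sketch only: the patent describes a trained model, whereas the fusion weights, threshold, and ramp step below are assumed values, and `speaking_prob` is a hypothetical stand-in for that model.

```python
# Sketch of a dual-sensory gain controller with a ramped (not instant)
# mute/unmute transition. All names and constants are assumptions.

def speaking_prob(lip_motion_score, voice_score, w_video=0.5, w_audio=0.5):
    """Stand-in for the learned model: fuse a lip-motion score and a
    voice-activity score, each assumed normalized to [0, 1]."""
    return w_video * lip_motion_score + w_audio * voice_score

class MicGainController:
    """Ramps microphone gain toward mute (0.0) or unmute (1.0)."""

    def __init__(self, threshold=0.5, ramp_step=0.1):
        self.gain = 0.0
        self.threshold = threshold
        self.ramp_step = ramp_step

    def update(self, lip_motion_score, voice_score):
        p = speaking_prob(lip_motion_score, voice_score)
        target = 1.0 if p >= self.threshold else 0.0
        # Move toward the target on a ramp instead of switching instantly.
        if self.gain < target:
            self.gain = min(target, self.gain + self.ramp_step)
        elif self.gain > target:
            self.gain = max(target, self.gain - self.ramp_step)
        return self.gain

# Per audio block: scaled = [s * ctrl.update(lip, voice) for s in block]
```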
- In one embodiment, the dual sensory input speech detection system may be implemented at the source of the audio and video provided to the video conference (e.g., the participant's communication device). Implementing it at the user's device provides increased privacy and security for the participant.
- The user device may be a desktop computer, a mobile device, a browser on a computer, a mobile browser, etc.
- The audio may be processed in real time, i.e., at a 16 kHz audio rate.
- Facial image extraction and processing may be performed at 20 frames per second (fps) on mobile and desktop devices, including browsers. Other frame rates for image extraction and processing may be achieved depending on the hardware and software capabilities of the communication device.
- In another embodiment, the dual sensory input speech detection system may be implemented at a server or other intermediary device between participants in the video conference.
- In yet another embodiment, the dual sensory input speech detection system may be implemented at an endpoint of the video conference (e.g., at a user device at the receiving end of the video conference). In still another embodiment, aspects of the dual sensory input speech detection system may be implemented in a distributed fashion between an intermediary device and user devices.
- The combination of speech and lip/mouth state gives a high degree of certainty in predicting whether or not a participant is speaking.
- However, once mouth movement has been correlated to speech, the time lag due to the prediction needs to be captured and accommodated for smooth speech in the video conference.
- The audio sampling rate dictates the maximum allowable latency of the prediction.
- For example, lip landmark identification may occur at 22 fps, and speech detection may occur within 10 milliseconds (ms).
- Thus, the dual sensory input speech detection system can be set up to activate the participant's microphone based on voice activity detection within 50 ms, based on the lip landmark identification and speech detection.
- Once activated, a silence or pause period can be biased with voice activity detection, along with the dual sensory input speech detection system prediction, to attenuate the microphone in steps before completely muting the participant after prolonged silence.
- In some embodiments, the muted-to-unmuted or unmuted-to-muted step can follow a ramp function based on the intrinsic latency, within 10-20 ms. The combination of voice activity detection and the dual sensory input speech detection output allows for a smooth audio signal and a continuous single-party conversation in a video conference, as in the timing sketch below.
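- A sketch of this timing behavior: the 50 ms activation budget and 10-20 ms ramp figures come from the text above, while the attenuation ladder and the pause interval per step are assumptions.

```python
# Illustrative timing constants and stepwise attenuation during silence.

ACTIVATE_WITHIN_MS = 50      # unmute budget once speech is detected (from text)
RAMP_MS = 20                 # each mute/unmute ramp, within 10-20 ms (from text)
ATTEN_LADDER = [1.0, 0.6, 0.3, 0.0]   # assumed attenuation steps before mute
STEP_EVERY_MS = 500          # assumed pause length per attenuation step

def gain_during_silence(silence_ms):
    """Attenuate in steps as a pause lengthens; mute after prolonged silence."""
    step = min(int(silence_ms) // STEP_EVERY_MS, len(ATTEN_LADDER) - 1)
    return ATTEN_LADDER[step]

assert gain_during_silence(0) == 1.0      # still speaking
assert gain_during_silence(1600) == 0.0   # prolonged silence -> fully muted
```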
- In another implementation, the dual sensory input speech detection system may be applied to a two-factor authentication system.
- The first factor may be facial recognition, while the second factor may be a spoken password or phrase.
- The dual sensory input speech detection system detects the spoken password or phrase and correlates the sound with the movement of the lips to ensure that an actual person is attempting to authenticate using the two-factor authentication system.
- Thus, the dual sensory input speech detection system may be configured to ensure that a recording of a user's voice is not used with a still photograph, as in the liveness sketch below.
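- One hedged way such a liveness check could work is to correlate a per-frame lip-opening series with the speech-energy envelope over the same window; a still photo paired with a voice recording yields near-zero lip motion and hence a low correlation. The threshold is an assumption.

```python
# Sketch: correlate lip motion with the audio envelope to reject replays.

from statistics import mean, pstdev

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def is_live_speaker(lip_openings, audio_energies, threshold=0.4):
    """True when lip motion tracks the audio envelope closely enough."""
    return pearson(lip_openings, audio_energies) >= threshold
```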
- FIGS. 1A-1D represent a variety of head and lip positions relative to a communication device 120.
- FIG. 1A is a diagram of a user 110 facing directly at an image capture device 130 associated with a communication device 120.
- In some embodiments, the image capture device 130, which may be a camera, may be integrated with the communication device 120.
- In other embodiments, the image capture device 130 may be a separate device external to the communication device 120.
- The image capture device 130 may include an audio capture device 135, which may be a microphone.
- In other embodiments, a separate audio capture device 135 may be integrated with the communication device 120 or may be a separate device external to the communication device 120.
- In FIG. 1A, the user 110 may be facing in a direction 140 toward the communication device 120. Direction 140 may be substantially perpendicular to the communication device 120 such that the user 110 is facing the communication device 120.
- FIG. 1B is a diagram of the user 110 facing away from the communication device 120 in direction 150. While direction 150 is approximately forty-five degrees off perpendicular, any direction not facing substantially directly at the communication device 120 may be referred to herein as facing away from the communication device 120.
- FIG. 1C is a diagram of a user 110 with closed lips 160.
- FIG. 1D is a diagram of a user 110 with open lips 170.
- The communication device 120 may be configured to determine the position of a user's lips using the image capture device 130.
- The communication device 120 may be further configured to detect whether a user 110 is facing the image capture device 130.
- Based on a combination of the position of the user's lips and whether the user 110 is facing the image capture device 130, the communication device 120 may adjust the gain of an audio capture device. In this way, the communication device 120 may mute or unmute the user based upon both the lip position and the orientation of the user's face and head.
- FIG. 2 is a flow diagram of an embodiment of a method 200 for dual sensory input speech detection.
- At block 210, the dual sensory input speech detection system may begin monitoring a user, for example, a participant in a video conference.
- The dual sensory input speech detection system includes a video capture device (e.g., image capture device 130) and an audio capture device (e.g., audio capture device 135).
- Optionally, at block 215, the dual sensory input speech detection system may run a training routine to improve identification of characteristics of the user. For example, the user's lip and mouth position, as well as the position of the head relative to the image capture device, may be observed. The observed lip and mouth characteristics may be used in future decision making by the dual sensory input speech detection system.
- The training at block 215 may also include capturing audio of the user speaking via the audio capture device.
- The captured audio of the user may be used in future decision making by the dual sensory input speech detection system. Training of the dual sensory input speech detection system is described in greater detail in conjunction with FIG. 3. In some cases, training for a specific user may not be accomplished, and a generic user profile may be used by the dual sensory input speech detection system for detecting the audio and video characteristics of the user.
- The dual sensory input speech detection system receives audio input and video images of the user.
- The dual sensory input speech detection system may monitor the position of the user's lips and mouth, the position of the user's head relative to the image capture device, and the audio characteristics in the vicinity of the audio capture device.
- The dual sensory input speech detection system then determines whether the user is in a speaking state.
- A variety of conditions may lead to a determination that the user is in a speaking state.
- For example, a speaking state may include the user speaking while facing the video capture device: the user's lips may indicate that the user is talking, and the audio capture device may detect that the user is speaking, leading to a determination that the user is in a speaking state.
- Conversely, the user may be speaking while looking away from the video capture device; in this case, the user may be considered not to be in a speaking state.
- Determining whether the user is in a speaking state may happen in near real time while the user is participating in the video conference.
- Alternatively, determining whether the user is in a speaking state may happen periodically or may be triggered by changes in the video image and/or the audio input. A sketch of the combined test follows.
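- A minimal sketch of the combined speaking-state test, assuming the three predicates are produced by earlier processing stages:

```python
# The user is in a speaking state only when facing the capture device,
# the lip state indicates talking, and the audio matches the user's
# voice. Speaking while looking away counts as not speaking, per the
# example above (the user is likely talking to someone off camera).

def in_speaking_state(facing_camera: bool,
                      lips_indicate_talking: bool,
                      audio_matches_user: bool) -> bool:
    return facing_camera and lips_indicate_talking and audio_matches_user
```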
- The dual sensory input speech detection system may then adjust the audio input based on the received video image and audio input. Adjusting the audio input may include adjusting the attenuation or gain of the received audio input, including muting or unmuting it.
- The dual sensory input speech detection system may determine to adjust the gain or attenuation (e.g., unmute) of the audio capture device based on whether or not the user is in a speaking state. If the user is facing away from the video capture device, the system may adjust the audio capture gain downward to mute the user (e.g., the user is likely talking to someone off camera).
- Similarly, the dual sensory input speech detection system may determine that some other condition exists to change the gain of the audio capture device and, in this case, adjust the audio capture gain downward to mute the user.
- The foregoing examples are meant to illustrate several conditions for changing the gain of the audio capture device, not to be an exhaustive list.
- When unmuting, the dual sensory input speech detection system may gradually increase the gain in a linear manner rather than immediately changing the gain from zero to full gain.
- The audio input may then be communicated to an audio output common to the video conference.
- For example, the audio input received at the user device may be transmitted to each of the participants in the video conference.
- The audio output may be the attenuated audio input, the unaltered audio input, or some other version of the received audio input. If a user is muted by the system, no audio output may be provided to the video conference participants.
- A participant may be provided with options for interacting with the video conference and the dual sensory input speech detection system via a user interface.
- For example, the participant may be provided with options for refining the functionality of the dual sensory input speech detection system, such as characteristics of the audio capture and characteristics of the video capture.
- The participant may also be provided with options for overriding the dual sensory input speech detection system via the user interface.
- FIG. 3 is a flow diagram of an embodiment of a method 300 for training a dual sensory input speech detection system.
- The dual sensory input speech detection system may enter a training mode.
- The dual sensory input speech detection system may then prompt a user to speak at block 320.
- For example, the dual sensory input speech detection system may use a display of a communication device (e.g., communication device 120) to provide text for the user to read.
- The dual sensory input speech detection system may use the audio of the user to train itself.
- In one embodiment, training the dual sensory input speech detection system may be accomplished by storing, comparing, and analyzing audio captured in various training scenarios. Training may enable the dual sensory input speech detection system to differentiate between the user's voice and background noise.
- Background noise may include any combination of audio sources other than the user's voice, for example lawn mowers, heating, ventilation, and air conditioning (HVAC) systems, barking dogs, etc.
- The dual sensory input speech detection system may also use the audio of the user to train itself to differentiate between the user's voice and other voices.
- The dual sensory input speech detection system may store audio characteristics of the user for use in detecting conditions for changing the gain of the audio capture device, for example as sketched below.
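- A minimal enrollment-and-matching sketch: the patent does not specify a feature type, so fixed-length feature vectors with mean pooling and cosine similarity are assumptions here.

```python
# Sketch: store the user's audio characteristics and match live audio.

import math

def build_voice_profile(training_vectors):
    """Average per-utterance feature vectors into a stored profile."""
    n, dim = len(training_vectors), len(training_vectors[0])
    return [sum(v[i] for v in training_vectors) / n for i in range(dim)]

def matches_profile(features, profile, threshold=0.8):
    """Cosine similarity between live features and the stored profile."""
    dot = sum(a * b for a, b in zip(features, profile))
    na = math.sqrt(sum(a * a for a in features))
    nb = math.sqrt(sum(b * b for b in profile))
    return (dot / (na * nb) if na > 0 and nb > 0 else 0.0) >= threshold
```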
- During training, the dual sensory input speech detection system may also capture video characteristics of the user speaking and not speaking.
- The dual sensory input speech detection system may identify lip landmarks of the user that may be used in identifying when the user is speaking and when the user is not speaking. Fiducial points surrounding the lips may be identified and used by the dual sensory input speech detection system to determine the position of the lips.
- The dual sensory input speech detection system may also determine mouth ratios by comparing the width to the height of the mouth in various positions. For example, as the height of the mouth decreases, the mouth ratio increases in value; thus, a high mouth-ratio value may indicate that the user is not speaking or has closed their mouth. Lip and mouth characteristics of the user may be stored by the dual sensory input speech detection system at block 340. A sketch of the ratio computation follows.
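- A minimal sketch of the mouth-ratio computation described above: width divided by height, so the ratio grows as the mouth closes. The landmark arguments and the closed-mouth threshold are illustrative assumptions.

```python
# Mouth ratio from fiducial lip landmarks; high ratio => mouth closed.

def mouth_ratio(left, right, top, bottom, eps=1e-6):
    """Each argument is an (x, y) lip landmark point."""
    width = abs(right[0] - left[0])
    height = abs(bottom[1] - top[1])
    return width / max(height, eps)

def mouth_is_closed(ratio, closed_threshold=4.0):
    return ratio >= closed_threshold  # assumed threshold

# Example: a wide, nearly closed mouth yields a large ratio.
print(mouth_ratio((10, 50), (70, 50), (40, 48), (40, 52)))  # 15.0
```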
- The dual sensory input speech detection system may then prompt the user to look away from the image capture device.
- The dual sensory input speech detection system may collect lip and mouth characteristics of the user in this alternate position and store those values at block 360.
- FIG. 4 is a diagram of an embodiment of a video conferencing system 400.
- The video conferencing system 400 includes a first communication device 410, a second communication device 430, and a host device 420.
- The dual sensory input speech detection system may be implemented locally at the first communication device 410 and the second communication device 430, remotely at the host device 420, or in some combination of remote and local.
- Implementing the dual sensory input speech detection system locally at each of the communication devices provides additional privacy and security for the users of the devices. For example, all processing of audio and video may be handled at the communication device of the user and encrypted prior to leaving the user's communication device.
- In this case, the video conference may be encrypted locally, and the host device 420 handles only encrypted data.
- Conversely, implementing the dual sensory input speech detection system remotely at the host device 420 may reduce the processing load at the communication devices at the expense of the privacy of the users.
- In that case, the video conference would need to be evaluated by the host device 420. This may result in unencrypted transfer of the video conference to the host device 420, where it would be more susceptible to being compromised or intercepted by malevolent third parties.
- The various methods or operations described herein may be implemented by a communications device (e.g., user equipment (UE), network nodes, terminal equipment (TE), etc.).
- An example of a communications device is described below with regard to FIG. 5.
- The communications device 3200 may comprise a two-way wireless communication device having voice and data communication capabilities. In some embodiments, voice communication capabilities are optional.
- The communications device 3200 may have the capability to communicate with other computer systems on the Internet.
- The communications device 3200 may be referred to as a data messaging device, a two-way pager, a wireless e-mail device, a cellular telephone with data messaging capabilities, a wireless internet appliance, a wireless device, a smart phone, a mobile device, or a data communication device, as examples.
- The communications device 3200 may incorporate a communication subsystem 3211, including a receiver 3212 and a transmitter 3214, as well as associated components such as one or more antenna elements 3216 and 3218, local oscillators (LOs) 3213, and a processing module such as a digital signal processor (DSP) 3220.
- The particular design of the communication subsystem 3211 may be dependent upon the communication network 3219 in which the communications device 3200 is intended to operate.
- Network access may also vary depending upon the type of communication network 3219.
- In some networks, network access is associated with a subscriber or user of the communications device 3200.
- The communications device 3200 may use a universal subscriber identity module (USIM) or embedded universal integrated circuit card (eUICC) in order to operate on a network.
- The USIM/eUICC interface 3244 is typically similar to a card slot into which a USIM/eUICC card may be inserted.
- The USIM/eUICC card may have memory and may hold many key configurations 3251 and other information 3253, such as identification and subscriber-related information.
- The communications device 3200 may send and receive communication signals over the communication network 3219.
- The communication network 3219 may comprise multiple base stations communicating with the communications device 3200.
- Signals received by antenna element 3216 through communication network 3219 are input to receiver 3212, which may perform common receiver functions such as signal amplification, frequency down-conversion, filtering, channel selection, and the like.
- Analog-to-digital (A/D) conversion of a received signal allows more complex communication functions, such as demodulation and decoding, to be performed in the DSP 3220.
- Similarly, signals to be transmitted are processed, including modulation and encoding, for example, by DSP 3220 and are input to transmitter 3214 for digital-to-analog (D/A) conversion, frequency up-conversion, filtering, amplification, and transmission over the communication network 3219 via antenna element 3218.
- DSP 3220 not only processes communication signals but also provides for receiver and transmitter control. For example, the gains applied to communication signals in receiver 3212 and transmitter 3214 may be adaptively controlled through automatic gain control algorithms implemented in DSP 3220.
- The communications device 3200 generally includes a processor 3238, which controls the overall operation of the device. Communication functions, including data and voice communications, are performed through communication subsystem 3211 in cooperation with the processor 3238. Processor 3238 also interacts with further device subsystems such as the display 3222, flash memory 3224, random access memory (RAM) 3226, auxiliary input/output (I/O) subsystems 3228, serial port 3230, one or more user interfaces such as keyboards or keypads 3232, speaker 3234, microphone 3236, one or more other communication subsystems 3240 such as a short-range communications subsystem, and any other device subsystems generally designated as 3242.
- Serial port 3230 may include a universal serial bus (USB) port or other port currently known or developed in the future.
- Some of the illustrated subsystems perform communication-related functions, whereas other subsystems may provide “resident” or on-device functions.
- Some subsystems, such as keyboard 3232 and display 3222, for example, may be used for both communication-related functions, such as entering a text message for transmission over a communication network, and device-resident functions, such as a calculator or task list.
- Operating system software used by the processor 3238 may be stored in a persistent store such as flash memory 3224 , which may instead be a read-only memory (ROM) or similar storage element (not shown).
- The operating system, specific device applications, or parts thereof may be temporarily loaded into a volatile memory such as RAM 3226.
- Received communication signals may also be stored in RAM 3226.
- Flash memory 3224 may be constituted by different areas for both computer programs 3258 and program data storage 3250, 3252, 3254, and 3256. These different storage types indicate that each program may allocate a portion of flash memory 3224 for its own data storage use.
- Processor 3238, in addition to its operating system functions, may enable execution of software applications on the communications device 3200.
- A predetermined set of applications that control basic operations, including at least data and voice communication applications, may typically be installed on the communications device 3200 during manufacturing. Other applications may be installed subsequently or dynamically.
- The computer-readable storage medium may be tangible or in a transitory/non-transitory medium such as optical media (e.g., compact disc (CD), digital versatile disc (DVD), etc.), magnetic media (e.g., tape), or other memory currently known or developed in the future.
- Software applications may be loaded onto the communications device 3200 through the communication network 3219 , an auxiliary I/O subsystem 3228 , serial port 3230 , other short-range communications subsystem(s) 3240 , or any other suitable device subsystem(s) 3242 , and installed by a user in the RAM 3226 or a non-volatile store (not shown) for execution by the processor 3238 .
- Such flexibility in application installation may increase the functionality of the communications device 3200 and may provide enhanced on-device functions, communication-related functions, or both.
- Secure communication applications may enable electronic commerce functions and other such financial transactions to be performed using the communications device 3200.
- A received signal such as a text message or web page download may be processed by the communication subsystem 3211 and input to the processor 3238, which may further process the received signal for output to the display 3222 or, alternatively, to an auxiliary I/O device 3228.
- For voice communications, overall operation of the communications device 3200 is similar, except that received signals may typically be output to a speaker 3234 and signals for transmission may be generated by a microphone 3236.
- Alternative voice or audio I/O subsystems such as a voice message recording subsystem, may also be implemented on the communications device 3200 .
- Although voice or audio signal output may be accomplished primarily through the speaker 3234, the display 3222 may also be used to provide an indication of the identity of a calling party, the duration of a voice call, or other voice-call-related information, for example.
- Other device subsystems 3242 may include an image capture device, e.g., camera, for use in video conferencing or other image capture applications.
- The image capture device may be used in conjunction with the microphone 3236 for a video conference session.
- Serial port 3230 may be implemented in a personal digital assistant (PDA)-type device for which synchronization with a user's desktop computer (not shown) may be desirable, but such a port is an optional device component.
- Such a serial port 3230 may enable a user to set preferences through an external device or software application and may extend the capabilities of the communications device 3200 by providing for information or software downloads to the communications device 3200 other than through a wireless communication network 3219 .
- The alternate download path may, for example, be used to load an encryption key onto the communications device 3200 through a direct and thus reliable and trusted connection to thereby enable secure device communication.
- Serial port 3230 may further be used to connect the device to a computer to act as a modem.
- Other communications subsystems 3240 are further optional components which may provide for communication between the communications device 3200 and different systems or devices, which need not necessarily be similar devices.
- One or more other communications subsystems 3240 may include an infrared device and associated circuits and components or a Bluetooth™ communication module to provide for communication with similarly enabled systems and devices.
- Other communications subsystems 3240 may further include non-cellular communications such as Wi-Fi, WiMAX, near field communication (NFC), Bluetooth, ProSe (Proximity Services) (e.g., sidelink, PC5, D2D, etc.), and/or radio frequency identification (RFID).
- The other communications subsystem(s) 3240 and/or other device subsystem(s) 3242 may also be used to communicate with auxiliary devices such as tablet displays, keyboards, or projectors.
- The communications device 3200 and other components described above might include a processing component that is capable of executing instructions related to the actions described above.
- FIG. 6 illustrates an example of a system 3300 that includes a processing component 3310 suitable for implementing one or more embodiments disclosed herein.
- The system 3300 might include network connectivity devices 3320, random access memory (RAM) 3330, read-only memory (ROM) 3340, secondary storage 3350, and input/output (I/O) devices 3360.
- These components might communicate with one another via a bus 3370 . In some cases, some of these components may not be present or may be combined in various combinations with one another or with other components not shown. These components might be located in a single physical entity or in more than one physical entity.
- Any actions described herein as being taken by the processor 3310 might be taken by the processor 3310 alone or by the processor 3310 in conjunction with one or more components shown or not shown in the drawing, such as a digital signal processor (DSP) 3380.
- Although the DSP 3380 is shown as a separate component, the DSP 3380 might be incorporated into the processor 3310.
- The processor 3310 executes instructions, codes, computer programs, or scripts that it might access from the network connectivity devices 3320, RAM 3330, ROM 3340, or secondary storage 3350 (which might include various disk-based systems such as hard disk, floppy disk, or optical disk). While only one CPU 3310 is shown, multiple processors may be present. Thus, while instructions may be discussed as being executed by a processor, the instructions may be executed simultaneously, serially, or otherwise by one or multiple processors.
- The processor 3310 may be implemented as one or more CPU chips.
- The network connectivity devices 3320 may take the form of modems, modem banks, ethernet devices, universal serial bus (USB) interface devices, serial interfaces, token ring devices, wireless local area network (WLAN) devices, radio transceiver devices such as code division multiple access (CDMA) devices, Global System for Mobile communication (GSM) radio transceiver devices, universal mobile telecommunications system (UMTS) radio transceiver devices, long term evolution (LTE) radio transceiver devices, new generation radio transceiver devices, worldwide interoperability for microwave access (WiMAX) devices, and/or other well-known devices for connecting to networks.
- These network connectivity devices 3320 may enable the processor 3310 to communicate with the internet or one or more telecommunications networks or other networks from which the processor 3310 might receive information or to which the processor 3310 might output information.
- The network connectivity devices 3320 might also include one or more transceiver components 3325 capable of transmitting and/or receiving data wirelessly.
- The RAM 3330 might be used to store volatile data and perhaps to store instructions that are executed by the processor 3310.
- The ROM 3340 is a non-volatile memory device that typically has a smaller memory capacity than the memory capacity of the secondary storage 3350. ROM 3340 might be used to store instructions and perhaps data that are read during execution of the instructions. Access to both RAM 3330 and ROM 3340 is typically faster than to secondary storage 3350.
- The secondary storage 3350 typically comprises one or more disk drives or tape drives and might be used for non-volatile storage of data or as an overflow data storage device if RAM 3330 is not large enough to hold all working data. Secondary storage 3350 may be used to store programs that are loaded into RAM 3330 when such programs are selected for execution.
- The I/O devices 3360 may include liquid crystal displays (LCDs), touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, printers, video monitors, audio capture devices, video capture devices, cameras, microphones, or other well-known input/output devices.
- The transceiver component 3325 might be considered to be a component of the I/O devices 3360 instead of, or in addition to, being a component of the network connectivity devices 3320.
- In a first embodiment, a dual sensory input speech detection method in a video conference comprises: receiving, at a first time, a first video image input of a conference participant of the video conference and a first audio input of the conference participant; communicating the first video image input to the video conference; identifying the first video image input as a first facial image of the conference participant; determining, based on the first facial image, that the first video image input indicates the conference participant is in a speaking state; identifying the first audio input as a first speech sound; determining, while in the speaking state, that the first speech sound originates from the conference participant; and communicating the first audio input to an audio output for the video conference.
- A first modification of the first embodiment includes that the audio output is transmitted to other conference participants.
- A second modification of the first embodiment includes that the first facial image comprises a lip image of the conference participant.
- A third modification of the first embodiment includes receiving, at a second time, a second video image input of the conference participant and a second audio input of the conference participant; communicating the second video image input of the conference participant to the video conference; identifying the second video image input as a second facial image of the conference participant; determining, based on the second facial image, that the second video image input does not indicate the conference participant is in the speaking state; determining the conference participant is not in the speaking state; and adjusting the second audio input in response to determining that the conference participant is not in the speaking state.
- A fourth modification of the first embodiment includes, prior to determining the conference participant is not in the speaking state, identifying the second audio input as not a speech sound of the conference participant.
- A fifth modification of the first embodiment includes that adjusting the second audio input comprises not communicating the second audio input to the audio output of the video conference.
- A sixth modification of the first embodiment includes that adjusting the second audio input comprises attenuating the second audio input and communicating an attenuated second audio input to the audio output of the video conference.
- A seventh modification of the first embodiment includes that the method is performed by a dual sensory input speech detection system, and wherein the method further comprises training the dual sensory input speech detection system.
- An eighth modification of the first embodiment includes that training the dual sensory input speech detection system comprises one or more of identifying lip landmarks of the conference participant or calculating a mouth ratio of the conference participant.
- A ninth modification of the first embodiment includes that training the dual sensory input speech detection system further comprises identifying speech characteristics of the conference participant, wherein identifying the first audio input as the first speech sound comprises comparing the first audio input to the speech characteristics of the conference participant, and wherein determining the conference participant is in the speaking state comprises comparing one or more of the lip landmarks or the mouth ratio with the first video image input.
- A tenth modification of the first embodiment includes receiving, at the first time, a third video image input of a second conference participant and a third audio input of the second conference participant; communicating the third video image input of the second conference participant to the video conference; identifying the third video image input as a third facial image of the second conference participant; determining, based on the third facial image, that the third video image input does not indicate the second conference participant is in the speaking state; determining the second conference participant is not in the speaking state; and adjusting the third audio input in response to determining that the second conference participant is not in the speaking state.
- In a second embodiment, a communication device comprises a memory storing instructions and a processor coupled to the memory and configured to execute the instructions to cause the communication device to: receive, at a first time, a first video image input of a conference participant of a video conference and a first audio input of the conference participant; communicate the first video image input to the video conference; identify the first video image input as a first facial image of the conference participant; determine, based on the first facial image, that the first video image input indicates the conference participant is in a speaking state; identify the first audio input as a first speech sound; determine, while in the speaking state, that the first speech sound originates from the conference participant; and communicate the first audio input to an audio output for the video conference.
- A first modification of the second embodiment includes that the audio output is transmitted to other conference participants.
- A second modification of the second embodiment includes that the first facial image comprises a lip image of the conference participant.
- A third modification of the second embodiment includes that the instructions further cause the communication device to: receive, at a second time, a second video image input of the conference participant and a second audio input of the conference participant; communicate the second video image input of the conference participant to the video conference; identify the second video image input as a second facial image of the conference participant; determine, based on the second facial image, that the second video image input does not indicate the conference participant is in the speaking state; determine the conference participant is not in the speaking state; and adjust the second audio input in response to determining that the conference participant is not in the speaking state.
- A fourth modification of the second embodiment includes that the instructions further cause the communication device to identify the second audio input as not a speech sound of the conference participant.
- A fifth modification of the second embodiment includes that the instructions further cause the communication device to not communicate the second audio input to the audio output of the video conference.
- A sixth modification of the second embodiment includes that the instructions further cause the communication device to attenuate the second audio input and communicate an attenuated second audio input to the audio output of the video conference.
- A seventh modification of the second embodiment includes a dual sensory input speech detection system, wherein the instructions further cause the communication device to train the dual sensory input speech detection system.
- An eighth modification of the second embodiment includes that the instructions further cause the communication device to identify lip landmarks of the conference participant or calculate a mouth ratio of the conference participant.
- A ninth modification of the second embodiment includes that the instructions further cause the communication device to: identify speech characteristics of the conference participant; compare the first audio input to the speech characteristics of the conference participant; and compare one or more of the lip landmarks or the mouth ratio with the first video image input.
Description
- This application is a continuation of U.S. patent application Ser. No. 17/112,637, filed Dec. 4, 2020 by Shiladitya Sircar and entitled "Speech Activity Detection Using Dual Sensory Based Learning," which is incorporated by reference herein as if reproduced in its entirety.
- Video conferencing enables participants to conduct a meeting from several physical locations at once. Participants are not required to be in the same room to hold a meeting. Participants may join a video conference using readily available communication devices. For example, participants may join the video conference using a laptop or desktop computer, a mobile device, or other smart device. The video conference functions similarly to a traditional telephone conference call with added video of each of the participants.
- For a more complete understanding of this disclosure, reference is now made to the following brief description, taken in connection with the accompanying drawings and detailed description, wherein like reference numerals represent like parts.
-
FIGS. 1A, 1B, 1C, and 1D are diagrams of a user's head and lip positions relative to a communication device. -
FIG. 2 is a flow diagram of an embodiment of a method for dual sensory input speech detection. -
FIG. 3 is a flow diagram of an embodiment of a method for training a dual sensory input speech detection system. -
FIG. 4 is diagram of an embodiment of a video conferencing system. -
FIG. 5 is a diagram of an embodiment of a communications device. -
FIG. 6 is a diagram of an embodiment of a system suitable for implementing one or more embodiments disclosed herein. - Video conference calls with many participants typically result in a large amount of unwanted background noise although participants are not speaking. In some approaches, echo cancellation techniques are employed to remove double talk echo effects when participants are unmuted. In some approaches, noise suppression techniques are applied to clean the speech signal by suppressing noise from speech. Many noise suppression techniques lower the quality of the audio signal to 16 kHz (kilohertz) band thus reducing sound fidelity. Further, background noise suppression is usually a server centric process, resulting in less privacy and security for the participants in the video conference.
- To reduce the effects of users leaving their microphones unmuted, a dual sensory input speech detection system is disclosed herein that is applied to the video conferencing system. Background noise suppression techniques may be applied to detect and extract speech captured by an audio capture device, such as a microphone, and spatial image recognition may be used to detect lip movement of a participant using an image capture device, such as a camera. Correlating the speech from video and speech from audio may be used to apply a dynamic microphone attenuation filter to mute or unmute a participant's microphone to avoid unwanted background noise when the participant fails to mute their microphone. As used herein, muting or unmuting a user's microphone or audio capture device describes muting or unmuting the user's audio input to the video conference. The user's microphone or audio capture device may be listening to the user, but not transmitting to the video conference in order to detect the user's speech prior to transmitting the audio to the video conference.
- The dual sensory input speech detection system combines visual cues for lip movement and audio cues of human speech to control a microphone attenuation gate for a duration that is learnt through training. Based on a model developed through training, the speech detection algorithm can be enhanced and thus microphone mute/unmute/attenuation states can be controlled automatically. The approach may be used to control the microphone based on the participant's speech. Further, based on learning the participants speech, the model can be used to avoid triggering the participant's microphone in response to other people talking near the participant.
- The dual sensory input speech detection system relies on audio and video monitoring. The audio may be processed to extract only the speech of the participant. The participant's speech may be extracted from babble noise or any other non-speech sounds. The video may be utilized to extract spatial lip geometry of the participant. A machine learning model with an input of lip geometry change over time mapped to the extracted speech signal may be utilized to accurately predict that the participant is talking. By combining these two signals (audio and video) the machine learning model may accurately predict the intent of the participant and thus adjust the microphone output gain. The gain function of the microphone is tuned using the machine learning model to create the dual sensory input speech detection system. In one embodiment, the microphone state is not merely mute or unmute, but a ramp function from mute to unmute or unmute to mute. Thus, a continuous smooth adjustment of microphone state (e.g., participant input volume) can be handled dynamically.
- The dual sensory input speech detection system, in one embodiment, may be implemented at source of audio and video provided to the video conference (e.g., the participant's communication device). Implementing at the user's device provides increased privacy and security for the participant. The user device may be a desktop computer, mobile device, browser on a computer, mobile browser, etc. The audio may be processed in real time i.e., at 16 khz audio rate. The facial image extraction and processing may be performed at 20 frames per second (fps) on mobile and desktop devices including browsers. Other frame rates for image extraction and processing may be achieved based upon hardware and software capabilities of the communication device. In another embodiment, the dual sensory input speech detection system may be implemented at a server or other intermediary device between participants in the video conference. In yet another embodiment, the dual sensory input speech detection system may be implemented at an endpoint of the video conference (e.g., at a user device at the receiving end of the video conference). In still another embodiment, aspects of the dual sensory input speech detection system may be implemented in a distributed fashion between an intermediary device and user devices.
- The combination of speech and lip/mouth state gives a high degree of certainty in predicting whether or not a participant is speaking. However, from the time mouth movement to speech is correlated, the time lag due to the prediction needs to be captured and accommodated for smooth speech in the video conference. The audio sampling rate dictates the allowable maximum latency of the prediction. For example, lip landmark identification may occur at 22 fps and speech detection may occur within 10 milliseconds (ms). Thus, the dual sensory input speech detection system can be setup to activate the participants microphone based on voice activity detection within 50 ms based on the lip landmark identification and speech detection. Once activated, a silence or pause period can be biased with voice activity detection along with dual sensory input speech detection system prediction to attenuate the microphone in steps before completely muting the participant for prolonged silence. In some embodiments, the microphone muted to unmuted or unmuted to muted step can be on a ramp function based on the intrinsic latency within 10-20 ms. The combination of voice activity detection along with dual sensory input speech detection output allows for a smooth audio signal and continuous single party conversation in a video conference.
- In another implementation, the dual sensory input speech detection system may be applied to a two-factor authentication system. The first factor may be facial recognition while the second factor may be a spoken password or phrase. The dual sensory input speech detection system detects the spoken password or phrase and correlates the sound to the movement of the lips to ensure that an actual person is attempting to authenticate using the two-factor authentication system. Thus, the dual sensory input speech detection system may be configured to ensure a recording of a user's voice is not used with a still photograph.
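- A liveness check of this kind could, for instance, correlate the lip-opening trajectory with the audio energy envelope over the utterance; a played-back recording against a still photograph produces audio energy with no matching lip motion, so the correlation stays near zero. The sketch below is illustrative only, and assumes both series have been resampled to a common frame rate.

```python
import numpy as np

def liveness_score(mouth_openness, audio_envelope):
    """Correlation between lip opening and audio energy per frame."""
    mouth = np.asarray(mouth_openness, dtype=float)
    audio = np.asarray(audio_envelope, dtype=float)
    n = min(len(mouth), len(audio))     # align the two series
    mouth, audio = mouth[:n], audio[:n]
    if mouth.std() == 0.0 or audio.std() == 0.0:
        return 0.0                      # no lip motion (still photo) or no audio
    return float(np.corrcoef(mouth, audio)[0, 1])
```

A threshold on this score (for example, accepting only scores above 0.5) would be tuned empirically; the value is an assumption, not part of the disclosure.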
- FIGS. 1A-1D represent a variety of head and lip positions relative to a communication device 120. FIG. 1A is a diagram of a user 110 facing directly at an image capture device 130 associated with a communication device 120. In some embodiments, image capture device 130, which may be a camera, may be integrated with the communication device 120. In other embodiments, the image capture device 130 may be a separate external device to communication device 120. The image capture device 130 may include an audio capture device 135, which may be a microphone. In other embodiments, a separate audio capture device 135 may be integrated with the communication device 120 or may be a separate external device to communication device 120. In FIG. 1A, user 110 may be facing in a direction 140 facing at the communication device 120. Direction 140 may be substantially perpendicular to communication device 120 such that user 110 is facing at communication device 120. FIG. 1B is a diagram of the user 110 facing away from communication device 120 in direction 150. While direction 150 is approximately forty-five degrees off perpendicular, any direction not facing substantially directly at the communication device 120 may be referred to herein as facing away from the communication device 120. FIG. 1C is a diagram of a user 110 with closed lips 160. FIG. 1D is a diagram of a user 110 with open lips 170. The communication device 120 may be configured to determine the position of a user's lips using image capture device 130. The communication device 120 may be further configured to detect whether a user 110 is facing at the image capture device 130. Based on a combination of the position of the user's lips and whether the user 110 is facing at the image capture device 130, the communication device 120 may adjust the gain of an audio capture device. In this way, the communication device 120 may mute or unmute the user based upon both the lip position and the orientation of the user's face and head.
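- A simple heuristic for the facing-direction test in FIGS. 1A and 1B, assuming 2-D facial landmarks from a face tracker, is sketched below. The landmark names and the asymmetry threshold are editorial assumptions.

```python
def is_facing_camera(left_eye_x, right_eye_x, nose_x, max_asymmetry=0.25):
    """Rough yaw check from landmark symmetry.

    When the head turns away, the nose projects toward one eye, so the
    normalized offset of the nose from the eye midpoint grows.
    """
    eye_span = right_eye_x - left_eye_x
    if eye_span <= 0:
        return False                    # degenerate or mirrored landmarks
    midpoint = (left_eye_x + right_eye_x) / 2.0
    asymmetry = abs(nose_x - midpoint) / eye_span
    return asymmetry <= max_asymmetry
```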
- FIG. 2 is a flow diagram of an embodiment of a method 200 for dual sensory input speech detection. At block 210, the dual sensory input speech detection system may begin monitoring a user, for example, a participant in a video conference. The dual sensory input speech detection system includes a video capture device (e.g., image capture device 130) and an audio capture device (e.g., audio capture device 135). Optionally, at block 215, the dual sensory input speech detection system may include a training routine to improve identification of characteristics of the user. For example, the user's lip and mouth position, as well as the position of the head relative to the image capture device, may be observed. The observed lip and mouth characteristics may be used in future decision making by the dual sensory input speech detection system. As another example, the training at block 215 may also include capturing audio of the user speaking by the audio capture device. The captured audio of the user may be used in future decision making by the dual sensory input speech detection system. Training of the dual sensory input speech detection system is described in greater detail in conjunction with FIG. 3. In some cases, training for a specific user may not be accomplished, and a generic user profile may be used by the dual sensory input speech detection system for detecting audio and video characteristics of the user.
- At block 220, the dual sensory input speech detection system receives audio input and video images of the user. The dual sensory input speech detection system may monitor the position of the user's lips and mouth, the position of the user's head relative to the image capture device, and the audio characteristics in the vicinity of the audio capture device.
- At block 230, the dual sensory input speech detection system determines whether the user is in a speaking state. A variety of conditions may lead to a determination that the user is in a speaking state. A speaking state may include the user speaking while facing the video capture device. The user's lips may indicate that the user is talking, and the audio capture device may detect that the user is speaking, leading to a determination that the user is in a speaking state. In another example, the user may be speaking while looking away from the video capture device; in this case, the user may be considered not in a speaking state. In an embodiment, determining whether the user is in a speaking state may happen in near real time while the user is participating in the video conference. In another embodiment, determining whether the user is in a speaking state may happen periodically or may be triggered based upon changes in the video image and/or audio input.
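- The determination at block 230 can be summarized as a conjunction of the three cues, as in the following sketch (the function and parameter names are illustrative):

```python
def in_speaking_state(facing_camera, lips_moving, voice_detected):
    """Combine the cues described for block 230.

    Speaking while looking away is deliberately treated as not being in
    a speaking state (e.g., the user is talking to someone off camera).
    """
    return facing_camera and lips_moving and voice_detected
```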
- At block 240, the dual sensory input speech detection system may adjust the audio input based on the received video image and the audio input. Adjusting the audio input may include adjusting the attenuation or gain of the received audio input. Adjusting the audio input may include muting or unmuting the audio input. The dual sensory input speech detection system may determine to adjust the gain or attenuation (e.g., unmute) of the audio capture device based on whether or not the user is in a speaking state. If the user is facing away from the video capture device, the dual sensory input speech detection system may adjust the audio capture gain downward to mute the user (e.g., the user is likely talking to someone off camera). As yet another example, if the user's audio capture device is unmuted, the dual sensory input speech detection system detects that the user's lips are closed, and the audio capture device detects that the user is not speaking, the dual sensory input speech detection system may determine that a condition exists to change the gain of the audio capture device. In this case, the dual sensory input speech detection system may adjust the audio capture gain downward to mute the user. The foregoing examples are meant to be illustrative of several conditions for changing the gain of the audio capture device, but are not meant to be an exhaustive list. The dual sensory input speech detection system may gradually increase the gain in a linear manner rather than immediately changing the gain from zero to full gain.
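- The gradual, linear gain change could be applied per audio frame as sketched below; ramping sample-by-sample avoids the audible click of an instant jump from zero to full gain. The frame-based formulation is an editorial assumption.

```python
import numpy as np

def apply_gain_ramp(frame, start_gain, target_gain):
    """Apply a linear gain ramp across one audio frame."""
    frame = np.asarray(frame, dtype=float)
    ramp = np.linspace(start_gain, target_gain, num=len(frame))
    return frame * ramp
```

For example, unmuting over one 160-sample frame (10 ms at 16 kHz) would call apply_gain_ramp(frame, 0.0, 1.0).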
- At block 250, the audio input may be communicated to an audio output common to the video conference. For example, the audio input received at the user device may be transmitted to each of the participants in the video conference. The audio output may be the attenuated audio input, an unaltered audio input, or some other version of the received audio input. If a user is muted by the system, no audio output may be provided to the video conference participants.
- While the examples provided with the description of FIG. 2 are directed to a video conference application, other applications of dual sensory input speech detection, such as two-factor authentication using images and speech detection, may utilize a similar method for performing their associated tasks. In a video conference application, a participant may be provided with options for interacting with the video conference and the dual sensory input speech detection system via a user interface. The participant may be provided with options for refining the functionality of the dual sensory input speech detection system, for example, characteristics of the audio capture and characteristics of the video capture. The participant may also be provided with options for overriding the dual sensory input speech detection system via the user interface.
- FIG. 3 is a flow diagram of an embodiment of a method 300 for training a dual sensory input speech detection system. At block 310, the dual sensory input speech detection system may enter a training mode. The dual sensory input speech detection system may prompt a user to speak at block 320. The dual sensory input speech detection system may use a display of a communication device (e.g., communication device 120) to provide text for the user to read. The dual sensory input speech detection system may use the audio of the user to train itself. For example, training the dual sensory input speech detection system may be accomplished in one embodiment by storing, comparing, and analyzing audio captured in various training scenarios. Training may enable the dual sensory input speech detection system to differentiate between the user's voice and background noise. Background noise may include any combination of audio sources other than the user's voice, for example, lawn mowers, heating, ventilation and air conditioning (HVAC) systems, barking dogs, etc. The dual sensory input speech detection system may also use the audio of the user to train itself to differentiate between the user's voice and other voices. At block 330, the dual sensory input speech detection system may store audio characteristics of the user for use in detecting conditions for changing the gain of the audio capture device.
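- One illustrative way to store such audio characteristics is an average spectral profile of the enrollment audio, against which later input can be compared; the framing parameters and the idea of a spectral-distance comparison are editorial assumptions rather than details of the disclosure.

```python
import numpy as np

def spectral_profile(samples, frame=512):
    """Average magnitude spectrum of an enrollment clip.

    Stored per user during training; at run time, incoming audio whose
    spectrum is far from this profile could be treated as background
    noise or another voice.
    """
    samples = np.asarray(samples, dtype=float)
    n_frames = len(samples) // frame
    if n_frames == 0:
        raise ValueError("enrollment clip too short")
    frames = samples[:n_frames * frame].reshape(n_frames, frame)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
```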
- While the dual sensory input speech detection system is capturing audio characteristics, the dual sensory input speech detection system may also be capturing video characteristics of the user speaking and not speaking. The dual sensory input speech detection system may identify lip landmarks of the user that may be used in identifying when the user is speaking and when the user is not speaking. Fiducial points may be identified surrounding the lips that may be used by the dual sensory input speech detection system to identify the position of the lips. The dual sensory input speech detection system may also determine mouth ratios by comparing the width to the height of the mouth in various positions. For example, as the height of the mouth decreases, the mouth ratio will increase in value. Thus, a high value of the mouth ratio may indicate that the user is not speaking or has closed their mouth. Lip and mouth characteristics of the user may be stored by the dual sensory input speech detection system at block 340.
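- A mouth ratio of the kind described above could be computed from four lip fiducial points as follows; the landmark names are illustrative placeholders for whatever points the face tracker provides.

```python
def mouth_ratio(lip_landmarks):
    """Width-to-height ratio of the mouth from lip fiducial points.

    Closed lips give a small height and therefore a high ratio,
    matching the convention described above.
    """
    left = lip_landmarks["left_corner"]
    right = lip_landmarks["right_corner"]
    top = lip_landmarks["top_center"]
    bottom = lip_landmarks["bottom_center"]
    width = abs(right[0] - left[0])
    height = max(abs(bottom[1] - top[1]), 1e-6)  # avoid divide-by-zero
    return width / height
```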
- At block 350, the dual sensory input speech detection system may prompt the user to look away from the image capture device. The dual sensory input speech detection system may collect lip and mouth characteristics of the user in this alternate position and store those values at block 360.
- FIG. 4 is a diagram of an embodiment of a video conferencing system 400. The video conferencing system 400 includes a first communication device 410, a second communication device 430, and a host device 420. The dual sensory input speech detection system may be implemented locally at the first communication device 410 and the second communication device 430, or remotely at the host device 420, or at some combination of remote and local. Implementing the dual sensory input speech detection system locally at each of the communication devices provides additional privacy and security for the users of the devices. For example, all processing of audio and video may be handled at the communication device of the user and encrypted prior to leaving the user's communication device. In this case, monitoring of the audio and video of the user occurs locally, and there is no need for the host device 420 to evaluate the audio and video of the users. Thus, the video conference may be encrypted locally, and the host device 420 handles only encrypted data. In other embodiments, implementing the dual sensory input speech detection system remotely at the host device 420 may reduce the processing load at the communication devices at the expense of the privacy of the users. In this case, the video conference would need to be evaluated by the host device 420. This may result in unencrypted transfer of the video conference to the host device 420, where it would be more susceptible to being compromised or intercepted by malevolent third parties.
- The various methods or operations described herein may be implemented by a communications device (e.g., user equipment (UE), network nodes, terminal equipment (TE), etc.). An example of a communications device is described below with regard to FIG. 5.
The communications device 3200 may comprise a two-way wireless communication device having voice and data communication capabilities. In some embodiments, voice communication capabilities are optional. The communications device 3200 may have the capability to communicate with other computer systems on the Internet. Depending on the exact functionality provided, the communications device 3200 may be referred to as a data messaging device, a two-way pager, a wireless e-mail device, a cellular telephone with data messaging capabilities, a wireless internet appliance, a wireless device, a smart phone, a mobile device, or a data communication device, as examples.
- Where the communications device 3200 is enabled for two-way communication, it may incorporate a communication subsystem 3211, including a receiver 3212 and a transmitter 3214, as well as associated components such as one or more antenna elements 3216 and 3218. The particular design of the communication subsystem 3211 may be dependent upon the communication network 3219 in which the communications device 3200 is intended to operate.
- Network access may also vary depending upon the type of communication network 3219. In some networks, network access is associated with a subscriber or user of the communications device 3200. The communications device 3200 may use a universal subscriber identity module (USIM) or embedded universal integrated circuit card (eUICC) in order to operate on a network. The USIM/eUICC interface 3244 is typically similar to a card slot into which a USIM/eUICC card may be inserted. The USIM/eUICC card may have memory and may hold many key configurations 3251 and other information 3253, such as identification and subscriber-related information.
- When network registration or activation procedures have been completed, the communications device 3200 may send and receive communication signals over the communication network 3219. As illustrated, the communication network 3219 may comprise multiple base stations communicating with the communications device 3200.
- Signals received by antenna element 3216 through communication network 3219 are input to receiver 3212, which may perform common receiver functions such as signal amplification, frequency down conversion, filtering, channel selection, and the like. Analog-to-digital (A/D) conversion of a received signal allows more complex communication functions, such as demodulation and decoding, to be performed in the DSP 3220. In a similar manner, signals to be transmitted are processed, including modulation and encoding, for example, by DSP 3220 and are input to transmitter 3214 for digital-to-analog (D/A) conversion, frequency up conversion, filtering, amplification, and transmission over the communication network 3219 via antenna element 3218. DSP 3220 not only processes communication signals but also provides for receiver and transmitter control. For example, the gains applied to communication signals in receiver 3212 and transmitter 3214 may be adaptively controlled through automatic gain control algorithms implemented in DSP 3220.
- The communications device 3200 generally includes a processor 3238 which controls the overall operation of the device. Communication functions, including data and voice communications, are performed through communication subsystem 3211 in cooperation with the processor 3238. Processor 3238 also interacts with further device subsystems such as the display 3222, flash memory 3224, random access memory (RAM) 3226, auxiliary input/output (I/O) subsystems 3228, serial port 3230, one or more user interfaces such as keyboards or keypads 3232, speaker 3234, microphone 3236, one or more other communication subsystems 3240 such as a short-range communications subsystem, and any other device subsystems generally designated as 3242. While the other communication subsystems 3240 and other device subsystems 3242 are depicted as separate components in FIG. 5, it is to be understood that other communication subsystems 3240 and other device subsystems 3242 (or parts thereof) may be integrated as a single component. Serial port 3230 may include a universal serial bus (USB) port or other port currently known or developed in the future.
- Some of the illustrated subsystems perform communication-related functions, whereas other subsystems may provide “resident” or on-device functions. Notably, some subsystems, such as keyboard 3232 and display 3222, for example, may be used for both communication-related functions, such as entering a text message for transmission over a communication network, and device-resident functions, such as a calculator or task list.
- Operating system software used by the processor 3238 may be stored in a persistent store such as flash memory 3224, which may instead be a read-only memory (ROM) or similar storage element (not shown). The operating system, specific device applications, or parts thereof may be temporarily loaded into a volatile memory such as RAM 3226. Received communication signals may also be stored in RAM 3226.
- As shown, flash memory 3224 may be constituted by different areas for both computer programs 3258 and program data storage 3250, 3252, 3254, and 3256. These different storage types indicate that each program may allocate a portion of flash memory 3224 for its own data storage use. Processor 3238, in addition to its operating system functions, may enable execution of software applications on the communications device 3200. A predetermined set of applications that control basic operations, including at least data and voice communication applications, for example, may typically be installed on the communications device 3200 during manufacturing. Other applications may be installed subsequently or dynamically.
- Applications and software may be stored on any computer-readable storage medium. The computer-readable storage medium may be tangible or in a transitory/non-transitory medium such as optical (e.g., compact disc (CD), digital versatile disc (DVD), etc.), magnetic (e.g., tape), or other memory currently known or developed in the future.
- Software applications may be loaded onto the communications device 3200 through the communication network 3219, an auxiliary I/O subsystem 3228, serial port 3230, other short-range communications subsystem(s) 3240, or any other suitable device subsystem(s) 3242, and installed by a user in the RAM 3226 or a non-volatile store (not shown) for execution by the processor 3238. Such flexibility in application installation may increase the functionality of the communications device 3200 and may provide enhanced on-device functions, communication-related functions, or both. For example, secure communication applications may enable electronic commerce functions and other such financial transactions to be performed using the communications device 3200.
- In a data communication mode, a received signal such as a text message or web page download may be processed by the communication subsystem 3211 and input to the processor 3238, which may further process the received signal for output to the display 3222, or alternatively to an auxiliary I/O device 3228.
- For voice communications, overall operation of the communications device 3200 is similar, except that received signals may typically be output to a speaker 3234 and signals for transmission may be generated by a microphone 3236. Alternative voice or audio I/O subsystems, such as a voice message recording subsystem, may also be implemented on the communications device 3200. Although voice or audio signal output may be accomplished primarily through the speaker 3234, display 3222 may also be used to provide an indication of the identity of a calling party, the duration of a voice call, or other voice call-related information, for example.
- Other device subsystems 3242 may include an image capture device, e.g., a camera, for use in video conferencing or other image capture applications. The image capture device may be used in conjunction with the microphone 3236 for a video conference session.
- Serial port 3230 may be implemented in a personal digital assistant (PDA)-type device for which synchronization with a user's desktop computer (not shown) may be desirable, but such a port is an optional device component. Such a serial port 3230 may enable a user to set preferences through an external device or software application and may extend the capabilities of the communications device 3200 by providing for information or software downloads to the communications device 3200 other than through a wireless communication network 3219. The alternate download path may, for example, be used to load an encryption key onto the communications device 3200 through a direct and thus reliable and trusted connection to thereby enable secure device communication. Serial port 3230 may further be used to connect the device to a computer to act as a modem.
- Other communications subsystems 3240, such as a short-range communications subsystem, are further optional components which may provide for communication between the communications device 3200 and different systems or devices, which need not necessarily be similar devices. For example, one or more other communications subsystems 3240 may include an infrared device and associated circuits and components or a Bluetooth™ communication module to provide for communication with similarly enabled systems and devices. Other communications subsystems 3240 may further include non-cellular communications such as WI-FI, WiMAX, near field communication (NFC), BLUETOOTH, ProSe (Proximity Services) (e.g., sidelink, PC5, D2D, etc.), and/or radio frequency identification (RFID). The other communications subsystem(s) 3240 and/or other device subsystem(s) 3242 may also be used to communicate with auxiliary devices such as tablet displays, keyboards, or projectors.
- The communications device 3200 and other components described above might include a processing component that is capable of executing instructions related to the actions described above.
- FIG. 6 illustrates an example of a system 3300 that includes a processing component 3310 suitable for implementing one or more embodiments disclosed herein. In addition to the processor 3310 (which may be referred to as a central processor unit or CPU), the system 3300 might include network connectivity devices 3320, random access memory (RAM) 3330, read only memory (ROM) 3340, secondary storage 3350, and input/output (I/O) devices 3360. These components might communicate with one another via a bus 3370. In some cases, some of these components may not be present or may be combined in various combinations with one another or with other components not shown. These components might be located in a single physical entity or in more than one physical entity. Any actions described herein as being taken by the processor 3310 might be taken by the processor 3310 alone or by the processor 3310 in conjunction with one or more components shown or not shown in the drawing, such as a digital signal processor (DSP) 3380. Although the DSP 3380 is shown as a separate component, the DSP 3380 might be incorporated into the processor 3310.
- The processor 3310 executes instructions, codes, computer programs, or scripts that it might access from the network connectivity devices 3320, RAM 3330, ROM 3340, or secondary storage 3350 (which might include various disk-based systems such as hard disk, floppy disk, or optical disk). While only one CPU 3310 is shown, multiple processors may be present. Thus, while instructions may be discussed as being executed by a processor, the instructions may be executed simultaneously, serially, or otherwise by one or multiple processors. The processor 3310 may be implemented as one or more CPU chips.
- The network connectivity devices 3320 may take the form of modems, modem banks, Ethernet devices, universal serial bus (USB) interface devices, serial interfaces, token ring devices, wireless local area network (WLAN) devices, radio transceiver devices such as code division multiple access (CDMA) devices, Global System for Mobile Communications (GSM) radio transceiver devices, universal mobile telecommunications system (UMTS) radio transceiver devices, long term evolution (LTE) radio transceiver devices, new generation radio transceiver devices, worldwide interoperability for microwave access (WiMAX) devices, and/or other well-known devices for connecting to networks. These network connectivity devices 3320 may enable the processor 3310 to communicate with the Internet or one or more telecommunications networks or other networks from which the processor 3310 might receive information or to which the processor 3310 might output information. The network connectivity devices 3320 might also include one or more transceiver components 3325 capable of transmitting and/or receiving data wirelessly.
- The RAM 3330 might be used to store volatile data and perhaps to store instructions that are executed by the processor 3310. The ROM 3340 is a non-volatile memory device that typically has a smaller memory capacity than the memory capacity of the secondary storage 3350. ROM 3340 might be used to store instructions and perhaps data that are read during execution of the instructions. Access to both RAM 3330 and ROM 3340 is typically faster than to secondary storage 3350. The secondary storage 3350 is typically comprised of one or more disk drives or tape drives and might be used for non-volatile storage of data or as an overflow data storage device if RAM 3330 is not large enough to hold all working data. Secondary storage 3350 may be used to store programs that are loaded into RAM 3330 when such programs are selected for execution.
- The I/O devices 3360 may include liquid crystal displays (LCDs), touch screen displays, keyboards, keypads, switches, dials, mice, track balls, voice recognizers, card readers, paper tape readers, printers, video monitors, audio capture devices, video capture devices, cameras, microphones, or other well-known input/output devices. Also, the transceiver component 3325 might be considered to be a component of the I/O devices 3360 instead of or in addition to being a component of the network connectivity devices 3320.
- In a first embodiment, a dual sensory input speech detection method in a video conference is provided. The method comprises receiving, at a first time, a first video image input of a conference participant of the video conference and a first audio input of the conference participant; communicating the first video image input to the video conference; identifying the first video image input as a first facial image of the conference participant; determining, based on the first facial image, that the first video image input indicates the conference participant is in a speaking state; identifying the first audio input as a first speech sound; determining, while in the speaking state, that the first speech sound originates from the conference participant; and communicating the first audio input to an audio output for the video conference.
- A first modification of the first embodiment includes that the audio output is transmitted to other conference participants.
- A second modification of the first embodiment includes that the first facial image comprises a lip image of the conference participant.
- A third modification of the first embodiment includes receiving, at a second time, a second video image input of the conference participant and a second audio input of the conference participant; communicating the second video image input of the conference participant to the video conference; identifying the second video image input as a second facial image of the conference participant; determining, based on the second facial image, that the second video image input does not indicate the conference participant is in the speaking state; determining the conference participant is not in the speaking state; and adjusting the second audio input in response to determining that the conference participant is not in the speaking state.
- A fourth modification of the first embodiment includes, prior to determining the conference participant is not in the speaking state, identifying the second audio input as not a speech sound of the conference participant.
- A fifth modification of the first embodiment includes that adjusting the second audio input comprises not communicating the second audio input to the audio output of the video conference.
- A sixth modification of the first embodiment includes that adjusting the second audio input comprises attenuating the second audio input and communicating an attenuated second audio input to the audio output of the video conference.
- A seventh modification of the first embodiment includes that the method is performed by a dual sensory input speech detection system and that the method further comprises training the dual sensory input speech detection system.
- An eighth modification of the first embodiment includes that training the dual sensory input speech detection system comprises one or more of identifying lip landmarks of the conference participant, or calculating a mouth ratio of the conference participant.
- A ninth modification of the first embodiment includes that training the dual sensory input speech detection system further comprises identifying speech characteristics of the conference participant, wherein identifying the first audio input as the first speech sound comprises comparing the first audio input to the speech characteristics of the conference participant, and wherein determining the conference participant is in the speaking state comprises comparing one or more of the lip landmarks or the mouth ratio with the first video image input.
- A tenth modification of the first embodiment includes receiving, at the first time, a third video image input of a second conference participant and a third audio input of the second conference participant; communicating the third video image input of the second conference participant to the video conference; identifying the third video image input as a third facial image of the second conference participant; determining, based on the third facial image, that the third video image input does not indicate the second conference participant is in the speaking state; determining the second conference participant is not in the speaking state; and adjusting the third audio input in response to determining that the second conference participant is not in the speaking state.
- In a second embodiment, a communication device comprises a memory storing instructions and a processor coupled to the memory and configured to execute the instructions to cause the communication device to receive, at a first time, a first video image input of a conference participant of a video conference and a first audio input of the conference participant; communicate the first video image input to the video conference; identify the first video image input as a first facial image of the conference participant; determine, based on the first facial image, that the first video image input indicates the conference participant is in a speaking state; identify the first audio input as a first speech sound; determine, while in the speaking state, that the first speech sound originates from the conference participant; and communicate the first audio input to an audio output for the video conference.
- A first modification of the second embodiment includes that the audio output is transmitted to other conference participants.
- A second modification of the second embodiment includes that the first facial image comprises a lip image of the conference participant.
- A third modification of the second embodiment includes that the instructions further cause the communication device to receive, at a second time, a second video image input of the conference participant and a second audio input of the conference participant; communicate the second video image input of the conference participant to the video conference; identify the second video image input as a second facial image of the conference participant; determine, based on the second facial image, that the second video image input does not indicate the conference participant is in the speaking state; determine the conference participant is not in the speaking state; and adjust the second audio input in response to determining that the conference participant is not in the speaking state.
- A fourth modification of the second embodiment includes that the instructions further cause the communication device to identify the second audio input as not a speech sound of the conference participant.
- A fifth modification of the second embodiment includes that the instructions further cause the communication device to not communicate the second audio input to the audio output of the video conference.
- A sixth modification of the second embodiment includes that the instructions further cause the communication device to attenuate the second audio input and communicate an attenuated second audio input to the audio output of the video conference.
- A seventh modification of the second embodiment includes a dual sensory input speech detection system, wherein the instructions further cause the communication device to train the dual sensory input speech detection system.
- An eighth modification of the second embodiment includes that the instructions further cause the communication device to identify lip landmarks of the conference participant, or calculate a mouth ratio of the conference participant.
- A ninth modification of the second embodiment includes that the instructions further cause the communication device to identify speech characteristics of the conference participant; compare the first audio input to the speech characteristics of the conference participant; and compare one or more of the lip landmarks or the mouth ratio with the first video image input.
- While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods may be embodied in many other specific forms without departing from the scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.
- Also, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component, whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.
Claims (36)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/946,890 US20230017401A1 (en) | 2020-12-04 | 2022-09-16 | Speech Activity Detection Using Dual Sensory Based Learning |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/112,637 US11451742B2 (en) | 2020-12-04 | 2020-12-04 | Speech activity detection using dual sensory based learning |
US17/946,890 US20230017401A1 (en) | 2020-12-04 | 2022-09-16 | Speech Activity Detection Using Dual Sensory Based Learning |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/112,637 Continuation US11451742B2 (en) | 2020-12-04 | 2020-12-04 | Speech activity detection using dual sensory based learning |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230017401A1 true US20230017401A1 (en) | 2023-01-19 |
Family
ID=78822357
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/112,637 Active US11451742B2 (en) | 2020-12-04 | 2020-12-04 | Speech activity detection using dual sensory based learning |
US17/946,890 Pending US20230017401A1 (en) | 2020-12-04 | 2022-09-16 | Speech Activity Detection Using Dual Sensory Based Learning |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/112,637 Active US11451742B2 (en) | 2020-12-04 | 2020-12-04 | Speech activity detection using dual sensory based learning |
Country Status (2)
Country | Link |
---|---|
US (2) | US11451742B2 (en) |
EP (1) | EP4009323A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021114224A1 (en) | 2019-12-13 | 2021-06-17 | Huawei Technologies Co., Ltd. | Voice detection method, prediction model training method, apparatus, device, and medium
WO2022266872A1 (en) * | 2021-06-23 | 2022-12-29 | Citrix Systems, Inc. | Virtual meeting control |
US11783840B2 (en) * | 2021-10-25 | 2023-10-10 | Kyndryl, Inc. | Video conference verbal junction identification via NLP |
US20240046950A1 (en) * | 2022-08-04 | 2024-02-08 | Motorola Mobility Llc | Methods, Systems, and Devices for Spectrally Adjusting Audio Gain in Videoconference and Other Applications |
CN116582637A (en) * | 2023-05-26 | 2023-08-11 | Beijing Zitiao Network Technology Co., Ltd. | Screen splitting method of video conference picture and related equipment
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6154548A (en) * | 1997-09-27 | 2000-11-28 | Ati Technologies | Audio mute control signal generating circuit |
US20060133623A1 (en) * | 2001-01-08 | 2006-06-22 | Arnon Amir | System and method for microphone gain adjust based on speaker orientation |
US7081915B1 (en) * | 1998-06-17 | 2006-07-25 | Intel Corporation | Control of video conferencing using activity detection |
US20140098174A1 (en) * | 2012-10-08 | 2014-04-10 | Citrix Systems, Inc. | Facial Recognition and Transmission of Facial Images in a Videoconference |
US8903130B1 (en) * | 2011-05-09 | 2014-12-02 | Google Inc. | Virtual camera operator |
US20180276395A1 (en) * | 2017-03-21 | 2018-09-27 | International Business Machines Corporation | Skull Conduction-Based Telephonic Conversation Management |
US20190335287A1 (en) * | 2016-10-21 | 2019-10-31 | Samsung Electronics Co., Ltd. | Method for transmitting audio signal and outputting received audio signal in multimedia communication between terminal devices, and terminal device for performing same
US10757529B2 (en) * | 2015-06-18 | 2020-08-25 | Nokia Technologies Oy | Binaural audio reproduction |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7020257B2 (en) | 2002-04-17 | 2006-03-28 | Texas Instruments Incorporated | Voice activity identification for speaker tracking in a packet based conferencing system with distributed processing
US9154730B2 (en) * | 2009-10-16 | 2015-10-06 | Hewlett-Packard Development Company, L.P. | System and method for determining the active talkers in a video conference |
WO2012083554A1 (en) | 2010-12-24 | 2012-06-28 | Huawei Technologies Co., Ltd. | A method and an apparatus for performing a voice activity detection |
CN102682273A (en) | 2011-03-18 | 2012-09-19 | Sharp Corporation | Device and method for detecting lip movement
US9485459B2 (en) * | 2012-12-14 | 2016-11-01 | Biscotti Inc. | Virtual window |
US20150081550A1 (en) * | 2013-09-10 | 2015-03-19 | Scvngr, Inc. | Remote transaction processing using biometrics |
KR101685466B1 (en) * | 2014-08-28 | 2016-12-12 | Samsung SDS Co., Ltd. | Method for extending participants of video conference service
- 2020-12-04: US application 17/112,637 filed; granted as US 11451742 B2 (Active)
- 2021-12-03: EP application 21212296.4 filed; published as EP 4009323 A1 (Pending)
- 2022-09-16: US application 17/946,890 filed; published as US 20230017401 A1 (Pending)
Also Published As
Publication number | Publication date |
---|---|
US20220182578A1 (en) | 2022-06-09 |
US11451742B2 (en) | 2022-09-20 |
EP4009323A1 (en) | 2022-06-08 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
 | AS | Assignment | Owner name: BLACKBERRY LIMITED, CANADA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SIRCAR, SHILADITYA;REEL/FRAME:061676/0008. Effective date: 20201203
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED