WO2019002417A1 - Sound responsive device and method - Google Patents

Sound responsive device and method

Info

Publication number
WO2019002417A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
audio signal
real
world
determining
Prior art date
Application number
PCT/EP2018/067333
Other languages
French (fr)
Inventor
Paul MOORHEAD
Original Assignee
Kraydel Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kraydel Limited filed Critical Kraydel Limited
Publication of WO2019002417A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices

Definitions

  • the present invention relates to sound responsive devices.
  • the invention relates to electronic devices for responding to real-world sounds.
  • a first aspect of the invention provides a method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone; performing audio signal processing on the corresponding audio signal produced by the or each microphone; determining from said audio signal processing if said sound is a real-world sound; and performing at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
  • a second aspect of the invention provides a sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
  • determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process.
  • Said determining typically comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.
  • Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding. Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.
  • Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.
  • Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.
  • Said audio signal processing may comprise frequency analysis, and said determining involves determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process.
  • Said one or more frequency characteristic may comprise a spectral distribution of said audio signal.
  • Said one or more frequency characteristic may comprise a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non real-world sounds.
  • Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in one or more frequency bands.
  • Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz.
  • Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz.
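The band-absence test described in the preceding claims can be sketched numerically. This is a minimal illustration rather than the patented implementation; the function names, the 500Hz/10kHz band edges and the energy threshold are assumptions chosen for the example.

```python
import numpy as np

def band_energy_fraction(signal, sample_rate, f_lo, f_hi):
    """Fraction of total spectral energy between f_lo and f_hi (Hz)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    return spectrum[(freqs >= f_lo) & (freqs < f_hi)].sum() / total

def looks_processed(signal, sample_rate, threshold=1e-3):
    """Flag a signal as 'processed' if it is missing energy both below
    500 Hz and above 10 kHz -- bands that codecs and loudspeakers often
    strip from the signal."""
    low = band_energy_fraction(signal, sample_rate, 0, 500)
    high = band_energy_fraction(signal, sample_rate, 10_000, sample_rate / 2)
    return bool(low < threshold and high < threshold)
```

A broadband real-world sound retains energy in at least one of the extreme bands, so only the band-limited signal trips the check.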
  • Said one or more characteristic may comprise one or more bit rate characteristic.
  • Said one or more bit rate characteristic may comprise a change in bit rate.
  • Said one or more bit rate characteristic may comprise use of different bit rates for different frequency bands of the audio signal.
  • Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz.
  • Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz.
  • Said one or more bit rate characteristic may comprise a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.
  • Said one or more characteristic may comprise a noise floor level.
  • Said one or more characteristic may comprise the noise floor level being above a threshold level.
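One crude way to estimate the noise floor referred to above is to take a low percentile of short-frame RMS levels, on the assumption that even a busy real-world recording contains near-silent instants while a processed signal carries inherent background noise everywhere. The frame size and percentile below are illustrative choices, not values from the patent.

```python
import numpy as np

def noise_floor(signal, frame=256):
    """Estimate the noise floor as the 10th percentile of per-frame RMS,
    i.e. a level the signal rarely drops below."""
    n = len(signal) // frame
    frames = np.asarray(signal[: n * frame]).reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return float(np.percentile(rms, 10))
```

Comparing the result against a threshold derived from reference data then yields the claimed "noise floor above a threshold level" characteristic.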
  • Said determining may comprise determining if said sound was rendered by a loudspeaker.
  • Said determining from said audio signal processing if said sound is a real-world sound may involve comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real-world sound if said audio signal matches said at least one reference template.
  • the or each template may comprise a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.
  • Said one or more characteristics may be derived empirically from training data.
  • Said training data may comprise data representing pairs of non-processed and corresponding processed sound samples.
  • Said one or more characteristics may be derived from said training data by machine-learning.
  • Said determining if said sound is a real-world sound may comprise determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.
  • Preferred embodiments employ either one or both of the following approaches to overcome the problem outlined above: 1) Recognition by spatial localisation of sound sources; and 2) Recognition of processed sounds by detection of audio signal processing characteristics and artefacts.
  • Figure 1 is a schematic diagram of a room in which a sound responsive device embodying one aspect of the invention is installed;
  • Figure 2 is a block diagram of the sound responsive device of Figure 1;
  • Figure 3 is a flow diagram illustrating a preferred operation of the device of Figure 1.

Detailed Description of the Drawings
  • the device 10 is shown installed in a room 12.
  • the room 12 is a typical living room but this is not limiting to the invention.
  • At least one, but more typically a plurality of loudspeakers 14 are provided in the room 12.
  • the loudspeakers 14 may be part of, or connected to (via wired or wireless connection), one or more electronic device (e.g. a television, radio, audio player, media player, computer, smart speaker) that is capable of providing audio signals to the loudspeakers 14 for rendering to listeners (not shown) in the room.
  • a television 16 is shown as an example of such an electronic device.
  • Each of the loudspeakers 14 shown in Figure 1 may for example be connected to the TV 16.
  • the room may contain one or more electronic device connected to, or including, one or more loudspeakers 14.
  • the loudspeakers 14 occupy a fixed position in the room 12, or at least a position that does not change frequently.
  • the loudspeakers 14 are not part of the sound responsive device 10, although the sound responsive device 10 may have one or more loudspeakers (not shown) of its own.
  • the sound responsive device 10 is connectable (by wired or wireless connection) to one or more of the loudspeakers 14.
  • the sound responsive device 10 may comprise any electronic apparatus or system (not illustrated) that supports speech and/or sound recognition as part of its overall functionality.
  • the system/apparatus may comprise a smart speaker, or a voice-controlled TV, audio player, media player or computing device, or a monitoring system that detects sounds in its environment and responds accordingly (e.g. issues an alarm or operates itself or some other equipment accordingly, or takes any other responsive action(s)).
  • the nature of the action(s) taken by the device 10 in response to detecting a sound depends on the overall functionality of the device 10 and may also depend on the type of the detected sound. Accordingly, the device 10 is typically configured to perform classification of received sounds. This may be achieved using any conventional speech recognition and/or sound recognition techniques.
  • the device 10 may be configured to take one or more action only in response to sounds that it recognises as being of a known type as determined by the classification process.
  • the device 10 may be configured to monitor the status of its environment depending on the detected recognised sounds (without necessarily taking action, or taking action depending on the determined status).
  • the device 10 typically includes a controller 11 for controlling the overall operation of the device 10.
  • the controller 11 may comprise any suitably configured or programmed processor(s), for example a microprocessor, microcontroller or multi-core processor. Typically the controller 11 causes the device 10 to take whichever action(s) are required in response to detection of recognised sounds.
  • the controller 11 may also perform the sound classification or control the operation of a sound classification module as is convenient.
  • the device 10 is implemented using a multi-core processor running a plurality of processes, one of which may be designated as the controller, the others performing the other tasks described herein as required. Each process may be performed in software, hardware or a combination of the two as is convenient. One or more hardware digital signal processors may be provided to perform one or more of the processes as is convenient and applicable.
  • the device 10 is capable of distinguishing between real-world sounds and non real- world sounds.
  • a real-world sound is a sound that is created, usually spontaneously and in real time, by a person, object or event in the environment in which the device 10 is located (which in this example comprises the room 12).
  • real-world sounds typically comprise sounds that have not been processed by any audio signal processing technique and/or that are not pre-recorded.
  • Real-world sounds may also be said to comprise sounds that have not been rendered by a loudspeaker. Examples include live human and animal utterances, including live speech and other noises, crashes, bangs, alarms, bells and so on. In the present context therefore real-world sounds may be referred to as non-processed sounds, or sounds not emanating from a loudspeaker.
  • Non real-world sounds are typically sounds that have been processed by one or more audio signal processing technique, and may comprise pre-recorded or broadcast sounds.
  • Non real-world sounds are usually rendered by a loudspeaker. Examples include sounds emanating from a TV, radio, audio or media player and so on.
  • Non real-world sounds may be referred to as processed sounds or sounds emanating from a loudspeaker.
  • the device 10 is capable of distinguishing between real-world sounds and non real- world sounds even if the sounds are of the same type, e.g. distinguishing between live speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a person in the environment and recorded speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a TV or media player.
  • the device 10 is configured to employ either one or both of the following methods to achieve the above aim: 1) Recognition of sounds by spatial localisation of sound sources; and 2) Recognition of processed sounds by detection of audio signal processing characteristics and artefacts.
  • Either or both of these methods may be used by the device 10 to determine if a detected sound is a real-world sound or a non-real-world sound.
  • the device 10 is configured to respond only to sounds that it has determined to be real-world sounds.
  • FIG. 2 is a block diagram of a typical embodiment of the sound responsive device 10.
  • the device 10 comprises at least one microphone 18. Typical embodiments include two or more (4 or more is preferred) microphones 18 to facilitate determining the location of sound sources.
  • the device 10 comprises an audio signal processor 20 for receiving and processing audio signals produced by the microphones 18 in response to detecting sounds in the room 12 or other environment.
  • the audio signal processor 20 may take any convenient conventional form, being implemented in hardware, software or a combination of hardware and software. Accordingly, the audio signal processor 20 may be implemented by one or more suitably configured ASIC, FPGA or other integrated circuit, and/or a computing device with suitably programmed microprocessor(s).
  • the audio signal processor 20 may be configured to perform any one or more of the following audio signal processing functions: frequency spectrum analysis; compression artefact detection; and/or location analysis.
  • the audio signal processor 20 includes components or other means for performing the relevant audio signal processing functions, indicated in the example of Figure 2 at 22, 24 and 26.
  • the audio signal processor 20 may be configured to perform classification of detected sounds using any conventional sound and/or speech recognition techniques.
  • Location analysis involves identifying one or more locations in the environment corresponding to the source of detected sounds, i.e. spatial localisation of sound sources within the environment. In the present example, this involves determining the location of the loudspeakers 14.
  • any one or more of several known techniques may be used to locate the source of a sound in space with accuracy, for example: using differential arrival times (phase difference) at each microphone; and/or using the difference in volume level at each microphone (optionally amplified by the use of highly directionally sensitive microphones).
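The differential-arrival-time technique can be illustrated with a plain cross-correlation between two microphone signals. This is a simplified stand-in for production methods such as GCC-PHAT; the function name and two-microphone setup are assumptions for the example.

```python
import numpy as np

def arrival_delay(sig_a, sig_b, sample_rate):
    """Estimate how much later the same sound reaches microphone B than
    microphone A, from the peak of the cross-correlation (in seconds).
    A positive result means B is further from the source than A."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag_samples = (len(sig_b) - 1) - np.argmax(corr)
    return lag_samples / sample_rate
```

With delays estimated for several microphone pairs and the microphone geometry known, the source position can be triangulated; a position matching a loudspeaker location learned in the training mode flags the sound as a suspected non-real-world sound.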
  • the preferred device 10 is operable in a training mode in which it learns the location of one or more non-real-world sound source in its environment. In the present example this involves determining the location of the loudspeakers 14.
  • the device 10 detects sounds using the microphones 18 (or at least two of them) and performs location analysis on the output signals of the microphones 18 to determine the location of one or more loudspeaker or other sound source.
  • each loudspeaker 14 or other sound source is operated individually (i.e. one at a time) to produce sound for detection by the device 10.
  • two or more loudspeakers 14 or other sound sources may be operated simultaneously in the training mode (for example where two or more loudspeakers 14 are driven by the same TV or other electronic device).
  • the loudspeakers 14 or other sound source may be operated to produce sounds that they would produce during normal operation, or may be operated to produce one or more test sounds.
  • the device 10 is connectable (by wired or wireless connection as is convenient) to one or more of the sound producing devices (e.g. TV, radio, media player or other device having or being connected to one or more loudspeaker 14) in the environment in order to cause them to generate the sounds during the training mode.
  • the device 10 uses test sounds for this purpose and may store test signals for sending to the sound producing devices for this purpose.
  • test signals may include full 5.1 or 7.1 sound signals to deal with environments with cinema-like loudspeaker installations.
  • the preferred device 10 is also operable in a listening mode in which it detects real-world sounds in the environment and may take one or more actions in response to detecting a real-world sound.
  • the nature of the actions may depend on a wider functionality of the device 10, or of a system or apparatus of which the device 10 is part.
  • the actions may comprise generating one or more output, for example an audio and/or visual output, and/or one or more output signal for operating one or more other device to which the device 10 is connected or of which it is part.
  • the device 10 may be connected to (or be integrated with) a TV or other electronic device and may operate the TV/electronic device depending on one or more detected sounds.
  • the device 10 may be configured to take different actions depending on what sounds are detected.
  • the device 10 itself may be provided with one or more output device (e.g. a loudspeaker, lamp, video screen, klaxon, buzzer or other alarm device or telecommunications device), which it may operate depending on what sounds are detected.
  • the device 10, upon determining that a detected sound is not a real-world sound, can ignore the detected sound, e.g. take no action in response to it.
  • the device 10 may be configured to take one or more actions in response to detecting non-real-world sounds. Typically such actions are different from those taken in response to detected real-world sounds.
  • the device 10 may be configured to take different action (including no action) for real-world sounds and non-real-world sounds even if the sounds are of the same type.
  • Limitations to the sound source localisation technique include: localising portable devices such as radios or wireless speakers which may be moved regularly; incorrectly ignoring sounds from a person positioned close to one of the locations the device 10 has determined should be ignored; and locating sound sources that are close to the device 10 (e.g. speakers built into a TV set on which the device 10 is located).
  • Such limitations can be mitigated by determining whether or not a detected sound has one or more characteristic indicating that it has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression and/or rendering via electronic amplifier and/or loudspeaker, rather than being a non-processed, or raw, real-world sound.
  • This analysis can be achieved by performing audio signal processing of the output signals produced by at least one of the microphones 18 when a sound is detected. Analysis of detected sounds to differentiate between processed and non-processed sounds (and therefore between non-real-world and real-world sounds) can be performed in addition to, or instead of, the spatial localisation of sounds described above.
  • Sounds broadcast via TV or radio, sounds produced from a CD, DVD or Blu-ray disc, and streamed media sounds have been subjected to one or more audio processes, including any one or more of the following:
  • audio signals will usually have undergone some form of audio compression, e.g. dynamic range compression.
  • the audio signal will typically have been encoded and decoded by a codec (coder-decoder).
  • codecs commonly use a psychoacoustic technique that relies on knowledge of how humans perceive sound. Psychoacoustic codecs compress the sound by removing parts of the sound that humans do not pay attention to, and/or devoting fewer bits of the data stream to capturing parts of the signal which are less important to the human experience than the others.
  • When decoding an encoded signal back to renderable sound, an amplifier generates a varying voltage/current to operate a loudspeaker.
  • both the amplifier's electrical characteristics, and the loudspeaker's mechanical characteristics leave an imprint on the sound being produced - often referred to as the "transfer function".
  • loudspeakers associated with a TV have a limited frequency response, so yet more of the high and low frequencies will be lost.
  • the audio signal is likely to undergo encoding, compression, decoding and decompression at least once and often more than once as it passes through the various network links from initial recording, to studio to transmitter.
  • Different codecs may be used at different stages so the end result may bear traces of more than one kind of processing.
  • processed sounds commonly have one or more characteristics that non-processed real-world sounds do not have, and vice versa.
  • non-processed real-world sounds tend to include audio signal components at higher and/or lower frequencies than processed sounds such as those emanating from a television or audio system.
  • non-processed real-world sounds tend to have less inherent background noise than processed signals.
  • non-processed real-world sounds tend to have a more natural spread of frequency components than processed sounds.
  • the spectral distribution (which may be referred to as spectral power distribution) of the or each audio signal representing a detected sound can provide an indication of whether the sound is a real world sound or not.
  • the frequency distribution, i.e. the distribution of the frequency components of the audio signal, and other characteristics of processed sound are detectably different from those of real-world non-processed sounds. Some of these characteristics are complex, e.g. changes in encoding bit rate (e.g. a lower bit rate after a loud noise, or for very high or low frequencies), and introduce identifiable artefacts into the processed audio signals.
  • a processed audio signal may include detectable artefacts arising from any one or more of the processes described above. Accordingly, any sound (and more particularly any corresponding audio signal representing the sound) detected by the device 10 may be analysed in respect of any one or more signal characteristics. The relevant characteristics include, but are not limited to:
  • the frequency content of the audio signal, in particular the presence or absence of signal components in one or more frequency bands, especially a high frequency band (e.g. above 20kHz or above 500kHz) and/or a low frequency band (e.g. below 20Hz or below 50Hz).
  • the spectral distribution of the audio signal, especially within one or more frequency bands, e.g. between 20Hz and 500kHz, between 500Hz and 2kHz, or from 500Hz to 50kHz (or another frequency range, e.g. a frequency range deemed to correspond with the human voice).
  • the bitrate of the audio signal, including the absolute bitrate and/or changes in bitrate.
  • this may involve detecting different bitrates being used for different frequency components (in particular relatively low bit rates being used for high (e.g. >15kHz) and/or low (e.g. <500Hz) frequency bands), and/or relatively low bitrates being used after a signal event such as a loud noise (which may be referred to as a high intensity signal event).
  • the noise floor level of the audio signal, in particular a relatively high noise floor level (e.g. above a threshold value that can be determined from reference data).
  • a rolling window of sound (e.g. of up to a few seconds) may be captured continuously from each microphone 18, and once the trigger condition(s) have been met a sound segment of defined duration, commencing with the trigger sound, may be put into a queue for analysis. Any convenient early, random or other discard technique may be employed if the queue grows beyond acceptable limits.
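The rolling capture window and bounded analysis queue described above can be sketched with two `deque`s, whose `maxlen` behaviour provides the "early discard" for free. The class name and the buffer/queue sizes are illustrative assumptions.

```python
from collections import deque

class RollingCapture:
    """Keep the most recent `window` samples from a microphone; when a
    trigger condition fires, enqueue a snapshot of the buffer for
    analysis, silently discarding the oldest queued segment if the
    queue is already full."""

    def __init__(self, window=44100 * 3, max_queue=8):
        self.buffer = deque(maxlen=window)   # rolling window of sound
        self.queue = deque(maxlen=max_queue) # oldest segments dropped first

    def feed(self, samples):
        self.buffer.extend(samples)

    def trigger(self):
        self.queue.append(list(self.buffer))
```

A real implementation would enqueue a segment of defined duration starting at the trigger sound; here the whole window stands in for that segment.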
  • Figure 3 shows a preferred operation of the device 10 in the listening mode.
  • In step 301, the device 10 captures a sample of detected sound from the output of one or more microphone 18 in response to the trigger condition(s) being met.
  • In step 302, the device 10 performs location analysis on the detected sound as described above. This may involve determining the location of the sound's source using the phase difference between corresponding signals captured from at least two microphones 18 and/or the sound intensity difference between corresponding signals captured from at least two microphones 18, and may depend on the directional sensitivity of the or each relevant microphone 18. Sounds that are determined as having emanated from the location of a known loudspeaker 14 (as determined during the training mode) can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
  • In step 303, the device 10 performs transfer function, or frequency spectrum, analysis of the detected sound to identify one or more frequency characteristics that are indicative of it being either a real-world sound or a non-real-world sound.
  • this involves determining that the sound is a processed, or non-real-world, sound if it lacks high and/or low frequency components that are commonly removed by audio encoding and/or by rendering via amplifier and/or loudspeaker. Sounds that are determined as having been processed can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
  • the transfer function analysis may involve comparing the sound sample (conveniently a transfer function representing the sound sample) against one or more transfer function template associated with audio recording, audio broadcast and/or audio reproduction.
  • Any audio playback system will have a transfer response h(t) and corresponding frequency domain response H(s). Playing the audio source signal sig(t) through the system will convolve sig(t) with h(t), or, in a frequency domain representation, multiply SIG(s) (the frequency domain representation of sig(t)) by H(s).
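The identity stated here, that convolving sig(t) with h(t) in the time domain equals multiplying SIG(s) by H(s) in the frequency domain, is easy to verify numerically. The three-tap impulse response below is an arbitrary stand-in for a real amplifier/loudspeaker response.

```python
import numpy as np

rng = np.random.default_rng(1)
sig = rng.normal(size=256)          # source signal sig(t)
h = np.array([1.0, 0.5, 0.25])      # playback system impulse response h(t)

rendered = np.convolve(sig, h)      # time domain: what the microphone hears

# Frequency domain: SIG(s) * H(s), padded so circular == linear convolution
n = len(rendered)
SIG = np.fft.rfft(sig, n)
H = np.fft.rfft(h, n)
rendered_fd = np.fft.irfft(SIG * H, n)

assert np.allclose(rendered, rendered_fd)
```

Fitting a candidate H(s) that maps a clean reference spectrum onto the observed one is the basis of the template comparison described next.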
  • the fitting technique can use any number of standard parametric techniques.
  • the transfer functions for broadcast compression and Blu-ray encoding can, for example, be used as templates. Such templates are best suited to transients such as gunshots, breaking glass or TV screams; they will be less effective for narrower-band sounds such as vehicle noise or human speech that does not have significant variability (inflection or emotion).
  • In step 304, the device 10 looks for artefacts in the detected sound (i.e. in the corresponding frequency spectrum and/or waveform of the corresponding audio signal) which indicate that the sound has been subjected to audio compression, e.g. psychoacoustic compression or another compression technique. This may involve identifying relatively low bitrate encoding in high and/or low frequency bands, and/or a reduction in encoding quality after a loud noise, and/or a noise floor level that can be associated with compression. Sounds that are determined as having been subjected to compression can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
  • Sounds can be deemed to be processed or non-real-world sounds (and therefore ignored or rejected) upon being identified as such by any one of steps 302, 303 or 304, or alternatively upon being identified as such by any two or more of steps 302, 303 and 304. Any determinations made by the audio signal processor 20 in this regard may be communicated to the controller 11, which may make the decision on whether or not to ignore the detected sound and/or determine which actions are to be taken in response to the detected sound.
  • The sequencing of steps 302, 303 and 304 in Figure 3 is illustrative; in alternative embodiments these (and/or other) steps may be performed in different orders, merged, and/or operated in parallel, depending on the requirements of the application and the capabilities of the device 10.
  • the device 10 combines the techniques of location analysis, spectrum analysis and artefact detection.
  • any one of these techniques may be used on its own, or in combination with any other of the techniques.
  • spectrum analysis and artefact detection in particular may each be sufficient on its own to achieve an effective level of specificity for a given use-case.
  • Training using machine-learning techniques - this may, for example, involve training the device 10 with training data which may comprise pairs of sound samples - an original sound generated live (or a very high-fidelity or artefact-free recording), and the same sound after typical encoding, compression and/or reproduction.
  • the second process does not generate an algorithm as such, and it may not be apparent how the system achieves its resulting level of effective differentiation.
  • the machine-learning approach may also collapse steps in the processing: in other words, it may not be necessary to separately look for spectrum differences and compression artefacts; a trained system may simply learn the difference between processed and non-processed sounds using whatever characteristics it finds to be most capable of allowing the distinction to be made.
  • the machine-learning approach may involve providing the device 10 with reference real-world and non-real-world sounds, the device 10 being configured through machine-learning to develop its own criteria empirically for distinguishing between them. These criteria may involve elements of location, spectral distribution and artefacts, and may differ for different types of input sound.
  • the device 10 is intended to monitor for coughs, sneezes, cries for help, sounds of danger and other noises but in a normal home the TV is likely to be active for several hours a day and generate many similar artificial sound events.
  • the device 10 has a plurality of microphones 18 and audio signal processing circuitry 20 configured to perform the following:
  • Measurement of the phase shift between corresponding sound samples from each (or at least two) of the microphones 18; and audio signal analysis of each sample, which may involve transfer function analysis and/or artefact detection.
  • the device 10 determines the position of the loudspeakers 14 within the room 12, preferably by playing test signals through the television (e.g. via HDMI or other connection) and detecting the corresponding sounds rendered by the loudspeakers 14 using the microphones 18. At a minimum it is preferred that alternate left and right channel test signals are used, but more preferably test signals for 2.1, 5.1 and 7.1 sound set-ups are used, selecting channels and frequencies as appropriate.
  • the device 10 can perform location analysis and reject sounds from the designated speaker locations. Alternatively, sounds from those locations can simply be marked as "suspect" and further processed before making a final decision, for example based on weighted probabilities from each phase of analysis.
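The rejection step described in the last point above can be sketched in a few lines of Python. This is an illustrative sketch only: the angular tolerance, the use of a single bearing angle per loudspeaker, and the "suspect" policy are assumptions for the example, not details specified by the patent.

```python
def classify_by_bearing(bearing_deg, speaker_bearings_deg, tolerance_deg=10.0,
                        reject_outright=False):
    """Return 'rejected', 'suspect' or 'accepted' for a detected sound,
    given its estimated bearing and the learned loudspeaker bearings."""
    for spk in speaker_bearings_deg:
        # smallest absolute angular difference, handling wrap-around at 360
        diff = abs((bearing_deg - spk + 180.0) % 360.0 - 180.0)
        if diff <= tolerance_deg:
            return "rejected" if reject_outright else "suspect"
    return "accepted"
```

Marking sounds as "suspect" rather than rejecting them outright allows the later spectrum-analysis and artefact-detection stages to contribute to the final decision.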


Abstract

A sound recognition method that is capable of distinguishing between real-world sounds and pre-recorded or broadcast sound by determining if the sound emanates from a designated location, such as the location of a loudspeaker, or by recognising characteristics of the sound indicating that it has been subjected to audio recording, audio broadcast and/or audio reproduction processes.

Description

Sound Responsive Device and Method
Field of the Invention

The present invention relates to sound responsive devices. In particular the invention relates to electronic devices for responding to real-world sounds.
Background to the Invention

Electronic devices that understand and respond to spoken commands are becoming common, but issues are frequently encountered where the devices mistake audio from TV or radio as sound from a live source.
There are also devices that attempt to classify noises in the home and respond appropriately including, for example, recognising gunfire, breaking glass, shouts etc., or even identifying coughs, sneezes, doorbells or telephones. Again, these devices may undesirably treat similar sounds from a TV program as being "real".
Summary of the Invention
A first aspect of the invention provides a method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone;
producing a corresponding audio signal from the or each microphone;
performing audio signal processing on the corresponding audio signal from the or each microphone;
determining from said audio signal processing if said sound is a real-world sound;
performing at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
A second aspect of the invention provides a sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
Preferably determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process. Said determining typically comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.
Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding. Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.
Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.
Said determining may involve determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.
Said audio signal processing may comprise frequency analysis, and said determining involves determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process. Said one or more frequency characteristic may comprise a spectral distribution of said audio signal. Said one or more frequency characteristic may comprise a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non real-world sounds. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in one or more frequency bands. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz. Said one or more frequency characteristic may comprise an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz. Said one or more characteristic may comprise one or more bit rate characteristic. Said one or more bit rate characteristic may comprise a change in bit rate. Said one or more bit rate characteristic may comprise use of different bit rates for different frequency bands of the audio signal. Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz. Said one or more bit rate characteristic may comprise use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz. Said one or more bit rate characteristic may comprise a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.
Said one or more characteristic may comprise a noise floor level. Said one or more characteristic may comprise the noise floor level being above a threshold level.
Said determining may comprise determining if said sound was rendered by a loudspeaker.
Said determining from said audio signal processing if said sound is a real-world sound may involve comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real-world sound if said audio signal matches said at least one reference template. The or each template may comprise a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.
Said one or more characteristics may be derived empirically from training data.
Said training data may comprise data representing pairs of non-processed and corresponding processed sound samples.
Said one or more characteristics may be derived from said training data by machine-learning.
Said determining if said sound is a real-world sound may comprise determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.
Preferred embodiments employ either one or both of the following approaches to overcome the problem outlined above:

1) Recognition by spatial localisation of sound sources
2) Recognising characteristics of sound that indicate that the sound has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction. Preferred embodiments of the invention are capable of distinguishing between real-world sounds and pre-recorded or broadcast sound.
Further advantageous features of the invention will be apparent to those ordinarily skilled in the art upon review of the following description of a specific embodiment and with reference to the accompanying drawings.

Brief Description of the Drawings
An embodiment of the invention is now described by way of example and with reference to the accompanying drawings in which:
Figure 1 is a schematic diagram of a room in which a sound responsive device embodying one aspect of the invention is installed;
Figure 2 is a block diagram of the sound responsive device of Figure 1 ; and
Figure 3 is a flow diagram illustrating a preferred operation of the device of Figure 1.

Detailed Description of the Drawings

Referring now to Figure 1 of the drawings there is shown a sound responsive device 10 embodying one aspect of the invention. The device 10 is shown installed in a room 12. In the illustrated example the room 12 is a typical living room but this is not limiting to the invention. At least one, but more typically a plurality of loudspeakers 14 are provided in the room 12. The loudspeakers 14 may be part of, or connected to (via wired or wireless connection), one or more electronic device (e.g. a television, radio, audio player, media player, computer, smart speaker) that is capable of providing audio signals to the loudspeakers 14 for rendering to listeners (not shown) in the room. In Figure 1, a television 16 is shown as an example of such an electronic device. Each of the loudspeakers 14 shown in Figure 1 may for example be connected to the TV 16. More generally, the room may contain one or more electronic device connected to, or including, one or more loudspeakers 14. Ideally, the loudspeakers 14 occupy a fixed position in the room 12, or at least a position that does not change frequently. In typical embodiments, the loudspeakers 14 are not part of the sound responsive device 10, although the sound responsive device 10 may have one or more loudspeakers (not shown) of its own. Advantageously the sound responsive device 10 is connectable (by wired or wireless connection) to one or more of the loudspeakers 14.
The sound responsive device 10 may comprise any electronic apparatus or system (not illustrated) that supports speech and/or sound recognition as part of its overall functionality. For example the system/apparatus may comprise a smart speaker, or a voice-controlled TV, audio player, media player or computing device, or a monitoring system that detects sounds in its environment and responds accordingly (e.g. issues an alarm or operates itself or some other equipment accordingly, or takes any other responsive action(s)). The nature of the action(s) taken by the device 10 in response to detecting a sound depends on the overall functionality of the device 10 and may also depend on the type of the detected sound. Accordingly, the device 10 is typically configured to perform classification of received sounds. This may be achieved using any conventional speech recognition and/or sound recognition techniques. The device 10 may be configured to take one or more action only in response to sounds that it recognises as being of a known type as determined by the classification process. The device 10 may be configured to monitor the status of its environment depending on the detected recognised sounds (without necessarily taking action, or taking action depending on the determined status). The device 10 typically includes a controller 11 for controlling the overall operation of the device 10. The controller 11 may comprise any suitably configured or programmed processor(s), for example a microprocessor, microcontroller or multi-core processor. Typically the controller 11 causes the device 10 to take whichever action(s) are required in response to detection of recognised sounds. The controller 11 may also perform the sound classification or control the operation of a sound classification module as is convenient.
Typically the device 10 is implemented using a multi-core processor running a plurality of processes, one of which may be designated as the controller and the others performing the other tasks described herein as required. Each process may be performed in software, hardware or a combination of both, as is convenient. One or more hardware digital signal processors may be provided to perform one or more of the processes as is convenient and applicable.
Advantageously, the device 10 is capable of distinguishing between real-world sounds and non real-world sounds. In this context a real-world sound is a sound that is created, usually spontaneously, in the environment (which in this example comprises the room 12) in which the device 10 is located by a person, object or event in real time. As such, real-world sounds typically comprise sounds that have not been processed by any audio signal processing technique and/or that are not pre-recorded. Real-world sounds may also be said to comprise sounds that have not been rendered by a loudspeaker. Examples include live human and animal utterances, including live speech and other noises, crashes, bangs, alarms, bells and so on. In the present context therefore real-world sounds may be referred to as non-processed sounds, or sounds not emanating from a loudspeaker.
Non real-world sounds are typically sounds that have been processed by one or more audio signal processing technique, and may comprise pre-recorded or broadcast sounds. Non real-world sounds are usually rendered by a loudspeaker. Examples include sounds emanating from a TV, radio, audio or media player and so on. Non real-world sounds may be referred to as processed sounds or sounds emanating from a loudspeaker.
Advantageously, the device 10 is capable of distinguishing between real-world sounds and non real-world sounds even if the sounds are of the same type, e.g. distinguishing between live speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a person in the environment and recorded speech or other sounds (e.g. coughs, sneezes or shouts) emanating from a TV or media player.
In preferred embodiments the device 10 is configured to employ either one or both of the following methods to achieve the above aim:

1) Recognition of sounds by spatial localisation of sound sources
2) Recognising characteristics of sound that indicate that the sound has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression, decompression and/or rendering (or reproduction) via electronic amplifier and/or loudspeaker.
Either or both of the above techniques may be used by the device 10 to determine if a detected sound is a real-world sound or a non-real-world sound. In preferred embodiments, the device 10 is configured to respond only to sounds that it has determined to be real-world sounds.
Figure 2 is a block diagram of a typical embodiment of the sound responsive device 10. The device 10 comprises at least one microphone 18. Typical embodiments include two or more (4 or more is preferred) microphones 18 to facilitate determining the location of sound sources. The device 10 comprises an audio signal processor 20 for receiving and processing audio signals produced by the microphones 18 in response to detecting sounds in the room 12 or other environment. The audio signal processor 20 may take any convenient conventional form, being implemented in hardware, software or a combination of hardware and software. Accordingly, the audio signal processor 20 may be implemented by one or more suitably configured ASIC, FPGA or other integrated circuit, and/or a computing device with suitably programmed microprocessor(s). In preferred embodiments the audio signal processor 20 may be configured to perform any one or more of the following audio signal processing functions: frequency spectrum analysis; compression artefact detection; and/or location analysis. The audio signal processor 20 includes components or other means for performing the relevant audio signal processing functions, as indicated in the example of Figure 2 at 22, 24 and 26. Optionally, the audio signal processor 20 may be configured to perform classification of detected sounds using any conventional sound and/or speech recognition techniques.
Location analysis involves identifying one or more locations in the environment corresponding to the source of detected sounds, i.e. spatial localisation of sound sources within the environment. In the present example, this involves determining the location of the loudspeakers 14.
In preferred embodiments where the device 10 has two or more microphones 18, any one or more of several known techniques may be used to locate the source of a sound in space with accuracy, for example: using differential arrival times (phase difference) at each microphone; and/or using the difference in volume level at each microphone (optionally amplified by the use of highly directionally sensitive microphones).
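As a hedged illustration of the differential-arrival-time technique described above, the following Python sketch estimates the lag between two microphone channels by brute-force cross-correlation and converts it to a bearing angle. The sample rate, microphone spacing and speed of sound used here are illustrative assumptions, not values specified by this document; a practical implementation would use an optimised correlation routine.

```python
import math

def estimate_lag(ref, delayed, max_lag):
    """Return the lag (in samples) that maximises the cross-correlation
    between two microphone channels."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(ref):
            j = i + lag
            if 0 <= j < len(delayed):
                score += r * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def lag_to_angle(lag_samples, sample_rate=48_000, mic_spacing_m=0.1,
                 speed_of_sound=343.0):
    """Convert an inter-microphone lag into a bearing angle (radians)
    relative to the broadside of the microphone pair."""
    delay_s = lag_samples / sample_rate
    # Clamp to the physically possible range before taking the arcsine.
    sin_theta = max(-1.0, min(1.0, delay_s * speed_of_sound / mic_spacing_m))
    return math.asin(sin_theta)
```

With more than two microphones, pairwise lags can be combined to localise the source in two or three dimensions rather than just a bearing.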
The preferred device 10 is operable in a training mode in which it learns the location of one or more non-real-world sound source in its environment. In the present example this involves determining the location of the loudspeakers 14. In the training mode, the device 10 detects sounds using the microphones 18 (or at least two of them) and performs location analysis on the output signals of the microphones 18 to determine the location of one or more loudspeaker or other sound source. Preferably, in the training mode each loudspeaker 14 or other sound source is operated individually (i.e. one at a time) to produce sound for detection by the device 10. Alternatively, two or more loudspeakers 14 or other sound sources may be operated simultaneously in the training mode (for example where two or more loudspeakers 14 are driven by the same TV or other electronic device). In the training mode, the loudspeakers 14 or other sound source may be operated to produce sounds that they would produce during normal operation, or may be operated to produce one or more test sounds. In preferred embodiments, the device 10 is connectable (by wired or wireless connection as is convenient) to one or more of the sound producing devices (e.g. TV, radio, media player or other device having or being connected to one or more loudspeaker 14) in the environment in order to cause them to generate the sounds during the training mode. Advantageously, the device 10 uses test sounds for this purpose and may store test signals for sending to the sound producing devices for this purpose. For example the test signals may include full 5.1 or 7.1 sound signals to deal with environments with cinema-like loudspeaker installations. The preferred device 10 is also operable in a listening mode in which it detects real-world sounds in the environment and may take one or more actions in response to detecting a real-world sound. 
The nature of the actions may depend on a wider functionality of the device 10, or of a system or apparatus of which the device 10 is part. The actions may comprise generating one or more output, for example an audio and/or visual output, and/or one or more output signal for operating one or more other device to which the device 10 is connected or of which it is part. For example the device 10 may be connected to (or be integrated with) a TV or other electronic device and may operate the TV/electronic device depending on one or more detected sounds. The device 10 may be configured to take different actions depending on what sounds are detected. The device 10 itself may be provided with one or more output device (e.g. a loudspeaker, lamp, video screen, klaxon, buzzer or other alarm device or telecommunications device), which it may operate depending on what sounds are detected.
Advantageously, the device 10, upon determining that a detected sound is not a real-world sound, can ignore the detected sound, e.g. take no action in response to the detected sound. Optionally, the device 10 may be configured to take one or more actions in response to detecting non-real-world sounds. Typically such actions are different from those taken in response to detected real-world sounds. In embodiments where the device 10 is configured to classify detected sounds according to multiple sound types (e.g. speech, bangs, doorbells, telephone rings and so on), the device 10 may be configured to take different action (including no action) for real-world sounds and non-real-world sounds even if the sounds are of the same type.
Limitations to the sound source localisation technique include: localising portable devices such as radios or wireless speakers which may be moved regularly; incorrectly ignoring sounds from a person positioned close to one of the locations the device 10 has determined should be ignored; and locating sound sources that are close to the device 10 (e.g. speakers built into a TV set on which the device 10 is located). Such limitations can be mitigated by determining whether or not a detected sound has one or more characteristic indicating that it has been subjected to one or more processes, in particular processes associated with audio recording, audio broadcast and/or audio reproduction, e.g. encoding, decoding, compression and/or rendering via electronic amplifier and/or loudspeaker, rather than being a non-processed, or raw, real-world sound. This analysis can be achieved by performing audio signal processing of the output signals produced by at least one of the
microphones 18 when a sound is detected. Analysis of detected sounds to differentiate between processed and non-processed sounds (and therefore between non-real-world and real-world sounds) can be performed in addition to, or instead of, the spatial localisation of sounds described above.
For example, sound broadcast via TV or radio, sounds produced from a CD, DVD or Blu-ray disc, or streamed media sounds have been subjected to one or more audio processes, including any one or more of the following: A. Encoding
Almost all recorded and/or broadcast sound (barring analogue vinyl records and magnetic tape played directly through an amplifier) has gone through an encoding process. While this can involve high sampling rates and very high-fidelity capture of the original analogue wave form, it will in almost all cases have been subject to a process of band-pass filtering where sounds at a frequency above or below "normal" hearing ranges have been removed (usually from 20Hz to 20kHz). So, although sound encoded at the sampling rate of a CD or higher is often referred to as "lossless", in practice not all of the original information is present and inaudible frequencies and harmonics will be missing;
B. Compression
For broadcast, recording and/or reproduction, audio signals will usually have undergone some form of audio compression, e.g. dynamic range compression. There are lossless forms of compression which can be restored to the full original encoding, but in practice audio signals tend to go through a lossy compression process using a codec (coder-decoder) which removes some of the audio information. For example, codecs commonly use a psychoacoustic technique that relies on knowledge of how humans perceive sound. Psychoacoustic codecs compress the sound by removing parts of the sound that humans do not pay attention to, and/or devoting fewer bits of the data stream to capturing parts of the signal which are less important to the human experience than the others. So, for example, a codec might:
1) divide up the audio signal into multiple frequency bands and devote fewer bits of the compressed encoding to the highest or lowest frequency bands where the human ear/brain is less discerning and more to the range in which normal speech occurs;
2) devote fewer bits to the sound immediately after a loud noise, during which time it is known that the brain is paying less attention;
3) devote fewer bits to frequency ranges with less acoustic energy in the signal - louder sounds are known to mask quieter sounds in human perception; and/or

4) further remove the highest and lowest frequency sounds, i.e. be more aggressive in removing those frequencies which few people can hear - especially as they get older.

Not all audio codecs make use of psycho-acoustics to an appreciable degree, e.g. the popular Aptx (trade mark) codec provided by Qualcomm. In such cases other techniques such as "dithering" are used to mask the audibly unpleasant artefacts of the compression process, and that in turn raises the noise floor of the signal which can be detected as an artefact in the audio signal. Hence, compression of an audio signal can lead to the presence of detectable artefacts in the signal that are not necessarily the result of psycho-acoustic compression techniques.
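The raised noise floor mentioned above lends itself to a simple check: frame the signal, measure per-frame energy, and take a low percentile of the frame energies as the floor estimate. The following Python sketch illustrates this; the frame length, percentile and threshold are assumptions chosen for the example and would in practice be calibrated against reference recordings.

```python
import math

def frame_rms(samples, frame_len=256):
    """Root-mean-square level of each non-overlapping frame."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]

def noise_floor(samples, frame_len=256, percentile=0.1):
    """Estimate the noise floor as a low percentile of per-frame RMS,
    so that frames containing actual sound events are ignored."""
    energies = sorted(frame_rms(samples, frame_len))
    idx = min(len(energies) - 1, int(percentile * len(energies)))
    return energies[idx]

def looks_dithered(samples, threshold=0.01):
    """Flag a signal whose quietest frames are still noticeably noisy."""
    return noise_floor(samples) > threshold
```

A live sound in a quiet room leaves near-silent frames between events, whereas a dithered or heavily compressed reproduction tends to carry a constant residual hiss.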
C. Reproduction
When decoding an encoded signal back to renderable sound, an amplifier generates a varying voltage/current to operate a loudspeaker. In practice, both the amplifier's electrical characteristics, and the loudspeaker's mechanical characteristics leave an imprint on the sound being produced - often referred to as the "transfer function". In most cases loudspeakers associated with a TV have a limited frequency response and yet more of the high and low frequencies will be lost.
In a typical broadcast chain, the audio signal is likely to undergo encoding, compression, decoding and decompression at least once and often more than once as it passes through the various network links from initial recording to studio to transmitter. Different codecs may be used at different stages so the end result may bear traces of more than one kind of processing.
As a result of any one or more of the above (and/or other) processes, processed sounds commonly have one or more characteristics that non-processed real-world sounds do not have, and vice versa. For example, non-processed real-world sounds tend to include audio signal components at higher and/or lower frequencies than processed sounds such as those emanating from a television or audio system. Also, non-processed real-world sounds tend to have less inherent background noise than processed signals. Further, non-processed real-world sounds tend to have a more natural spread of frequency components than processed sounds. Hence the spectral distribution (which may be referred to as spectral power distribution) of the or each audio signal representing a detected sound can provide an indication of whether the sound is a real-world sound or not.
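The absence of high-band energy described above can be checked directly from the spectrum. The sketch below uses a naive DFT so that it stays dependency-free; a real implementation would use an FFT library. The 16 kHz cutoff and the minimum energy fraction are illustrative assumptions, not values taken from this document.

```python
import math

def band_energy_fraction(samples, sample_rate, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz (naive DFT)."""
    n = len(samples)
    total = high = 0.0
    for k in range(1, n // 2):  # skip DC, positive frequencies only
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n)
                 for t in range(n))
        im = sum(-samples[t] * math.sin(2 * math.pi * k * t / n)
                 for t in range(n))
        power = re * re + im * im
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total else 0.0

def maybe_processed(samples, sample_rate, cutoff_hz=16_000,
                    min_fraction=0.001):
    """Flag a sound as possibly processed if high-band energy is absent,
    as it commonly is after lossy encoding and loudspeaker reproduction."""
    return band_energy_fraction(samples, sample_rate, cutoff_hz) < min_fraction
```

Microphone capture hardware must of course have enough bandwidth for such a check to be meaningful; a device sampling at 48 kHz can only observe content up to 24 kHz.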
Even within frequency band(s) that are common to both processed and non-processed sounds, the frequency distribution, i.e. the distribution of the frequency components of the audio signal, and other characteristics of processed sound are detectably different from those of real-world non-processed sounds. Some of these characteristics are complex, e.g. changes in bitrates of encoding (e.g. lower bit rate after a loud noise, or for very high or low frequencies), and introduce identifiable artefacts into the processed audio signals. A processed audio signal may include detectable artefacts arising from any one or more of the processes described above. Accordingly, any sound (and more particularly any corresponding audio signal representing the sound) detected by the device 10 may be analysed in respect of any one or more signal
characteristics in order to identify it as a processed sound or a non-processed sound. The relevant characteristics include, but are not limited to:
i. the frequency content of the audio signal, in particular the presence or absence of signal components in one or more frequency bands, especially a high frequency band (e.g. above 20kHz or above 500kHz) and/or a low frequency band (e.g. below 20Hz or below 50Hz).

ii. the spectral distribution of the audio signal, especially within one or more frequency bands, e.g. between 20Hz and 500kHz, or between 500Hz and 2kHz, from 500Hz to 50kHz (or other frequency range, e.g. a frequency range deemed to correspond with the human voice).

iii. the bitrate of the audio signal, including the absolute bitrate and/or changes in bitrate. For example this may involve detecting different bitrates being used for different frequency components (in particular relatively low bit rates being used for high (e.g. >15kHz) and/or low (e.g. <500Hz) frequency bands), and/or relatively low bitrates being used after a signal event such as loud noise (which may be referred to as a high intensity signal event).
iv. the noise floor level of the audio signal, in particular a relatively high noise floor level (e.g. above a threshold value that can be determined from reference data).
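The characteristics above need not be treated as independent pass/fail tests. One possible arrangement, an assumption of this sketch consistent with the weighted probabilities mentioned elsewhere in this document, is to combine per-indicator scores into a single weighted suspicion score:

```python
def combined_suspicion(scores, weights, threshold=0.5):
    """Weighted average of per-indicator scores; returns (score, is_suspect).

    scores  - dict mapping indicator name to a 0..1 "looks processed" score
    weights - dict mapping the same names to their relative importance
    """
    total_weight = sum(weights[name] for name in scores)
    score = sum(scores[name] * weights[name] for name in scores) / total_weight
    return score, score >= threshold
```

The indicator names, weights and threshold are placeholders; in a trained system these values (or the whole combination function) could instead be learned from reference data.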
To make efficient use of computational resources, it is preferred to trigger the sound analysis once a minimum sound level and/or duration has been reached. For example, a rolling window of sound (e.g. of up to a few seconds) may be captured continuously from each microphone 18, and once the trigger condition(s) has been met a sound segment of defined duration, commencing with the trigger sound, may be put into a queue for analysis. Any convenient early, random or other discard technique may be employed if the queue grows beyond acceptable limits.
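The triggered-capture scheme above can be sketched as follows. The buffer sizes, the amplitude-based trigger and the early-discard policy are illustrative assumptions for the example; the patent leaves the exact trigger condition and discard technique open.

```python
from collections import deque

class TriggeredCapture:
    """Rolling pre-trigger buffer plus a bounded queue of segments
    awaiting analysis, fed one sample at a time."""

    def __init__(self, pre_trigger=4800, segment_len=48000, max_queue=8,
                 level_threshold=0.5):
        self.rolling = deque(maxlen=pre_trigger)  # recent samples, always kept
        self.queue = deque()                      # segments awaiting analysis
        self.segment_len = segment_len
        self.max_queue = max_queue
        self.level_threshold = level_threshold
        self._active = None                       # segment being captured

    def _enqueue(self, segment):
        if len(self.queue) >= self.max_queue:
            self.queue.popleft()                  # simple early-discard policy
        self.queue.append(segment)

    def push(self, sample):
        if self._active is not None:
            self._active.append(sample)
            if len(self._active) >= self.segment_len:
                self._enqueue(self._active)
                self._active = None
        elif abs(sample) >= self.level_threshold:
            # start a segment with the pre-trigger context plus this sample
            self._active = list(self.rolling) + [sample]
            if len(self._active) >= self.segment_len:
                self._enqueue(self._active)
                self._active = None
        self.rolling.append(sample)
```

Keeping the pre-trigger context means the analysis stages see the onset of the sound, not just the portion after the level threshold was crossed.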
Figure 3 shows a preferred operation of the device 10 in the listening mode. In step 301 the device 10 captures a sample of detected sound from the output of one or more microphone 18 in response to the trigger condition(s) being met. In step 302 the device 10 performs location analysis on the detected sound as described above. This may involve determining the location of the sound's source using the phase difference between corresponding signals captured from at least two microphones 18 and/or the sound intensity difference between corresponding signals captured from at least two microphones 18, and may depend on the directional sensitivity of the or each relevant microphone 18. Sounds that are determined as having emanated from the location of a known loudspeaker 14 (as determined during the training mode) can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound.
In step 303 the device 10 performs transfer function, or frequency spectrum, analysis of the detected sound to identify one or more frequency characteristics that are indicative of it being either a real-world sound or a non-real-world sound. Typically this involves determining that the sound is a processed, or non-real-world, sound if it lacks high and/or low frequency components that are commonly removed by audio encoding and/or by rendering via amplifier and/or loudspeaker. Sounds that are determined as having been processed can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound. Alternatively or in addition the transfer function analysis may involve comparing the sound sample (conveniently a transfer function representing the sound sample) against one or more transfer function template associated with audio recording, audio broadcast and/or audio reproduction. Any audio playback system will have a transfer response h(t) and corresponding frequency domain response H(s). Playing the audio source signal sig(t) through the system will convolve sig(t) with h(t), or in a frequency domain representation, multiplication of SIG(s) (being the frequency domain representation of sig(t)) with H(s). For a given transient signal that has sufficient bandwidth across the region of H(s) where there is maximal variability, it is possible to recover an estimate of the multiplicative envelope of H(s) through parameter fitting to produce an estimate with some measure of certainty that the source signal was altered by reproduction through a rebroadcast system. The fitting technique can use any number of standard parametric techniques. The transfer functions for broadcast compression and Blu-ray encoding can for example be used as templates.
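One simple parametric fit of the kind mentioned above can be sketched as follows: compare an observed band-magnitude envelope against a template |H|, allowing an arbitrary overall gain by fitting in the log domain, where a gain becomes an additive offset. The band magnitudes are assumed to come from an earlier spectral-analysis stage, and the tolerance is an illustrative assumption; this is one possible fitting technique, not the one mandated by the patent.

```python
import math

def log_fit_error(observed, template):
    """RMS log-spectral distance between an observed magnitude envelope and
    a template, after removing the best-fitting overall gain."""
    logs = [math.log(o / h) for o, h in zip(observed, template)]
    offset = sum(logs) / len(logs)  # best single gain, in the log domain
    return math.sqrt(sum((l - offset) ** 2 for l in logs) / len(logs))

def matches_template(observed, template, tolerance=0.1):
    """True if the observed envelope matches the template transfer function
    to within the tolerance, regardless of playback volume."""
    return log_fit_error(observed, template) <= tolerance
```

Because the gain is removed before comparison, the check is insensitive to how loudly the television happens to be playing.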
Such templates are best suited to transients such as gunshots, breaking glass or TV screams; they are less effective for narrower-band sounds such as vehicle noise or human speech that does not have significant variability (inflection or emotion).
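A minimal sketch of the template comparison described above might look as follows. The smoothing width, the normalised-correlation metric and the 0.9 threshold are all illustrative assumptions; the description leaves the fitting technique open:

```python
import numpy as np

def spectral_envelope(signal, n_fft=1024):
    """Smoothed magnitude spectrum of a captured transient, as a crude
    stand-in for the multiplicative envelope |H| discussed above."""
    mag = np.abs(np.fft.rfft(signal, n_fft))
    kernel = np.ones(8) / 8.0                  # simple moving-average smoothing
    return np.convolve(mag, kernel, mode="same")

def matches_template(signal, template_env, threshold=0.9):
    """Compare the captured envelope against a stored transfer-function
    template (e.g. one derived from broadcast compression or Blu-ray
    encoding).  A high normalised correlation suggests the sound passed
    through that reproduction chain."""
    env = spectral_envelope(signal)
    env = env / (np.linalg.norm(env) + 1e-12)
    t = template_env / (np.linalg.norm(template_env) + 1e-12)
    return float(np.dot(env, t)) >= threshold
```

In use, one template per known reproduction chain would be stored, and a sample matching any template would be marked as a suspected non-real-world sound.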
In step 304 the device 10 looks for artefacts in the detected sound (i.e. in the corresponding frequency spectrum and/or waveform of the corresponding audio signal) which indicate that the sound has been subjected to audio compression, e.g. psycho-acoustic compression or another compression technique. This may involve identifying relatively low bitrate encoding in high and/or low frequency bands, and/or a reduction in encoding quality after a loud noise, and/or a noise floor level that can be associated with compression. Sounds that are determined as having been subjected to compression can be rejected, i.e. ignored, or marked as being suspected of being a non-real-world sound. Sounds can be deemed to be processed or non-real-world sounds (and therefore ignored or rejected) upon being identified as such by any one of steps 302, 303 or 304, or alternatively upon being identified as such by any two or more of steps 302, 303 and 304. Any determinations made by the audio signal processor 20 in this regard may be communicated to the controller 11, which may make the decision on whether or not to ignore the detected sound and/or determine which actions are to be taken in response to the detected sound.
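One simple artefact cue of the kind step 304 describes is the spectral cut-off left by lossy codecs, which typically discard content well below the Nyquist frequency. The sketch below flags a sample whose bandwidth stops short of an assumed fraction of Nyquist; the -60 dB floor and 0.8 fraction are illustrative, not values from the description:

```python
import numpy as np

def bandwidth_cutoff(signal, sample_rate, floor_db=-60.0):
    """Highest frequency whose energy exceeds an assumed noise floor.
    Lossy codecs typically truncate the spectrum well below Nyquist,
    so a low cut-off hints at prior compression."""
    mag = np.abs(np.fft.rfft(signal))
    mag_db = 20 * np.log10(mag / (mag.max() + 1e-12) + 1e-12)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    above = np.nonzero(mag_db > floor_db)[0]
    return freqs[above[-1]] if above.size else 0.0

def looks_compressed(signal, sample_rate, min_fraction=0.8):
    """Flag the sample as a suspected non-real-world sound if its bandwidth
    stops short of `min_fraction` of the Nyquist frequency."""
    return bandwidth_cutoff(signal, sample_rate) < min_fraction * sample_rate / 2
```

A fuller implementation might also track the noise floor level itself and quality drops after loud transients, as the paragraph above suggests.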
It is noted that the sequence of steps 302, 303, 304 in Figure 3 is illustrative; in alternative embodiments these (and/or other) steps may be performed in different orders, merged, and/or operated in parallel, depending on the requirements of the application and the capabilities of the device 10. In preferred embodiments the device 10 combines the techniques of location analysis, spectrum analysis and artefact detection. In alternative embodiments any one of these techniques may be used on its own, or in combination with any other of the techniques. For example, spectrum analysis and artefact detection in particular may each be sufficient on its own to achieve an effective level of specificity for a given use-case.
It is noted that there are at least two approaches to implementation of spectrum analysis and artefact detection:
1) Development of one or more specific algorithm to detect the or each relevant signal characteristic, for example based on analysis of reference data, and
2) Training using machine-learning techniques - this may, for example, involve training the device 10 with training data which may comprise pairs of sound samples - an original sound generated live (or a very high-fidelity or artefact-free recording), and the same sound after typical encoding, compression and/or reproduction.
The second process does not generate an algorithm as such, and it may not be apparent how the system achieves its subsequent level of effective differentiation. The machine-learning approach may also collapse steps in the processing: in other words, it may not be necessary to separately look for spectrum differences and compression artefacts; a trained system may simply learn the difference between processed and non-processed sounds using whatever characteristics it finds to be most capable of allowing the distinction to be made. The machine-learning approach may involve providing the device 10 with reference real-world and non-real-world sounds, the device 10 being configured through machine-learning to develop its own criteria empirically for distinguishing between them. These criteria may involve elements of location, spectral distribution and artefacts, and may differ for different types of input sound.
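The paired-sample training idea can be illustrated with a deliberately tiny stand-in for the learning stage. A nearest-centroid rule on normalised spectra is far simpler than what a deployed system would use, and the feature choice here is an assumption; it only demonstrates training on (live, processed) pairs without hand-coding any specific artefact detector:

```python
import numpy as np

def features(signal, n_fft=512):
    """Normalised spectral shape as a simple feature vector; a real system
    would learn richer features."""
    mag = np.abs(np.fft.rfft(signal, n_fft))
    return mag / (mag.sum() + 1e-12)

class PairTrainedClassifier:
    """Nearest-centroid stand-in for the machine-learning stage: trained on
    pairs of live and processed versions of the same sounds, it exploits
    whatever spectral differences separate the two classes."""
    def fit(self, live_samples, processed_samples):
        self.live_c = np.mean([features(s) for s in live_samples], axis=0)
        self.proc_c = np.mean([features(s) for s in processed_samples], axis=0)
        return self

    def is_real_world(self, signal):
        f = features(signal)
        return np.linalg.norm(f - self.live_c) < np.linalg.norm(f - self.proc_c)
```

Note that, exactly as the paragraph above anticipates, the classifier never separately tests for spectrum differences or compression artefacts; the distinction is implicit in the learned centroids.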
An example of a practical application of the device 10 is now described for illustration purposes. In this example it is assumed that the device 10 is installed in the home of a vulnerable person to monitor their health and safety. For maximum visibility of the room 12 being monitored the device 10 is positioned on top of the TV set 16.
The device 10 is intended to monitor for coughs, sneezes, cries for help, sounds of danger and other noises but in a normal home the TV is likely to be active for several hours a day and generate many similar artificial sound events.
The device 10 has a plurality of microphones 18 and audio signal processing circuitry 20 configured to perform the following:
• Separate processing of the audio signal from each microphone 18
• Capture of audio input samples in response to detection of a trigger signal, e.g. when sound exceeding a trigger intensity and/or duration is detected
• Measurement of the phase shift between corresponding sound samples from each (or at least two) microphone 18
• Audio signal analysis of each sample, which may involve transfer function analysis and/or artefact detection
During the training mode the device 10 determines the position of the loudspeakers 14 within the room 12, preferably by playing test signals through the television (e.g. via HDMI or other connection) and detecting the corresponding sounds rendered by the loudspeakers 14 using the microphones 18. At a minimum it is preferred that alternate left and right channel test signals are used, but more preferably test signals for 2.1, 5.1 and 7.1 sound set-ups are used, selecting channels and frequencies as appropriate.
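The per-channel test signals could be generated along the following lines. The tone frequency, level and duration are arbitrary choices for illustration; the description only requires that channels are exercised one at a time so each loudspeaker can be localised:

```python
import numpy as np

def channel_test_signal(channel_index, n_channels, sample_rate=48000,
                        tone_hz=1000.0, duration=0.5):
    """Multi-channel test clip with a tone on one channel only, so the
    device can associate each loudspeaker with a direction in turn.
    Returns an array of shape (samples, n_channels)."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    tone = 0.5 * np.sin(2 * np.pi * tone_hz * t)
    out = np.zeros((t.size, n_channels))
    out[:, channel_index] = tone
    return out

def training_sequence(n_channels):
    """Yield one test clip per channel: 2 for stereo, 6 for 5.1, 8 for 7.1."""
    for ch in range(n_channels):
        yield ch, channel_test_signal(ch, n_channels)
```

For each clip played, the device would run its location analysis on the captured microphone signals and store the resulting bearing against that channel's loudspeaker.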
During the listening mode, the device 10 can perform location analysis and reject sounds from the designated speaker locations. Alternatively, sounds from those locations can simply be marked as "suspect" and further processed before making a final decision, for example based on weighted probabilities from each phase of analysis.
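The weighted-probability decision mentioned above might be combined as in this sketch, where each analysis phase (location, spectrum, artefact) contributes a probability that the sound is non-real-world. Equal weights and a 0.5 threshold are placeholder assumptions; a practical device would tune them:

```python
def combined_suspicion(stage_probs, stage_weights):
    """Weighted average of per-stage probabilities that the sound is a
    non-real-world sound (location, spectrum and artefact stages)."""
    total = sum(w * p for p, w in zip(stage_probs, stage_weights))
    return total / sum(stage_weights)

def reject_sound(stage_probs, stage_weights=(1.0, 1.0, 1.0), threshold=0.5):
    """Final decision: ignore the sound if the combined suspicion that it is
    non-real-world meets the threshold."""
    return combined_suspicion(stage_probs, stage_weights) >= threshold
```

This keeps each stage's output soft, so a sound from a suspect direction is only rejected when the spectral and artefact evidence agrees.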
The invention is not limited to the embodiment(s) described herein but can be amended or modified without departing from the scope of the present invention.

Claims

CLAIMS:
1. A method of operating a sound responsive device comprising at least one microphone for receiving sounds from an environment, the method comprising: detecting a sound at said at least one microphone;
producing a corresponding audio signal from the or each microphone;
performing audio signal processing on the corresponding audio signal from the or each microphone;
determining from said audio signal processing if said sound is a real-world sound;
performing at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
2. The method of claim 1, wherein said determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic corresponding to one or more audio signal processing process.
3. The method of claim 2, wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio recording, audio broadcast and/or audio reproduction.
4. The method of claim 2 or 3 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio encoding.
5. The method of any one of claims 2 to 4 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio compression.
6. The method of claim 5, wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more compression artefact.
7. The method of any one of claims 2 to 6 wherein said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more characteristic associated with audio rendering by an electronic amplifier and/or loudspeaker.
8. The method of any one of claims 2 to 7, wherein said audio signal processing comprises frequency analysis, and said determining comprises determining that said sound is not a real-world sound if the corresponding audio signal from the or each microphone comprises one or more frequency characteristic corresponding to one or more audio signal processing process.
9. The method of claim 8 wherein said one or more frequency characteristic comprises a spectral distribution of said audio signal.
10. The method of claim 9 wherein said one or more frequency characteristic comprises a spectral distribution of said audio signal in one or more frequency band that is common to both real-world and non-real-world sounds.
11. The method of claim 10 wherein said one or more frequency band comprises the frequency band from 20Hz to 500Hz, or from 500Hz to 50kHz.
12. The method of any one of claims 8 to 11 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in one or more frequency bands.
13. The method of claim 12 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in a low frequency band, for example below 500Hz.
14. The method of claim 12 or 13 wherein said one or more frequency characteristic comprises an absence of frequency components of said audio signal in a high frequency band, for example above 10kHz.
15. The method of any one of claims 2 to 14 wherein said one or more characteristic comprises one or more bit rate characteristic.
16. The method of claim 15 wherein said one or more bit rate characteristic comprises a change in bit rate.
17. The method of claim 15 or 16 wherein said one or more bit rate characteristic comprises use of different bit rates for different frequency bands of the audio signal.
18. The method of claim 17 wherein said one or more bit rate characteristic comprises use of a relatively low bit rate for a relatively low frequency band, for example below 500Hz.
19. The method of claim 17 or 18 wherein said one or more bit rate characteristic comprises use of a relatively low bit rate for a relatively high frequency band, for example above 10kHz.
20. The method of claim 15 wherein said one or more bit rate characteristic comprises a change in bit rate, in particular a reduction of the bit rate, after a high intensity signal event.
21. The method of any one of claims 2 to 20, wherein said one or more characteristic comprises noise floor level.
22. The method of claim 21, wherein said one or more characteristic comprises the noise floor level being above a threshold level.
23. The method of any preceding claim wherein said determining comprises determining if said sound was rendered by a loudspeaker.
24. The method of any preceding claim wherein said determining from said audio signal processing if said sound is a real-world sound comprises comparing the audio signal from the or each microphone against at least one reference template, and determining that said sound is not a real-world sound if said audio signal matches said at least one reference template.
25. The method of claim 24 wherein the or each template is a transfer function template, for example a transfer function template corresponding to any one of an audio recording process, an audio broadcast process and/or an audio reproduction process.
26. The method of any one of claims 2 to 25 wherein said one or more characteristics are derived empirically from training data.
27. The method of claim 26 wherein said training data comprises data representing pairs of non- processed and corresponding processed sound samples.
28. The method of claim 26 or 27 wherein said one or more characteristics are derived from said training data by machine-learning.
29. The method of any preceding claim wherein said determining if said sound is a real-world sound comprises determining that said sound is not a real-world sound if it emanated from any one of at least one designated location in said environment.
30. A sound responsive device comprising at least one microphone for receiving sounds from an environment, the device further comprising audio signal processing means configured to perform audio signal processing on audio signals produced by the or each microphone to determine if said sound is a real-world sound, the device being configured to perform at least one action in response to detection of said sound only if said sound is determined to be a real-world sound.
PCT/EP2018/067333 2017-06-28 2018-06-27 Sound responsive device and method WO2019002417A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1710286.4A GB2563868B (en) 2017-06-28 2017-06-28 Sound responsive device and method
GB1710286.4 2017-06-28

Publications (1)

Publication Number Publication Date
WO2019002417A1 true WO2019002417A1 (en) 2019-01-03

Family

ID=59523583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/067333 WO2019002417A1 (en) 2017-06-28 2018-06-27 Sound responsive device and method

Country Status (2)

Country Link
GB (1) GB2563868B (en)
WO (1) WO2019002417A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240647B (en) * 2022-06-20 2024-10-22 西北工业大学 Sound event detection method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998034216A2 (en) * 1997-01-31 1998-08-06 T-Netix, Inc. System and method for detecting a recorded voice
JP2005250233A (en) * 2004-03-05 2005-09-15 Sanyo Electric Co Ltd Robot device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BRIAN D'ALESSANDRO ET AL: "Mp3 bit rate quality detection through frequency spectrum analysis", MULTIMEDIA AND SECURITY, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 7 September 2009 (2009-09-07), pages 57 - 62, XP058088142, ISBN: 978-1-60558-492-8, DOI: 10.1145/1597817.1597828 *
GRIGORAS ET AL: "Statistical Tools for Multimedia Forensics", CONFERENCE: 39TH INTERNATIONAL CONFERENCE: AUDIO FORENSICS: PRACTICES AND CHALLENGES; JUNE 2010, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, 17 June 2010 (2010-06-17), XP040567050 *
HANY FARID: "Detecting Digital Forgeries Using Bispectral Analysis", 1 January 1999 (1999-01-01), XP055499185, Retrieved from the Internet <URL:https://dspace.mit.edu/bitstream/handle/1721.1/6678/AIM-1657.pdf?sequence=2> [retrieved on 20180813] *
HENNEQUIN ROMAIN ET AL: "Codec independent lossy audio compression detection", 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 5 March 2017 (2017-03-05), pages 726 - 730, XP033258513, DOI: 10.1109/ICASSP.2017.7952251 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112578688A (en) * 2019-09-12 2021-03-30 中强光电股份有限公司 Projection device and sound output control device thereof
WO2021099760A1 (en) * 2019-11-21 2021-05-27 Cirrus Logic International Semiconductor Limited Detection of live speech
GB2603397A (en) * 2019-11-21 2022-08-03 Cirrus Logic Int Semiconductor Ltd Detection of live speech
US11705109B2 (en) 2019-11-21 2023-07-18 Cirrus Logic, Inc. Detection of live speech
US12142259B2 (en) 2019-11-21 2024-11-12 Cirrus Logic Inc. Detection of live speech

Also Published As

Publication number Publication date
GB201710286D0 (en) 2017-08-09
GB2563868A (en) 2019-01-02
GB2563868B (en) 2020-02-19

Similar Documents

Publication Publication Date Title
JP7271674B2 (en) Optimization by Noise Classification of Network Microphone Devices
US11183198B2 (en) Multi-mode audio recognition and auxiliary data encoding and decoding
JP7397066B2 (en) Method, computer readable storage medium and apparatus for dynamic volume adjustment via audio classification
US10275210B2 (en) Privacy protection in collective feedforward
JP6576934B2 (en) Signal quality based enhancement and compensation of compressed audio signals
CN102016994B (en) An apparatus for processing an audio signal and method thereof
CN103886731B (en) A kind of noise control method and equipment
US9959886B2 (en) Spectral comb voice activity detection
US9584899B1 (en) Sharing of custom audio processing parameters
CN102124758A (en) Hearing aid, hearing assistance system, walking detection method, and hearing assistance method
RU2008142956A (en) DEVICE FOR DATA PROCESSING AND METHOD OF DATA PROCESSING
CN105430191B (en) The adjusting processing method and processing device of volume
KR20200113058A (en) Apparatus and method for operating a wearable device
WO2019002417A1 (en) Sound responsive device and method
CN107168677A (en) Audio-frequency processing method and device, electronic equipment, storage medium
CN107358964B (en) Method for detecting warning signs in changing environment
CN103903606A (en) Noise control method and device
US10853025B2 (en) Sharing of custom audio processing parameters
TWI831785B (en) Personal hearing device
CN109997186B (en) Apparatus and method for classifying acoustic environments
WO2017156895A1 (en) Multimedia playing method and device
EP2849341A1 (en) Loudness control at audio rendering of an audio signal
EP3419021A1 (en) Device and method for distinguishing natural and artificial sound
JP2010230972A (en) Voice signal processing device, method and program therefor, and reproduction device
CN213877572U (en) Human voice enhancement and environment prediction system based on deep learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18734552

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.03.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18734552

Country of ref document: EP

Kind code of ref document: A1