GB2541466A - Replay attack detection
- Publication number
- GB2541466A · GB1514943.8A · GB201514943A
- Authority
- GB
- United Kingdom
- Prior art keywords
- voice
- live
- sample
- utterance
- replay
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/32—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
- H04L9/3226—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using a predetermined code, e.g. password, passphrase or PIN
- H04L9/3231—Biological data, e.g. fingerprint, voice or retina
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/002—Countermeasures against attacks on cryptographic mechanisms
Abstract
A voice-based biometric authentication system distinguishes live utterances from replayed utterances by extracting spectral features from a voice sample (e.g. the mean signal magnitude over a frequency/time sub-window of a speech frame, or Linear Frequency Cepstral Coefficients) and comparing them against, for example, live voice models and replay voice models generated using probabilistic models.
Description
REPLAY ATTACK DETECTION
[0001] This invention relates to determining whether a received voice sample is a replay of a pre-recorded voice sample. In particular, certain embodiments of the present invention relate to determining whether (or how likely it is that) the received sample corresponds to a live voice sample or a replayed voice sample. Furthermore, certain embodiments of the present invention describe making this determination using spectral information extracted from at least one frame of the received sample.
BACKGROUND
[0002] Biometric authentication systems and methods are used in many areas of modern life. In particular, biometric authentication is frequently used in systems where sensitive or personal data is involved, to secure transactions which are considered high risk, or in other scenarios where security is a concern. Biometric authentication may be employed when a participant in such a system needs to transmit, receive or otherwise access sensitive or personal data. Where sensitive or personal data is transmitted to or received from another participant, one or both participants may need to be authenticated to ensure their identity is satisfactory to the other participant. Often it may just be the participant who wishes to be sent data who requires authentication, since the other party may be implicitly authenticated through their possession of the data.
[0003] Voice-based biometric authentication systems have gained popularity in electronic commerce systems because they potentially offer an authentication means which is more advanced and secure than a traditional PIN/password scheme. Voice authentication is readily implemented due to the widespread adoption of mobile electronic devices incorporating microphones, such as smartphones and other phones, allowing the user to transmit his/her voice sample directly to the authenticating system. While it is becoming increasingly important to ensure a satisfactory user experience, the security of such systems is paramount. As the usage and popularity of voice biometric systems increase, so does the sophistication of attacks by fraudulent parties.
[0004] There are two main types of voice authentication system: text-dependent and text-independent. A text-dependent (TD) system requires the user to say exactly the given or enrolled passphrase(s) during authentication. Meanwhile, a text-independent (TI) system poses no such constraint on speech content and the user is able to speak freely to the system. The authentication process is similar for both TI and TD apart from the content of the passphrase. The authentication process is depicted in Figure 1, where an electronic device 200 receives a voice sample from a user 100. The electronic device 200 may then transmit the received voice sample (for instance across a wired or wireless telephony channel) to a voice biometrics engine 350 which is configured to assess whether or not the received voice sample was provided by the legitimate user. For example, the voice biometrics engine 350 may achieve this by comparing the received voice sample to a model which has previously been generated. Based on an outcome of this comparison, the voice biometrics engine 350 determines the likelihood that the user 100 is the legitimate user 100 who performed the initial enrolment, or another user (either another legitimate user or potentially a fraudster attempting to impersonate the first legitimate user). For instance, a comparison score may be checked against a threshold which is set in the voice biometrics engine 350. If the comparison score exceeds the threshold (that is, when the voice sample corresponds suitably well to the model), then it is determined that the user 100 is indeed the legitimate user 100. If the comparison score does not exceed the threshold (that is, when the voice sample does not correspond suitably well to the model), then it is determined that the user 100 is not the legitimate user 100. It will be apparent that increasing or decreasing the threshold will affect the security of the authentication system, with a higher threshold requiring a closer match between a received utterance of the passphrase and the stored model before authenticating.
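As a minimal illustration of this threshold check, the following sketch (in Python) encodes the accept/reject decision; the score scale, the example threshold value and the function name are illustrative assumptions, not values taken from this disclosure.

```python
def authenticate(comparison_score: float, threshold: float = 0.7) -> bool:
    """Accept the claimed identity only if the received voice sample
    matched the enrolled model closely enough (higher = closer match)."""
    return comparison_score > threshold

# Raising the threshold tightens security; lowering it favours convenience.
print(authenticate(0.82))  # True  -> treated as the legitimate user
print(authenticate(0.55))  # False -> authentication denied
```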
[0005] A TD system may require fewer training and testing utterances than a TI system to achieve good performance. By using a specific passphrase, the authentication process is also faster, as the time taken to recite the specific passphrase is relatively short. This improves user convenience. However, a TD system is extremely susceptible and vulnerable to replay attacks. For example, a fraudulent user (fraudster) attempts to authenticate themself in the system as the legitimate user using a recording of the user uttering the passphrase. An example of such an attack is illustrated in Figure 2, where it is shown that a fraudster 150 records the legitimate user 100 speaking the passphrase. The fraudster 150 then provides the recording to the electronic device 200 when the fraudster wishes to gain access. For example, the fraudster 150 may use a recording device 250 comprising a microphone (for example, a smartphone) to record the legitimate user 100 uttering the passphrase. The fraudster 150 may then use a device 250 comprising a loudspeaker (which may be the same device as used in the recording, as would be possible with a smartphone) to play the recorded utterance to the electronic device 200, which then provides it to the voice biometrics engine 350. Alternatively, the fraudster may provide the recorded utterance to the voice biometrics engine 350 by passing the recording directly into a telephony channel (established with the voice biometrics engine 350) without needing to use a loudspeaker. Because the replayed utterance is of the correct passphrase and from the legitimate user 100, the voice biometrics engine 350 is likely to determine that it is suitably close to the model and so allows access to the fraudster 150.
[0006] A TI system requires the user to speak for a longer period of time before an authentication system is able to suitably determine whether or not the user is legitimate. Like TD, TI is also vulnerable to replay attacks; and there could be more opportunity for the fraudster to mount one, as the legitimate user could be recorded saying anything (not limited to a specific passphrase). Once enough speech has been collected, the fraudster could replay the recordings in an attempt to gain authentication as the legitimate speaker. Alternatively, the fraudster could pick and mix the right words from the recordings to form the correct passphrase and gain access through a TD system. In conclusion, a replay attack can target any kind of voice authentication system, be it a passphrase for TD, free speech for TI, or even a hybrid system such as a text-prompted system, if the fraudster puts in enough effort.
BRIEF SUMMARY OF THE DISCLOSURE
[0007] It is an aim of certain embodiments of the present invention to provide a method for determining an indication of whether a received voice-related sample relates to a live utterance spoken and coming directly from a human speaker (hereinafter referred to as the ‘live utterance’) or a replayed utterance that is pre-recorded and played back using a device (hereinafter referred to as the ‘replayed utterance’). Receiving the latter constitutes a replay attack.
[0008] According to a first aspect of the present invention, a method for detecting a replay attack in a biometric system is provided. The method comprises: receiving a voice sample; extracting spectral information from the voice sample; processing the spectral information to obtain data corresponding to at least one of: signal magnitude at a specific time; the mean of signal magnitude for a sub-window defining a frequency and time range; and the standard deviation of signal magnitude for a sub-window defining a frequency and time range; and determining, based on the obtained data, an indication regarding whether the sample relates to a live utterance or a replayed utterance.
[0009] In certain embodiments, the spectral information is extracted from at least one frame of the voice sample; and the at least one frame corresponds to a speech region of the voice sample.
[0010] In certain embodiments, the determining comprises: comparing the obtained data against corresponding data obtained from at least one of a live voice model and a replay voice model; and determining the indication based on the result of the comparison.
[0011] In certain further embodiments, the live voice model and the replay voice model are generated using probabilistic models.
[0012] In certain further embodiments, the specific time is one of a plurality of predetermined times; the spectral information is processed to obtain data corresponding to each of the predetermined times; and determining the indication further comprises: comparing the obtained data for each of the predetermined times against corresponding data obtained from at least one of the live voice model and the replay voice model; and determining the indication based on a combination of the results of the comparisons.
[0013] According to a second aspect of the present invention, a method for detecting a replay attack in a biometric system is provided. The method comprises: receiving a voice sample; extracting spectral information from the sample; processing the spectral information to obtain a plurality of Linear Frequency Cepstral Coefficients, LFCCs, to which unity-variance normalisation is applied; and determining, based on a comparison between the obtained LFCCs and at least one of a live voice model and a replay voice model, an indication regarding whether the sample relates to a live utterance or a replayed utterance.
[0014] In certain embodiments, the spectral information is extracted from at least one frame of the voice sample; and the at least one frame corresponds to a speech region of the voice sample.
[0015] In certain embodiments, the determining comprises: comparing at least one of the obtained LFCCs and delta LFCCs derived from the obtained LFCCs with corresponding statistics from at least one of the live voice model and the replay voice model; and determining the indication based on the result of the comparison.
[0016] In certain other embodiments, the determining comprises: comparing the obtained LFCCs with corresponding statistics from at least one of the live voice model and the replay voice model; and determining the indication based on the result of the comparison; wherein the live voice model is trained with speech related data and the replay voice model is trained with non-speech related data.
[0017] In certain further embodiments, zero-mean normalization is further applied to the obtained LFCCs; and wherein comparing the obtained LFCCs comprises comparing the zero-mean normalized LFCCs to corresponding statistics from at least one of the live voice model and the replay voice model.
[0018] In certain embodiments, the live voice model and the replay voice model are generated using probabilistic models.
[0019] In certain embodiments of either aspect, wherein if it is indicated that the received sample relates to the live utterance, the method further comprises determining whether or not to authenticate on the basis of the received sample; and wherein if it is indicated that the received sample relates to the replay utterance, the method further comprises denying authentication.
[0020] In certain embodiments of the second aspect, the method further comprises: processing the spectral information to obtain data corresponding to at least one of: signal magnitude at a specific time; the mean of signal magnitude for a sub-window defining a frequency and time range; and the standard deviation of signal magnitude for a sub-window defining a frequency and time range; and determining, based on the obtained data, a further indication regarding whether the sample relates to a live utterance or a replayed utterance.
[0021] In certain embodiments of either aspect, the method further comprises: processing the spectral information to obtain a plurality of ratios of signal magnitude at a specific time, wherein each ratio is calculated between two frequency ranges; calculating at least one product of at least two of the calculated ratios; and determining, based on the at least one product, a further indication regarding whether or not the received sample relates to a live utterance or a replayed utterance.
[0022] In certain further embodiments, determining a further indication comprises comparing the product to a threshold; wherein the indication is based on the result of the comparison.
[0023] In certain further embodiments, the specific time is one of a plurality of predetermined times; and determining the further indication is based on a combination of a plurality of products, the at least one product having been calculated for each of the predetermined times.
[0024] Also disclosed is a method for detecting a replay attack in a biometric authentication system, the method comprising: receiving a sample related to a voice of a user; extracting spectral information from at least one frame of the sample, the at least one frame corresponding to a speech region of the sample; processing the spectral information to obtain a plurality of ratios of spectral magnitude, wherein each ratio is calculated between a pair of frequency ranges; and determining, based on the product of at least one of the calculated ratios, an indication regarding whether or not the received sample relates to a live utterance or a replayed utterance.
[0025] According to certain embodiments of the present invention, a method for detecting a replay attack in a biometric authentication system may comprise any combination of the above-described methods for detecting a replay attack; wherein an indication regarding whether or not a received sample relates to a live utterance or a replay utterance is determined based on individual indications resulting from each method.
[0026] According to another aspect of the present invention, a replay detector is provided, wherein the replay detector is arranged to perform at least one of the above-described methods for detecting a replay attack in a biometric authentication system.
[0027] Another aspect of the present invention provides a computer program comprising instructions arranged, when executed, to implement a method and/or apparatus in accordance with any one of the above-described aspects. A further aspect provides machine-readable storage storing such a program.
BRIEF DESCRIPTION OF THE DRAWINGS
[0028] Embodiments of the invention are further described hereinafter with reference to the accompanying drawings, in which:
Figure 1 illustrates a user performing an authentication process with an authentication device;
Figure 2 illustrates a fraudulent user performing an authentication process with an authentication device using a replay attack;
Figure 3 schematically illustrates an authentication system in accordance with an embodiment of the present invention;
Figure 4 schematically illustrates a replay attack detection method in accordance with an embodiment of the present invention;
Figure 5 is a flowchart illustrating a method of detecting a replay attack according to an embodiment of the present invention;
Figure 6 is a flowchart illustrating a method of detecting a replay attack according to another embodiment of the present invention;
Figure 7 is a flowchart illustrating a method of detecting a replay attack according to yet another embodiment of the present invention;
Figure 8 shows examples of spectrogram outputs for a live signal and a replayed signal; note that the magnitude is presented on a logarithmic scale.
Figure 9 shows examples of the gradient of the magnitude of the time-frequency response at time t=2.05s for signals corresponding to the spectrogram outputs of Fig. 8; the gradient is taken to illustrate the periodic peaks and troughs more effectively.
Figure 10 shows examples of the mean of the magnitude of a sub-window along the frequency axis at time t=1.65s for signals corresponding to the spectrogram outputs of Fig. 8;
Figure 11 shows examples of the standard deviation of the magnitude of a sub-window along the frequency axis at time t=1.35s for signals corresponding to the spectrogram outputs of Fig. 8.
Figure 12 shows examples of static and delta LFCCs for a live signal and a replayed signal.
DETAILED DESCRIPTION
[0029] Embodiments of the present invention will now be described in the context of a voice-based biometric authentication system in which a user repeats a passphrase to an authentication device in an attempt to authenticate themselves, thereby being recognised as the legitimate user of a service. However, it will be appreciated that the present invention may be implemented in other suitable systems, such as those involving multiple passphrases.
[0030] Furthermore, embodiments of the present invention may extend to systems whereby a passphrase is recreated, by a fraudster, from one or more recordings of the legitimate user speaking. In these cases, the legitimate user may not necessarily be uttering the passphrase, and so the fraudster uses recordings of unrelated utterances (that is, speech of the legitimate user which may not be intended for the authentication process) to reconstruct the required passphrase.
[0031] Further still, embodiments of the present invention may extend to text-independent systems where a passphrase is not utilised in the authentication method. In these cases, the fraudster may provide recordings of utterances of the legitimate user to a voice biometric engine to be authenticated in the system. For example, the fraudster may have recorded or obtained a suitable amount of sample utterances (which may not specifically be related to the authentication process) from the legitimate user such that, when provided with a voice sample comprising the obtained utterances, the voice biometric engine is convinced that the voice sample is from the legitimate user (that is, the voice biometric engine (described below) confirms that the received voice sample relates to the acoustic features known to be associated with the voice of the legitimate user).
[0032] In a traditional voice-based authentication system, the decision whether or not to authenticate the user may be based on a comparison between a received utterance and a single stored model corresponding to an utterance of the legitimate/claimed speaker. The received utterance is either a passphrase of fixed content for text-dependent (TD) systems or free speech for text-independent (TI) systems. The stored model is usually previously registered in the system by the legitimate user during an enrolment process. As a result, the system is merely checking whether the stored acoustic features correspond to the acoustic features of the received utterance, and authenticates or denies the user on the basis of this comparison. This manner of system is not able to detect whether the received utterance is vocalised by a human (a live utterance) or comes from a loudspeaker or, equally, from a recording inserted into a telephony channel by the fraudster (a replayed utterance). Furthermore, some voice biometrics systems include normalisation techniques which are specifically designed to neutralise acoustic variations. The acoustic variations from a replayed signal could to some extent be neutralised by these normalisation techniques, hence making the system even more vulnerable to a replay attack.
[0033] It has been observed that the act of recording an utterance has the effect of altering the acoustic features (for example, spectral properties) of the utterance. That is, differences exist between the live utterance and its recorded version. For instance, to record the utterance, a fraudster may be at a distance from the legitimate user when the legitimate user is speaking. Here, far-field recording causes an increase in the noise and reverberation level of the signal. These effects on the recorded signal arise from reduced modulation indexes of the signal, and also from flattening of the spectrum. It will be appreciated that these effects on the recorded signal are in addition to the effect of capturing a voice sample through a microphone, which would be applied equally to the live utterances of a legitimate user when seeking authentication. In other words, a replayed utterance contains channel noise introduced by two recording devices and one playback device, whilst a live (legitimate) utterance would only have channel noise inflicted by the recording device of the voice biometric system.
[0034] Additionally, it has been observed that replaying a recorded utterance using a loudspeaker, for example, introduces further changes to the acoustic features of the recorded utterance. Of course, those further changes are not applicable where direct insertion of the recorded utterance into the telephony channel is used. These further changes may have the effect of even further distinguishing the acoustic features of a replayed utterance from that of a live utterance. For instance, a loudspeaker such as may be used by the fraudster may be unable to provide a good frequency response at low frequencies. As a result, there may be a noticeable reduction in signal amplitude at low frequencies for a replayed voice sample in comparison to the low frequency signal amplitude of a live voice sample.
[0035] In an embodiment of the present invention, a voice-based biometric authentication system is implemented in a system comprising an authentication device and a replay detector module. It will be appreciated that these components of the system may be separate devices or combined together in some suitable manner. For instance, the replay detector module may, for example, be implemented in a server in communication with an authentication device which is implemented using an electronic device (for example, a mobile smartphone).
[0036] An example of this type of system is shown in Figure 3. In Figure 3, an electronic device 200 receives a voice-related sample from a user 100 corresponding to an utterance of a passphrase. The electronic device 200 may provide this sample to a replay detector 300. The replay detector 300 may then pass the sample to a voice biometrics engine 350. It will be appreciated that the voice biometrics engine 350 may be combined with the replay detector 300 in a single component. Likewise, it will be understood that the voice biometrics engine 350 and the replay detector 300 may form separate components. Furthermore, the replay detector 300 and the voice biometrics engine 350 components may be provided in sequence or in parallel, such that their operation is either simultaneous (for example, both components receive the voice sample and perform their respective operations before it is determined whether or not to authenticate the sample provider in the system) or contingent on the other component having performed its operations first (that is, one of the components receives the voice sample first, performs its respective operations, and then, potentially depending on the outcome of these operations, transmits the voice sample to the other component).
[0037] The replay detector 300 is configured to analyse the sample and determine the likelihood that it corresponds to either a legitimate attempt to gain authorization in the system or a replay attack most likely from a fraudster. It will be appreciated that, in certain embodiments, this likelihood of legitimacy may be represented as an indication, score or series of scores which result from the analysis performed by the replay detector 300. Additionally, it will be appreciated the indication may be an absolute indicator that the sample corresponds to a legitimate attempt or a replay attack, or may be a probability related to one (or both) of these options.
[0038] Upon performing analysis of the received sample and determining the likelihood of it being a live sample or a replayed sample, it may be determined whether or not to authenticate the user 100 on the basis of the received sample. That is, the received sample may be compared to a previously registered utterance of the passphrase which is known to have been provided by the legitimate user. In the following, the terms “live utterance” and “replay(ed) utterance” will be used to distinguish between a voice sample received from a human and a voice sample received from a machine or apparatus (for example, from a loudspeaker, or from an electronic device through a telephony channel). It will be appreciated that there are other manners in which these features could be distinguished. For instance, the “live utterance” could also be regarded as an “original utterance” (indicating the recorded/replayed nature of the replayed utterance) or could be given the moniker of a “legitimate” utterance, where legitimate is merely intended to demonstrate that the voice sample originated from a human being (that is, the mouth, vocal tract etc. of a person) and not from some apparatus or machinery. Similarly, the replayed utterance could also be referred to as a “fraudulent” utterance, in view of the likelihood that a replayed voice-related sample may be associated with an illegitimate attempt to gain authorization in a system.
[0039] This determination of whether or not to authenticate may be performed by the voice biometrics engine 350. Alternatively, the skilled person will appreciate that this determination of whether the user is the claimed identity may be performed at the same time as the determination of the likelihood of legitimacy of the voice sample being an original utterance (a live utterance). That is, while analysing the sample to identify whether the acoustic features of the sample indicate that the sample has been provided by a human or is a replayed signal (most likely by a fraudster), the acoustic features may also be checked, by the voice biometrics engine 350, against the previously enrolled/registered utterance. It will be described according to certain embodiments below that the determination of the likelihood of legitimacy may involve the use of a model that is trained to represent live utterances and another model that is trained to represent replayed utterances.
[0040] Figure 4 illustrates example components of the replay detector 300. As shown, the replay detector 300 may comprise a speech detection module 410. The speech detection module 410 may receive a voice sample or voice-related sample (that is, an audio sample including some voice), for example from an electronic device 200 during an authentication attempt. In certain embodiments, the speech detection module may identify a speech frame of the sample, that is, a frame of the sample which corresponds to a time when the speaker is speaking (as opposed to silences, noises or pauses between speech). For example, with reference to the upper plot of Figure 8, which shows an example of spectrogram output data for a live utterance, a speech frame could be identified around the time value of 2.1 seconds, as indicated by 810. A speech detector may determine whether a frame (for example, spanning 10 milliseconds) belongs to a speech region based on, for instance, its energy level: speech regions correspond to higher energy levels and vice versa. However, note that the manner in which a speech detector works is not limited to the measurement of energy levels.
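As a rough sketch of such an energy-based detector (not the only possible implementation, as noted above), the following snippet frames the signal into 10 ms windows and flags frames whose energy sits well above an estimated noise floor; the percentile-based floor and the multiplier are illustrative assumptions.

```python
import numpy as np

def speech_frames(signal: np.ndarray, sample_rate: int,
                  frame_ms: int = 10, factor: float = 3.0) -> np.ndarray:
    """Return a boolean mask marking frames that likely contain speech."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    # Treat the quietest 10% of frames as background noise and flag frames
    # whose energy is well above that floor as belonging to speech regions.
    noise_floor = np.percentile(energy, 10)
    return energy > factor * noise_floor
```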
[0041] The replay detector 300 may further comprise a feature extraction module 420. The feature extraction module 420 may extract spectral information from the samples identified as being in the speech region. The spectral information is processed into different forms of features which are used in various embodiments of the present invention to differentiate live from replayed utterances. Feature extraction in relation to various embodiments of the present invention is further described below.
[0042] The replay detector 300 may further comprise, for example in a storage module, a live voice model 430 and a replay voice model 435. In certain embodiments, the live voice model 430 may correspond to recordings of one or more utterances uttered directly by a human speaker. Similarly, the replay voice model 435 may correspond to recordings of utterances that are pre-recorded and later played back (for example, through a loudspeaker).
[0043] It will be appreciated that the live voice model 430 and the replay voice model 435 may be generated in a number of different ways in addition to or instead of that described above. For example, each model may correspond to a data set of features extracted from known live utterances and known replay attacks. How the features are obtained for each embodiment is described in detail below. It will be appreciated that the replay voice model may not correspond to a replayed version of a specific live voice model, but rather provides a general indication of acoustic features associated with a voice sample received in a replay attack.
[0044] To give an example, the models may be generated using a probabilistic model. For instance, a Gaussian mixture model may be used for such generation. A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Gaussian mixture models will be well understood by the person skilled in the art, however as an example a technique for generating such models is described at http://scikit-learn.org/stable/modules/mixture.html.
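As a concrete sketch of this kind of model generation, the snippet below fits one Gaussian mixture model per class with scikit-learn (the library referenced above); the component count, the diagonal covariance choice and the variable names (live_features, replay_features, arrays of shape (n_frames, n_dims)) are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_model(features: np.ndarray, n_components: int = 64) -> GaussianMixture:
    """Fit a GMM to an (n_frames, n_dims) matrix of extracted features."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(features)
    return gmm

# Hypothetical feature matrices built from known live utterances and
# known replay attacks, respectively:
# live_model = train_model(live_features)
# replay_model = train_model(replay_features)
```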
[0045] A feature comparison module 440 may use the data obtained by the feature extraction module 420 by comparing this data with at least one of the live voice model 430 and the replay voice model 435.
[0046] The live voice model 430 and the replay voice model 435 may be configured or generated to provide data of a type corresponding to that obtained from the spectral information. That is, different embodiments of the present invention will shortly be described which involve obtaining different types of data from the spectral information, and then comparing like-types of data between that deriving from the models and that deriving from the spectral information.
[0047] In comparing the obtained data to both of the live voice model 430 and the replay voice model 435, the feature comparison module 440 may determine an indication as to whether the received sample is a live utterance (and so is from a human) or is a replayed utterance (and so is from a machine and therefore a fraudulent replay of a voice sample). For example, a likelihood score may be obtained from the result of each comparison. These likelihood scores may represent how closely the obtained data corresponds to the live voice model 430 and the replay voice model 435. These likelihood scores may then be compared to identify whether the obtained data corresponds more to one model than the other, thereby providing an indication as to whether the received sample - to which the obtained data relates - is from a human or a machine (and so, for example, indicates that the sample has previously been recorded and/or is played through a loudspeaker).
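Continuing the sketch above, the comparison step might reduce to a difference of average log-likelihoods, one per model; the sign convention (positive meaning closer to the live model) is an assumption.

```python
def replay_indication(features, live_model, replay_model) -> float:
    """Score test features against both GMMs and return the difference."""
    live_ll = live_model.score(features)      # mean log-likelihood per frame
    replay_ll = replay_model.score(features)
    # Positive -> resembles the live model; negative -> resembles replay.
    return live_ll - replay_ll
```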
[0048] This indication may be provided to further components of the replay detector module 300 or some other module, for example the voice biometrics engine 350, for use in determining whether or not to authenticate the provider of the sample (or whether to even attempt this authentication determination, which may not be the case if the indicator strongly suggests that the sample is fraudulent).
[0049] Embodiments of the present invention will now be described to provide examples of data which may be obtained from the spectral information. It should be apparent that the use of one type of data in determining the likelihood of legitimacy (that is, the indication) does not preclude the use of another type of data. That is, a plurality of different spectral features, obtained from the extracted spectral information, may be used in determining the indication as to which model the sample corresponds to. For example, data for two or more different spectral features could be obtained, compared to corresponding data for each model, and the results of the comparisons then used, in combination, to determine the indication. These embodiments will, where relevant, be related to one of Figures 9 to 12, which show plots of certain data obtained for a sample corresponding to a live utterance (provided by a human) and a sample corresponding to a replayed utterance (provided by an apparatus, for example a loudspeaker). It will be appreciated that, to an extent, the replay sample used in generating these plots may actually correspond to a recording and replaying of the live utterance used in generating these plots, thereby illustrating the differences in spectral features which will be described. Additionally, it will be appreciated that the data for live voice samples and replay attack voice samples shown in the plots of Figures 8 to 12 can be interpreted as reflecting the properties of a live voice sample and a replay attack voice sample provided to the replay detector 300, or the properties of the live voice model 430 and the replay voice model 435 used by the replay detector 300.
[0050] In certain embodiments of the present invention, the feature extraction module 420 obtains spectral clarity data from the extracted spectral information. Spectral clarity data may be data which can highlight or distinguish the degradation of the clarity of spectral features which arises from the recording and playback process of a replay attack. That is, as a result of recording a legitimate vocalisation of the passphrase, a sample provided in a replay attack may have been subjected to some or all of channel noise, encoding/decoding losses, and other acoustic effects which may have arisen while recording the legitimate utterance. Additionally, as described above, if the recording has been replayed using a loudspeaker, then further alterations to the sample signal may have occurred. The existence of such differences in spectral features may be readily apparent from the example outputs for a live sample and a replay sample shown in Figure 8.
[0051] Some differences between a live and a replayed utterance can be clearly observed by a skilled person through spectrograms of those signals. A spectrogram is a 3D visual representation of an acoustic signal, with degrees of amplitude (represented on a scale from light to dark colour, where light means low energy and dark means high energy) over a range of frequencies (usually on the vertical axis) and time (usually on the horizontal axis). A spectrogram can be likened to a digitised image with pixels. The resolution of a resultant spectrogram depends on the size of the Fourier analysis window used. Figure 8 shows spectrograms of a live and a replayed utterance.
[0052] One observable difference is that live utterances may show pronounced pitch (vertical lines in speech regions), hence peaks and troughs that are more distinctive on the spectrogram; an example of this phenomenon is depicted in Figure 9, where the energy level (magnitude at a specific time and frequency) along the frequency axis at time t=2.05s is plotted. Note that the gradient rather than the absolute magnitude is plotted to further emphasise the peaks and troughs. Notice that the profile belonging to the live utterance shows a periodic and consistent zig-zag, whereas the profile of the replayed utterance shows a rather random pattern. The prominence of peaks and troughs, as well as the clearer separation between speech and nonspeech regions in the spectrogram of a live utterance, can also be portrayed through a higher value of standard deviation of the energy levels of pixels in that region. Figure 11 shows the standard deviation of a sub-window centred at time t=1.35s (this can be thought of as a specific time to which the sub-window corresponds) and measuring 3 by 7 pixels along the frequency axis; however, it will be appreciated that many different sub-windows could be defined.
[0053] Another observable difference is that a replayed utterance also suffers a loss of energy at lower frequencies, which can also be deemed a form of loss in spectral clarity. This is apparent when observing the spectrograms of a live and a replayed utterance. Figure 10 captures the difference in energy levels between a live and a replayed signal at the lower frequency range. Similar to the way the standard deviation is obtained from a spectrogram, a sub-window measuring 3 by 7 cells and centred at time t=1.65s is used to calculate the mean of the energy levels along the frequency axis (see Figure 10). As can be seen, for the live utterance this spectral clarity data shows higher values at lower frequencies than for the replayed utterance. As would therefore be apparent to the skilled person, this manner of spectral clarity data captures the loss of energy at lower frequencies that results from the recording and replaying process of the replay attack.
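A minimal sketch of computing such sub-window statistics follows, assuming a scipy spectrogram and a 7-by-3 (frequency-by-time) sub-window to match the 3-by-7 windows described for Figures 10 and 11; the FFT parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

def subwindow_stats(signal, sample_rate, t_centre, f_centre):
    """Mean and standard deviation of log-magnitude in a small
    time-frequency sub-window centred at (t_centre, f_centre)."""
    f, t, mag = spectrogram(signal, fs=sample_rate, nperseg=256)
    log_mag = np.log(mag + 1e-12)           # logarithmic scale, as in Figure 8
    ti = np.argmin(np.abs(t - t_centre))    # nearest time bin
    fi = np.argmin(np.abs(f - f_centre))    # nearest frequency bin
    win = log_mag[max(fi - 3, 0):fi + 4, max(ti - 1, 0):ti + 2]  # 7 x 3 bins
    return win.mean(), win.std()
```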
[0054] In view of this, any one or several of the examples of differences mentioned may be obtained to provide spectral clarity data which may then be used in determining the source of a received sample. The features may be used as raw features for a subsequent data-driven classifier, where spectral information of live (human-sourced) utterances goes towards training a live model, and spectral information of replayed utterances goes towards training a replayed model. During testing, a signal of unknown source will have its spectral features extracted in the same manner and tested against the pre-trained live and replayed models.
[0055] As mentioned above, any one of these three examples could be relied upon to determine the indication as to whether the sample relates to a live voice utterance or a replay voice utterance, or a combination of more than one could be used in such a determination. Embodiments employing this latter method may operate by placing equal weight on the results of each comparison (that is, a comparison between a type of spectral clarity data related to the sample and the corresponding data of the live voice model and the replay voice model). Alternatively, embodiments employing the latter method may assign different weights to the results of each comparison, thereby relying more on the result of one than the other. This may be useful in situations where it is known that the results of using one of the examples are less conclusive than the results of another. Furthermore, according to certain embodiments, the data obtained in any of the examples could be obtained at several different times. The obtained data for each time could then be tested against or compared to a live voice model and a replay voice model. Then, the results of these individual comparisons could be averaged (that is, used in combination) to determine an indication as to whether the received sample is a live utterance or a replayed utterance. For instance, the magnitude of the signal (energy levels) could be obtained for several different times within a speech region of a received voice sample.
[0056] It is further observed that the ratios of energy levels between certain pairs of frequency bands are indicative of whether the source is a live or a replayed utterance. In certain embodiments of the present invention, the feature extraction module 420 is configured to obtain spectral ratio data (for example, using spectral magnitude data) from the sample during speech regions. As mentioned above, replayed utterances may have a poor frequency response at the low-frequency part of the spectrum, where the signal for the replayed utterance is attenuated. For example, this may arise due to replaying a recorded utterance through a loudspeaker. To highlight the discrepancy between live utterances and replay utterances which arises as a result, ratios of spectral magnitude obtained for different frequency ranges may be calculated. To provide a non-limiting example, the following ratios may be calculated: (180 to 400) Hz : (400 to 620) Hz and (90 to 720) Hz : (1090 to 1720) Hz, as shown by the below formula, where $\overline{M}_{a\text{-}b}(t)$ denotes the average of $M(f, t)$ over frequencies $f$ from $a$ to $b$ Hz:

$$R(t) = \frac{\overline{M}_{180\text{-}400}(t)}{\overline{M}_{400\text{-}620}(t)} \times \frac{\overline{M}_{90\text{-}720}(t)}{\overline{M}_{1090\text{-}1720}(t)}$$

where M(f, t) is the energy level (magnitude) of the signal at frequency f and time t, and R(t) is the product of the ratios of average magnitude within the chosen frequency ranges at time t. The sum of R(t) across all t in the speech regions is compared to a threshold, above which the signal is indicated to be a live signal (live utterance), and vice versa. It should be noted that the use of such a threshold may preclude the need to make any comparison involving the live voice model and the replayed voice model. That is, the threshold may be set in knowledge of the values of the product of such ratios which arise from a live utterance and from a replayed utterance. It will therefore be appreciated that embodiments of the present invention where said ratios are calculated can be seen to use such calculations as a deterministic score (likelihood score).
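A minimal sketch of computing R(t) from a magnitude spectrogram follows; the array layout (frequencies by time frames), the helper names and the epsilon guard are illustrative assumptions, and the decision threshold would be tuned on known live and replayed data as described above.

```python
import numpy as np

def band_mean(spec, freqs, lo, hi, t):
    """Average magnitude M(f, t) over frequency rows in [lo, hi) Hz."""
    rows = (freqs >= lo) & (freqs < hi)
    return spec[rows, t].mean()

def ratio_product(spec, freqs, t, eps=1e-12):
    """R(t): product of the two band ratios defined above."""
    r1 = band_mean(spec, freqs, 180, 400, t) / (band_mean(spec, freqs, 400, 620, t) + eps)
    r2 = band_mean(spec, freqs, 90, 720, t) / (band_mean(spec, freqs, 1090, 1720, t) + eps)
    return r1 * r2

# Sum R(t) over speech-region frames and compare against a tuned threshold:
# score = sum(ratio_product(spec, freqs, t) for t in speech_frame_indices)
# is_live = score > threshold
```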
[0057] In certain embodiments of the present invention, the feature extraction module 420 is configured to obtain linear frequency cepstral coefficients (LFCCs). The LFCC, together with its delta (derivative) and double-delta (derivative of the derivative), is one of the most commonly used features in general speech processing, such as speaker verification and speech recognition. The LFCC is computed by applying a bank of linearly-spaced filters to the power spectrum of the signal and taking the logarithmic compression of the amplitude. A vector of LFCCs may be computed for each time frame in the speech region. In certain embodiments of the present invention, LFCCs are extracted from a set of live utterances and used to train a generative model which represents the characteristics of live utterances; similar steps are taken to train a replay model. When assessing a signal of unknown source, its likelihoods, when matched against the two models, give an indication of whether the signal is a live or a replay signal. It is worth noting that for normal use of LFCCs in speech processing, cepstral mean subtraction (CMS) and cepstral variance normalisation (CVN) are applied to remove unwanted variability in the features. This means that features across all speech frames are normalised to have zero mean and unity variance. However, in the case of replay detection, only CVN is applied, because it is observed that the difference in cepstral values between the LFCCs of live and replayed signals could carry meaningful information. Figure 12 shows the spread of 19 static and 19 delta LFCCs for the same utterances shown in Figure 9.
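A hedged sketch of this extraction follows: linearly-spaced triangular filters over the power spectrum, log compression, a DCT, then variance normalisation only (no mean subtraction, for the reason given above). The filter count, frame layout and coefficient count are illustrative assumptions, not values specified here.

```python
import numpy as np
from scipy.fft import dct

def lfcc(power_frames: np.ndarray, n_filters: int = 40, n_ceps: int = 19) -> np.ndarray:
    """power_frames: (n_frames, n_bins) power spectrum, one row per frame."""
    n_bins = power_frames.shape[1]
    # Bank of linearly-spaced triangular filters across the spectrum.
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bins - lo) / (mid - lo + 1e-9)
        falling = (hi - bins) / (hi - mid + 1e-9)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    log_energy = np.log(power_frames @ fbank.T + 1e-12)   # log compression
    ceps = dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_ceps]
    # CVN only: unity variance across frames, cepstral means left intact.
    return ceps / (ceps.std(axis=0) + 1e-12)
```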
[0058] In certain embodiments, when matching the received signal (voice sample) against the live voice model and the replay voice model, the computed vector of LFCCs may be compared to corresponding statistics of the models. That is, the skilled person will appreciate that, in generating the models, LFCCs from training data (potentially, this could be a large amount of training data used for constructing the models) may have been transformed or modified such that the models can be collectively representative of the training data used in their generation. As such the skilled person will understand that a reference to a direct comparison of an LFCC from a received signal to an LFCC obtained from one of the models may not reflect that which is literally occurring. Instead, it will be appreciated that a comparison involving LFCCs obtained from a received signal simply involves data or statistics from the live voice model or the replay voice model which is suitable for allowing the differences between a live sample and a replayed sample to become apparent (an example of these differences being apparent from Fig. 12).
[0059] In certain other embodiments of the present invention, the task of distinguishing between live and replayed utterances is likened to distinguishing between speech and nonspeech. It is reasonable to claim that the speech quality (and intelligibility) of a replayed utterance is unlikely to be greater than that of a live utterance. Therefore, an utterance that exhibits a greater likelihood towards speech characteristics as opposed to nonspeech characteristics is more likely to be a live utterance. With this logic, two generative models are generated: one trained on speech characteristics and the other on nonspeech characteristics. The features used to train the models are LFCCs, which may be extracted in the same manner as described above. Additionally, in certain of these embodiments the LFCCs are further subjected to zero-mean normalization, in addition to the unity-variance normalization indicated above. The data used for training the speech model is a large pool of speech by many speakers and from varied environments. The data used for training the nonspeech model includes a large variety of nonspeech such as background noise, channel noise (e.g. clips, tones), distortion (e.g. reverberation), and so on. The difference between the likelihood scores against the speech model and the nonspeech model is taken as a measure of how probable it is that the signal is a live signal or a replay signal.
[0060] In view of this observation, in these embodiments of the present invention the live voice model is trained with speech data while the replay voice model is trained with nonspeech data. As a result, the degree of similarity between a received voice sample and the live voice model trained with speech data increases when the received voice sample relates to a live utterance, while that between a received voice sample and the replay voice model trained with nonspeech data increases when the received voice sample relates to a replayed utterance. This allows for a greater degree of certainty when making comparisons between a received sample and one of the models. The skilled person will readily appreciate how the training of models with speech or nonspeech data may be performed. For instance, this may also be achieved using a probabilistic model such as a Gaussian mixture model.
[0061] As mentioned above, the methods described in any one of these embodiments may be used with regards to obtaining data from the spectral information. However, it should also be apparent that a combination involving any two or more of these methods may also be implemented. To provide an example, a method where an indication as to whether a received voice sample is a live utterance or a replayed utterance is determined from the standard deviation of spectral magnitude data may be combined with a method where that indication is determined from obtained LFCC data. As described above for the different types of data related to investigating spectral clarity, such a combination can be made whereby the likelihood result for each method is assigned the same weight when determining the indication as to whether the received sample relates to a live voice utterance (from a human) or a replayed voice utterance (from an apparatus); alternatively, different weights could be assigned to the different methods, thereby granting more weight to a likelihood score from one method than from another. For instance, further to the example described above, both indications could be given the same weight, and so an average indication obtained by combining the two in a straightforward manner. Alternatively, the indication from the LFCC method could be given more weight than that from the standard deviation data.
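A minimal sketch of such weighted fusion follows; the score convention (positive meaning live), the method names and the example weights are illustrative assumptions.

```python
def combine_indications(scores: dict, weights: dict) -> float:
    """Weighted average of per-method indications (positive -> live)."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total

# Equal weights reduce to a plain average; here the LFCC method is
# trusted twice as much as the standard-deviation method:
combined = combine_indications(
    {"std_dev": 0.4, "lfcc": 1.1},
    {"std_dev": 1.0, "lfcc": 2.0},
)
```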
[0062] Figure 5 shows a flowchart illustrating a method in accordance with an embodiment of the present invention.
[0063] In step 510, a voice sample is received. This sample may be received at a replay detection module 300 from an electronic device 200 used in an authentication process, for example.
[0064] Spectral information is then extracted, in step 520, from the received voice sample. As described further above, the spectral information may actually be extracted from at least one speech frame of the received sample if such has been identified from the voice sample.
[0065] The spectral information is then processed to obtain spectral clarity data, in step 530. The spectral clarity data may comprise data corresponding to at least one of: the time-frequency response of the at least one frame, the mean magnitude in a sub-window along a frequency axis of the at least one frame, and the standard deviation in a sub-window along a frequency axis of the at least one frame. Further detail regarding these data types is provided above, where reference is made to Figs. 8 to 11.
[0066] Based on the obtained spectral clarity data, the method further comprises, according to step 540, matching or comparing the features to models pre-trained on live and replay signals, thereby determining an indication regarding whether the sample relates to a live utterance or a replayed utterance in step 550.
[0067] As described above, the models may be generated using probabilistic methods involving live speech samples (emanating from a human source) and replayed speech samples (emanating from an apparatus or machine, e.g. a loudspeaker or a smartphone, via a telephony channel). Depending on the spectral clarity feature(s) obtained from the received voice sample (for example, standard deviation data as described above), data representing the same feature(s) is obtained from the models such that an apt comparison can be made between the data from the received voice sample and that from the models. The results of each comparison may be used to determine a likelihood as to whether the received voice sample is a live utterance (spoken by a human) or a replayed utterance (provided by an apparatus).
[0068] Figure 6 shows a flowchart illustrating a method in accordance with another embodiment of the present invention.
[0069] In step 610, a voice sample is received. This sample may be received at a replay detection module 300 from an electronic device 200 used in an authentication process, for example.
[0070] At step 620, spectral information is extracted from the received voice sample. For instance, the power spectrum of the received signal (the voice sample) may be determined for a speech region of the received voice sample.
[0071] In step 630, the spectral information is processed to obtain a plurality of unity-variance normalized LFCCs. For instance, a bank of linearly-spaced filters may be applied to a determined power spectrum of the received signal, and logarithmic compression of the amplitude taken. A vector of LFCCs may therefore be computed for each time frame in a speech region of the received voice sample. The derivatives of the obtained LFCCs may also be calculated.
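One possible realisation of this step is sketched below; the filter count, the cepstral order and the simple first-difference used for the derivatives are illustrative assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def lfcc(frames_power, n_filters=20, n_ceps=13):
    """LFCCs from per-frame power spectra (shape: n_frames x n_bins):
    linearly-spaced triangular filterbank, log compression, then DCT."""
    n_bins = frames_power.shape[1]
    edges = np.linspace(0, n_bins - 1, n_filters + 2)  # linear spacing
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        bins = np.arange(n_bins)
        up = (bins - lo) / (mid - lo + 1e-9)     # rising slope
        down = (hi - bins) / (hi - mid + 1e-9)   # falling slope
        fb[i] = np.clip(np.minimum(up, down), 0.0, 1.0)
    log_energies = np.log(frames_power @ fb.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

def deltas(ceps):
    """First-order derivatives via simple frame-to-frame differences."""
    return np.diff(ceps, axis=0, prepend=ceps[:1])
```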
[0072] In step 640, matching (i.e. a comparison) is performed between the LFCCs determined from (i.e. computed for) the received voice sample and corresponding statistics from at least one of a live voice model and a replay voice model, these models having been trained to be representative of the characteristics of live signals (live utterances) and of replayed signals (replayed utterances), respectively. The live voice model and the replay voice model may have been generated using probabilistic methods such as those described above; however, the skilled person will appreciate that other options exist.
[0073] In step 650, it is determined whether the received voice sample relates to a live utterance or a replayed utterance. This determination may be based on an indication provided by the results of the comparison between the received signal LFCCs and the corresponding statistics of the different models.
[0074] Figure 7 shows a flowchart illustrating a method in accordance with another embodiment of the present invention.
[0075] In step 710, a voice sample is received. This sample may be received at a replay detection module 300 from an electronic device 200 used in an authentication process, for example.
[0076] In step 720, spectral information is extracted from the received voice sample. For instance, the power spectrum of the received signal (the voice sample) may be determined for a speech region of the received voice sample.
[0077] In step 730, the spectral information is processed to obtain a plurality of unity-variance normalized and zero-mean normalized LFCCs. Obtaining or calculating the LFCCs, or the vector of LFCCs, from the spectral information may be achieved as described with regard to Fig. 6, albeit with the additional step of zero-mean normalizing the LFCCs. The derivatives of the obtained LFCCs may also be calculated.
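The two normalizations named in this step might be applied per coefficient over the utterance, as in this sketch (the normalization window is an assumption; the specification does not fix it):

```python
import numpy as np

def cmvn(ceps, zero_mean=True, unit_variance=True):
    """Normalize each cepstral coefficient over the utterance."""
    out = np.asarray(ceps, dtype=float)
    if zero_mean:
        out = out - out.mean(axis=0)            # zero-mean normalization
    if unit_variance:
        out = out / (out.std(axis=0) + 1e-10)   # unity-variance scaling
    return out
```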
[0078] In step 740, the obtained LFCCs and their derivatives are then compared with corresponding statistics from a live voice model and a replayed voice model. These generative models are trained with speech data and non-speech data, respectively. As described above, this training may be performed by extracting LFCC data from known live utterances (e.g. a large pool of speech from various speakers in various environments) and from known replayed utterances (e.g. recordings exhibiting replay artefacts such as background noise, channel noise, distortion, etc.), and using the extracted LFCC data to train the relevant model.
[0079] In step 750, an indication, based on the result of the comparison, is determined as to whether the received voice sample relates to speech, and hence a live utterance, or to non-speech, and so a replayed utterance. For instance, this indication could be based on a likelihood of a match computed from the comparison with each model.
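The training described above might be sketched as follows, again assuming Gaussian mixture models and scikit-learn purely for illustration; `live_lfccs` and `replay_lfccs` stand for arrays of normalized LFCC (and delta) vectors extracted from known live and replayed utterances, respectively:

```python
from sklearn.mixture import GaussianMixture

def train_models(live_lfccs, replay_lfccs, n_components=64):
    """Fit one generative model per class. Diagonal covariances and the
    component count are assumed choices, not mandated by the patent."""
    live_gmm = GaussianMixture(n_components, covariance_type='diag',
                               max_iter=200)
    replay_gmm = GaussianMixture(n_components, covariance_type='diag',
                                 max_iter=200)
    live_gmm.fit(live_lfccs)       # known live utterances
    replay_gmm.fit(replay_lfccs)   # known replayed utterances
    return live_gmm, replay_gmm
```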
[0080] Figure 13 shows a method by which an indication as to whether a received voice sample relates to a live utterance or a replayed utterance may be obtained. Such a method may form the sole basis of a determination of the legitimacy of a user, or it may be combined with another method by which a further indication may be obtained, such as those illustrated in each of Figs. 5 to 7.
[0081] In step 910, a voice sample is received. This sample may be received at a replay detection module 300 from an electronic device 200 used in an authentication process, for example. In step 920, spectral information is extracted from the received voice sample.
[0082] In step 930, the spectral information is processed to obtain spectral magnitude data for different frequency ranges. These frequency ranges may be defined for a speech region of the received voice sample and may include, for example, first and second frequency ranges, where the first frequency range includes lower frequencies than those included in the second frequency range.
[0083] In step 940, the spectral magnitude data is used to calculate a ratio of one set of spectral magnitude data to another (that is, the spectral magnitude data for one frequency range to the spectral magnitude data for another frequency range). It will be appreciated that a number of such calculations may be performed, involving spectral magnitude data for different frequency ranges in the same speech region (or a same time slot of the speech region), or involving spectral magnitude data from a different speech region (or a different time slot of the same speech region).
[0084] In step 950, the calculated ratio(s) may be compared to a threshold to determine an indication as to whether the received voice sample relates to a live utterance or to a replayed utterance.
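Steps 930 to 950 admit a very simple sketch along the following lines; the band edges, the direction of the comparison and the threshold value are all assumptions for illustration, since in practice these would be tuned on known live and replayed data:

```python
import numpy as np
from scipy.signal import stft

def band_ratio_indication(samples, fs=16000, low=(0, 2000),
                          high=(2000, 4000), threshold=1.0):
    """Compare mean spectral magnitude in a low band to that in a high
    band; replay through a loudspeaker and channel typically alters the
    balance between bands, so the ratio is tested against a threshold."""
    f, t, Z = stft(samples, fs=fs, nperseg=512)
    mag = np.abs(Z)
    low_mag = mag[(f >= low[0]) & (f < low[1])].mean()
    high_mag = mag[(f >= high[0]) & (f < high[1])].mean()
    ratio = low_mag / (high_mag + 1e-10)
    # The decision direction is an assumption; the threshold (and its
    # direction) would be calibrated on labelled live/replayed samples.
    return "live" if ratio < threshold else "replay"
```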
[0085] Throughout the description and claims of this specification, the words “comprise” and “contain” and variations of them mean “including but not limited to”, and they are not intended to (and do not) exclude other components, integers or steps. Throughout the description and claims of this specification, the singular encompasses the plural unless the context otherwise requires. In particular, where the indefinite article is used, the specification is to be understood as contemplating plurality as well as singularity, unless the context requires otherwise.
[0086] Features, integers and characteristics, described in conjunction with a particular aspect, embodiment or example of the invention are to be understood to be applicable to any other aspect, embodiment or example described herein unless incompatible therewith. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
[0087] The reader's attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
[0088] The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.
Claims (17)
1. A method for detecting a replay attack in a biometric system, the method comprising: receiving a voice sample; extracting spectral information from the voice sample; processing the spectral information to obtain data corresponding to at least one of: signal magnitude at a specific time; the mean of signal magnitude for a sub-window defining a frequency and time range; and the standard deviation of signal magnitude for a sub-window defining a frequency and time range; and determining, based on the obtained data, an indication regarding whether the sample relates to a live utterance or a replayed utterance.
2. The method of claim 1, wherein the spectral information is extracted from at least one frame of the voice sample; and wherein the at least one frame corresponds to a speech region of the voice sample.
3. The method of claim 1 or claim 2, wherein the determining comprises: comparing the obtained data against corresponding data obtained from at least one of a live voice model and a replay voice model; and determining the indication based on the result of the comparison.
4. The method of claim 3, wherein the live voice model and the replay voice model are generated using probabilistic models.
5. The method of claim 3 or claim 4, wherein the specific time is one of a plurality of predetermined times; wherein the spectral information is processed to obtain data corresponding to each of the predetermined times; and wherein determining the indication further comprises: comparing the obtained data for each of the predetermined times against corresponding data obtained from at least one of the live voice model and the replay voice model; and determining the indication based on a combination of the results of the comparisons.
6. A method for detecting a replay attack in a biometric system, the method comprising: receiving a voice sample; extracting spectral information from the voice sample; processing the spectral information to obtain a plurality of Linear Frequency Cepstral Coefficients, LFCCs, to which unity-variance normalisation is applied; and determining, based on a comparison between the obtained LFCCs and at least one of a live voice model and a replay voice model, an indication regarding whether the sample relates to a live utterance or a replayed utterance.
7. The method of claim 6, wherein the spectral information is extracted from at least one frame of the voice sample; and wherein the at least one frame corresponds to a speech region of the voice sample.
8. The method of claim 6 or claim 7, wherein the determining comprises: comparing at least one of the obtained LFCCs and delta LFCCs derived from the obtained LFCCs with corresponding statistics from at least one of the live voice model and the replay voice model; and determining the indication based on the result of the comparison.
9. The method of claim 6 or claim 7, wherein the determining comprises: comparing the obtained LFCCs with corresponding statistics from at least one of the live voice model and the replay voice model; and determining the indication based on the result of the comparison; wherein the live voice model is trained with speech related data and the replay voice model is trained with non-speech related data.
10. The method of claim 9, wherein zero-mean normalization is further applied to the obtained LFCCs; and wherein comparing the obtained LFCCs comprises comparing the zero-mean normalized LFCCs to corresponding statistics from at least one of the live voice model and the replay voice model.
11. The method of any of claims 8 to 10, wherein the live voice model and the replay voice model are generated using probabilistic models.
12. The method of any previous claim, wherein if it is indicated that the received sample relates to the live utterance, the method further comprises determining whether or not to authenticate on the basis of the received sample; and wherein if it is indicated that the received sample relates to the replay utterance, the method further comprises denying authentication.
13. The method of any one of claims 6 to 12, further comprising: processing the spectral information to obtain data corresponding to at least one of: signal magnitude at a specific time; the mean of signal magnitude for a sub-window defining a frequency and time range; and the standard deviation of signal magnitude for a sub-window defining a frequency and time range; and determining, based on the obtained data, a further indication regarding whether the sample relates to a live utterance or a replayed utterance.
14. The method of any previous claim, further comprising: processing the spectral information to obtain a plurality of ratios of signal magnitude at a specific time, wherein each ratio is calculated between two frequency ranges; calculating at least one product of at least two of the calculated ratios; and determining, based on the at least one product, a further indication regarding whether or not the received sample relates to a live utterance or a replayed utterance.
15. The method of claim 14, wherein determining a further indication comprises comparing the product to a threshold; wherein the indication is based on the result of the comparison.
16. The method of claim 14 or claim 15, wherein the specific time is one of a plurality of predetermined times; and wherein determining the further indication is based on a combination of a plurality of products, said products each having been calculated for one of the predetermined times.
17. A replay detector apparatus arranged to perform the method of any preceding claim.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB201514943A GB2541466B (en) | 2015-08-21 | 2015-08-21 | Replay attack detection |
Publications (3)
Publication Number | Publication Date |
---|---|
GB201514943D0 GB201514943D0 (en) | 2015-10-07 |
GB2541466A true GB2541466A (en) | 2017-02-22 |
GB2541466B GB2541466B (en) | 2020-01-01 |
Family
ID=54292054
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB201514943A Active GB2541466B (en) | 2015-08-21 | 2015-08-21 | Replay attack detection |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2541466B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10238341B2 (en) * | 2016-05-24 | 2019-03-26 | Graco Children's Products Inc. | Systems and methods for autonomously soothing babies |
CN110689893A (en) * | 2019-10-12 | 2020-01-14 | 四川虹微技术有限公司 | Method for improving voice payment security |
CN111275858B (en) * | 2020-01-22 | 2022-07-01 | 广东快车科技股份有限公司 | Credit method and system for voiceprint recognition |
CN111564163B (en) * | 2020-05-08 | 2023-12-15 | 宁波大学 | RNN-based multiple fake operation voice detection method |
CN113053397A (en) * | 2021-03-04 | 2021-06-29 | 常州分音塔科技有限公司 | Recording attack prevention identity authentication method, device and system |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2388947A (en) * | 2002-05-22 | 2003-11-26 | Domain Dynamics Ltd | Method of voice authentication |
WO2015032876A1 (en) * | 2013-09-04 | 2015-03-12 | Voicetrust Eservices Canada Inc. | Method and system for authenticating a user/caller |
Cited By (49)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3373177A1 (en) * | 2017-03-07 | 2018-09-12 | Daon Holdings Limited | Methods and systems for determining user liveness |
US10083696B1 (en) | 2017-03-07 | 2018-09-25 | Daon Holdings Limited | Methods and systems for determining user liveness |
AU2018201573B2 (en) * | 2017-03-07 | 2022-01-27 | Daon Technology | Methods and Systems for Determining User Liveness |
US11271756B2 (en) | 2017-04-28 | 2022-03-08 | Cirrus Logic, Inc. | Audio data transfer |
US12026241B2 (en) | 2017-06-27 | 2024-07-02 | Cirrus Logic Inc. | Detection of replay attack |
US11042616B2 (en) | 2017-06-27 | 2021-06-22 | Cirrus Logic, Inc. | Detection of replay attack |
US10770076B2 (en) | 2017-06-28 | 2020-09-08 | Cirrus Logic, Inc. | Magnetic detection of replay attack |
US11704397B2 (en) | 2017-06-28 | 2023-07-18 | Cirrus Logic, Inc. | Detection of replay attack |
US10853464B2 (en) | 2017-06-28 | 2020-12-01 | Cirrus Logic, Inc. | Detection of replay attack |
CN110832580A (en) * | 2017-06-28 | 2020-02-21 | 思睿逻辑国际半导体有限公司 | Detection of replay attacks |
US11164588B2 (en) | 2017-06-28 | 2021-11-02 | Cirrus Logic, Inc. | Magnetic detection of replay attack |
GB2563953A (en) * | 2017-06-28 | 2019-01-02 | Cirrus Logic Int Semiconductor Ltd | Detection of replay attack |
US11829461B2 (en) | 2017-07-07 | 2023-11-28 | Cirrus Logic Inc. | Methods, apparatus and systems for audio playback |
US11755701B2 (en) | 2017-07-07 | 2023-09-12 | Cirrus Logic Inc. | Methods, apparatus and systems for authentication |
US11714888B2 (en) | 2017-07-07 | 2023-08-01 | Cirrus Logic Inc. | Methods, apparatus and systems for biometric processes |
US12135774B2 (en) | 2017-07-07 | 2024-11-05 | Cirrus Logic Inc. | Methods, apparatus and systems for biometric processes |
US12248551B2 (en) | 2017-07-07 | 2025-03-11 | Cirrus Logic Inc. | Methods, apparatus and systems for audio playback |
GB2564500A (en) * | 2017-07-07 | 2019-01-16 | Cirrus Logic Int Semiconductor Ltd | Audio data transfer |
US10957328B2 (en) | 2017-07-07 | 2021-03-23 | Cirrus Logic, Inc. | Audio data transfer |
US10984083B2 (en) | 2017-07-07 | 2021-04-20 | Cirrus Logic, Inc. | Authentication of user using ear biometric data |
GB2577451B (en) * | 2017-07-07 | 2022-02-16 | Cirrus Logic Int Semiconductor Ltd | Audio data transfer |
GB2564495A (en) * | 2017-07-07 | 2019-01-16 | Cirrus Logic Int Semiconductor Ltd | Audio data transfer |
US11042618B2 (en) | 2017-07-07 | 2021-06-22 | Cirrus Logic, Inc. | Methods, apparatus and systems for biometric processes |
US11042617B2 (en) | 2017-07-07 | 2021-06-22 | Cirrus Logic, Inc. | Methods, apparatus and systems for biometric processes |
US11705135B2 (en) | 2017-10-13 | 2023-07-18 | Cirrus Logic, Inc. | Detection of liveness |
WO2019073234A1 (en) * | 2017-10-13 | 2019-04-18 | Cirrus Logic International Semiconductor Limited | Detection of replay attack |
GB2581595B (en) * | 2017-10-13 | 2021-09-22 | Cirrus Logic Int Semiconductor Ltd | Detection of Replay Attack |
GB2567503A (en) * | 2017-10-13 | 2019-04-17 | Cirrus Logic Int Semiconductor Ltd | Analysing speech signals |
GB2581595A (en) * | 2017-10-13 | 2020-08-26 | Cirrus Logic Int Semiconductor Ltd | Detection of Replay Attack |
US10832702B2 (en) | 2017-10-13 | 2020-11-10 | Cirrus Logic, Inc. | Robustness of speech processing system against ultrasound and dolphin attacks |
US10839808B2 (en) | 2017-10-13 | 2020-11-17 | Cirrus Logic, Inc. | Detection of replay attack |
US11270707B2 (en) | 2017-10-13 | 2022-03-08 | Cirrus Logic, Inc. | Analysing speech signals |
US11051117B2 (en) | 2017-11-14 | 2021-06-29 | Cirrus Logic, Inc. | Detection of loudspeaker playback |
GB2570199B (en) * | 2017-12-08 | 2020-11-25 | Cirrus Logic Int Semiconductor Ltd | Multi-microphone human talker detection |
GB2570199A (en) * | 2017-12-08 | 2019-07-17 | Cirrus Logic Int Semiconductor Ltd | Multi-microphone human talker detection |
US10733276B2 (en) | 2017-12-08 | 2020-08-04 | Cirrus Logic International Semiconductor Ltd. | Multi-microphone human talker detection |
US11475899B2 (en) | 2018-01-23 | 2022-10-18 | Cirrus Logic, Inc. | Speaker identification |
US11694695B2 (en) | 2018-01-23 | 2023-07-04 | Cirrus Logic, Inc. | Speaker identification |
US11735189B2 (en) | 2018-01-23 | 2023-08-22 | Cirrus Logic, Inc. | Speaker identification |
WO2019210796A1 (en) * | 2018-05-02 | 2019-11-07 | Oppo广东移动通信有限公司 | Speech recognition method and apparatus, storage medium, and electronic device |
US11631402B2 (en) | 2018-07-31 | 2023-04-18 | Cirrus Logic, Inc. | Detection of replay attack |
CN112424860A (en) * | 2018-07-31 | 2021-02-26 | 思睿逻辑国际半导体有限公司 | Detection of replay attacks |
CN112424860B (en) * | 2018-07-31 | 2024-12-17 | 思睿逻辑国际半导体有限公司 | Replay attack detection |
US11748462B2 (en) | 2018-08-31 | 2023-09-05 | Cirrus Logic Inc. | Biometric authentication |
US11037574B2 (en) | 2018-09-05 | 2021-06-15 | Cirrus Logic, Inc. | Speaker recognition and speaker change detection |
WO2021103913A1 (en) * | 2019-11-27 | 2021-06-03 | 华为技术有限公司 | Voice anti-counterfeiting method and apparatus, terminal device, and storage medium |
US11295758B2 (en) | 2020-03-20 | 2022-04-05 | Seagate Technology Llc | Trusted listening |
GB2618425B (en) * | 2022-04-26 | 2024-10-02 | Cirrus Logic Int Semiconductor Ltd | Live speech detection |
GB2618425A (en) * | 2022-04-26 | 2023-11-08 | Cirrus Logic Int Semiconductor Ltd | Live speech detection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
GB2541466A (en) | Replay attack detection | |
Chen et al. | Who is real bob? adversarial attacks on speaker recognition systems | |
US11488605B2 (en) | Method and apparatus for detecting spoofing conditions | |
US20150112682A1 (en) | Method for verifying the identity of a speaker and related computer readable medium and computer | |
US10950245B2 (en) | Generating prompts for user vocalisation for biometric speaker recognition | |
CA2013371C (en) | Voice verification circuit for validating the identity of telephone calling card customers | |
EP2364495B1 (en) | Method for verifying the identify of a speaker and related computer readable medium and computer | |
US20080270132A1 (en) | Method and system to improve speaker verification accuracy by detecting repeat imposters | |
Wu et al. | A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case | |
US8589167B2 (en) | Speaker liveness detection | |
CN109564759B (en) | Speaker identification | |
JP2002514318A (en) | System and method for detecting recorded speech | |
WO2018148298A1 (en) | Age compensation in biometric systems using time-interval, gender, and age | |
EP3740949A1 (en) | Authenticating a user | |
Paul et al. | Countermeasure to handle replay attacks in practical speaker verification systems | |
JP2015191076A (en) | Voice identification device | |
Muckenhirn et al. | Presentation attack detection using long-term spectral statistics for trustworthy speaker verification | |
Shirvanian et al. | Quantifying the breakability of voice assistants | |
Ye et al. | Detection of replay attack based on normalized constant q cepstral feature | |
Li et al. | Cross-domain audio deepfake detection: Dataset and analysis | |
Mahanta et al. | Warping path and gross spectrum information for speaker verification under degraded condition | |
Chadha et al. | Text-independent speaker recognition for low SNR environments with encryption | |
Staroniewicz | Tests of robustness of GMM speaker verification in VoIP telephony | |
Paul et al. | Presence of speech region detection using vowel-like regions and spectral slope information | |
Segărceanu et al. | Speaker Verification using the dynamic time warping |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
732E | Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977) |
Free format text: REGISTERED BETWEEN 20180222 AND 20180228 |