The present application claims priority from U.S. provisional patent application Ser. No. 63/497,941, filed 24 at 2023, and European patent application No. 23181973.1, filed 28 at 2023, each of which is incorporated herein by reference in its entirety.
Disclosure of Invention
In view of at least some of these needs, the present disclosure provides methods and apparatus for configuring a Deep Neural Network (DNN) for estimating an indication of subjective listening scores, methods for estimating an indication of subjective listening scores using a DNN, and corresponding apparatus, programs, and computer-readable storage media.
The present disclosure also provides methods and apparatus for evaluating playout performance in an adaptive streaming environment, methods of providing playout-related information, and corresponding apparatus, programs, and computer-readable storage media.
An aspect of the present disclosure relates to a method of configuring a DNN for estimating an indication of subjective listening scores of an audio signal (e.g., a test audio signal). The method may be, for example, a method of training the DNN. The listening score may be a score according to a listening test performed according to a predefined listening test methodology. The predefined listening test methodology may be a standardized listening test methodology. Additionally, the listening test may apply predefined test metrics and/or test scenarios. The method may include providing an output stage of the DNN that generates an indication of a listening score. The method may further include training the DNN by inputting one or more training data items in a training round among a plurality of training rounds, each training data item indicating a respective value of the listening score. Training the DNN may also include determining respective indications of the listening scores based on the one or more training data items in a training round. Training the DNN may further comprise determining respective loss values for the one or more training data items by evaluating a loss function in a training round. Here, the loss function may depend on the indication of the listening score. Training the DNN may also include adjusting one or more internal parameters of the DNN in a training round based on the determined loss values. The internal parameters of the DNN may be model parameters, for example, coefficients (e.g., filter coefficients) of the multiple layers of the DNN.
Thus, the DNN is trained not on the mean of subjective listening scores but on individual listening scores. This allows adapting the DNN to quantities other than the mean listening score, including, for example, the probability distribution, standard deviation, and/or confidence interval of the listening score.
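By way of illustration only, the training procedure described above may be sketched at a toy scale as follows, assuming a Gaussian parameterization of the score distribution and collapsing the DNN to two free output parameters (a mean and a log standard deviation); all names are illustrative, and the individual listener scores are assumed to be normalized to [0, 1]:

```python
import math

def gaussian_nll(s, mu, log_sigma):
    # Per-item NLL up to an additive constant: log(sigma) + (s - mu)^2 / (2 sigma^2)
    return log_sigma + (s - mu) ** 2 / (2 * math.exp(2 * log_sigma))

def train(scores, lr=0.005, rounds=5000):
    # Toy stand-in for the DNN output stage: mu and log_sigma are the only free
    # internal parameters, adjusted in each training round via the batch
    # gradient of the NLL loss evaluated on individual listener scores.
    mu, log_sigma = 0.0, 0.0
    for _ in range(rounds):
        sigma2 = math.exp(2 * log_sigma)
        # Closed-form gradients of the mean NLL w.r.t. mu and log_sigma
        g_mu = sum(-(s - mu) / sigma2 for s in scores) / len(scores)
        g_ls = sum(1.0 - (s - mu) ** 2 / sigma2 for s in scores) / len(scores)
        mu -= lr * g_mu
        log_sigma -= lr * g_ls
    return mu, math.exp(log_sigma)
```

Trained on the individual scores [0.8, 0.9, 1.0], the sketch converges toward the sample mean together with a non-zero spread, illustrating that training on per-listener scores captures more than the mean score alone.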
In some embodiments, the indication of the listening score may be related to a probability distribution of the listening score, wherein the output stage is adapted to generate the probability distribution of the listening score. The probability distribution may simulate listening scores obtained by a plurality of listening tests on the audio signal. The plurality of listening tests simulated by the probability distribution may be independent listening tests. In addition, the probability distribution may be parameterized by two or more parameters of the probability distribution. Then, determining respective indications of the listening scores based on the one or more training data items may include determining respective parameters of the probability distribution based on the one or more training data items. Determining the parameters of the probability distribution may be based at least in part on the value of the subjective listening score. In addition, determining the parameters of the probability distribution may be based on a current state of the DNN, e.g., current values of the internal parameters of the DNN. The loss function may depend on the parameters of the distribution.
In some embodiments, training the DNN may be based on the maximum likelihood principle.
In some embodiments, the loss function may be related to a negative log-likelihood (NLL) loss.
In some embodiments, the negative log-likelihood loss may be given by L_NLL = −log P(s | x, y; θ), where P(s | x, y; θ) is the probability distribution of the test score s given the representation y of the audio signal and the representation x of the reference audio signal, and θ indicates the internal parameters of the DNN.
Using the principle of maximum likelihood, or correspondingly, using the negative log-likelihood loss, allows the DNN to be efficiently trained on individual listening scores to provide an indication of the listening scores that can be expected in an actual subjective listening test with multiple listeners.
In some embodiments, the probability distribution may be related to a Gaussian distribution parameterized by the mean μ and the variance σ². Then, the loss function L_NLL may be given by L_NLL = log σ + (s − μ)² / (2σ²) + c, where c is a constant and s is the subjective listening score. The constant c may be given by, for example, c = (1/2) log(2π).
In some embodiments, the probability distribution may be related to a logistic distribution parameterized by the mean μ and the scale a. Then, the loss function L_NLL may be given by L_NLL = log a + (s − μ)/a + 2 log(1 + exp(−(s − μ)/a)) + c, where c is a constant and s is the subjective listening score. The constant c may, for example, be equal to zero.
These two parameterizations of the probability distribution have been found to provide efficient training at the training stage and meaningful output when inferred.
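By way of illustration, the two loss functions above may be implemented as follows (an illustrative sketch; the function names are not part of the disclosure):

```python
import math

def gaussian_nll(s, mu, sigma2):
    # L = log(sigma) + (s - mu)^2 / (2 sigma^2) + c, with c = 0.5 * log(2*pi)
    return (0.5 * math.log(sigma2)
            + (s - mu) ** 2 / (2 * sigma2)
            + 0.5 * math.log(2 * math.pi))

def logistic_nll(s, mu, a):
    # L = log(a) + (s - mu)/a + 2 * log(1 + exp(-(s - mu)/a))
    z = (s - mu) / a
    return math.log(a) + z + 2 * math.log1p(math.exp(-z))
```

Both losses attain their minimum (over s) at s = μ and are symmetric around μ; the logistic loss grows only linearly in |s − μ| far from the mean, making it less sensitive to outlier listener scores than the Gaussian loss.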
In some embodiments, the training data item may also indicate a representation of the audio signal and a representation of a reference audio signal of the audio signal.
Thus, the DNN under consideration may be configured to automate an intrusive listening test with the specific characteristics and advantages listed above.
In some embodiments, the representation of the audio signal and the representation of the reference audio signal may be related to gammatone spectrograms.
Gammatone spectrograms are auditory features that are specifically adapted to human hearing and perception, and thus allow meaningful results to be achieved with reduced computational complexity.
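By way of illustration, the filterbank underlying a gammatone spectrogram is typically built on center frequencies equally spaced on the ERB-rate scale. The following sketch (illustrative only; it omits the actual gammatone filtering and the spectrogram computation) shows such a frequency layout:

```python
import math

def erb_center_frequencies(n_bands, f_min=50.0, f_max=16000.0):
    # Center frequencies equally spaced on the ERB-rate scale, as commonly
    # used for gammatone filterbanks.
    def hz_to_erb_rate(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)

    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    e_min, e_max = hz_to_erb_rate(f_min), hz_to_erb_rate(f_max)
    step = (e_max - e_min) / (n_bands - 1)
    return [erb_rate_to_hz(e_min + i * step) for i in range(n_bands)]
```

The bands are densely spaced at low frequencies and sparse at high frequencies, mimicking the frequency resolution of the human auditory system.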
In some embodiments, the predefined listening test may be, for example, a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening test, as standardized in ITU-R Recommendation BS.1534.
In some embodiments, the DNN may implement a generative model.
Another aspect of the disclosure relates to a method of estimating an indication of subjective listening scores of an audio signal using a DNN. The listening score may be a score according to a predefined listening test. The DNN may comprise an input stage for receiving a representation of the audio signal and a representation of a reference audio signal of the audio signal. The DNN may also include a plurality of layers for performing processing based on the representation of the audio signal and the representation of the reference audio signal. The processing by the multiple layers may also be based on the current state of the DNN, e.g., current values of the internal parameters of the DNN. The DNN may further comprise an output stage connected to a last layer of the plurality of layers for generating an indication of the listening score. The method may include inputting the representation of the audio signal and the representation of the reference audio signal to the input stage. The method may further include determining a representation of the indication of the listening score based on an output of the output stage.
In some embodiments, the indication of the listening score may be related to a probability distribution of the listening score, wherein the output stage of the DNN is adapted to generate the probability distribution of the listening score. The probability distribution may simulate the listening scores obtained by a plurality of (subjective) listening tests on the audio signal. The probability distribution may be parameterized by two or more parameters of the probability distribution.
In some embodiments, determining the representation of the probability distribution may include determining at least one of a mean, a standard deviation, and a confidence interval from the output of the output stage.
In some embodiments, the confidence interval may be determined based on the output of the output stage and the number of listeners (i.e., the listener count) of the listening test to be emulated.
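By way of illustration, a confidence interval for the mean score of an emulated test with N listeners may be derived from the predicted distribution parameters roughly as follows (illustrative sketch assuming a normal approximation and a 95% level; the function name is not part of the disclosure):

```python
import math

def confidence_interval(mu, sigma, n_listeners, z=1.96):
    # 95% confidence interval of the mean score over n_listeners simulated
    # independent listeners, where sigma is the predicted per-listener
    # standard deviation of the score.
    half_width = z * sigma / math.sqrt(n_listeners)
    return (mu - half_width, mu + half_width)
```

The interval narrows with the square root of the listener count, mirroring how confidence intervals shrink in an actual listening test as more listeners participate.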
In some embodiments, the probability distribution may be related to a Gaussian distribution parameterized by the mean μ and the variance σ², or to a logistic distribution parameterized by the mean μ and the scale a.
In some embodiments, the representation of the audio signal and the representation of the reference audio signal may be related to gammatone spectrograms.
In some embodiments, the predefined listening test may be a MUSHRA listening test.
Another aspect of the disclosure relates to a DNN for estimating an indication of subjective listening scores of an audio signal. The listening score may be a score according to a predefined listening test. The DNN may include an input stage for receiving a representation of an audio signal and a representation of a reference audio signal of the audio signal. The DNN may also include a plurality of layers for performing processing based on the representation of the audio signal and the representation of the reference audio signal. The processing by the multiple layers may also be based on the current state of the DNN, e.g. the current value of an internal parameter of the DNN. The DNN may also include an output stage coupled to a last layer of the plurality of layers for generating an indication of the listening score.
In some embodiments, the DNN may have been configured by inputting one or more training data items in a training round among a plurality of training rounds, each training data item indicating a respective value of the listening score. Training the DNN in the training round may further include determining a respective indication of a listening score based on one or more training data items. Training the DNN in the training round may further comprise determining respective loss values for one or more training data items by evaluating the loss function. This loss function may depend on an indication of the listening score.
Training the DNN in the training round may also include adjusting one or more internal parameters of the DNN based on the determined loss value. Configuring the DNN may involve or correspond to obtaining internal parameters of the DNN by training the DNN.
In some embodiments, the indication of the listening score may be related to a probability distribution of the listening score, wherein the output stage of the DNN is adapted to generate the probability distribution of the listening score. The probability distribution may simulate a listening score obtained by a plurality of listening tests on the audio signal. In addition, the probability distribution may be parameterized by two or more parameters of the probability distribution.
In some embodiments, determining the respective indications of the listening scores based on the one or more training data items may include determining respective parameters of the probability distribution based on the one or more training data items. Determining the parameters of the probability distribution may be based at least in part on the value of the subjective listening score. In addition, determining the parameters of the probability distribution may be based on a current state of the DNN, e.g., current values of the internal parameters of the DNN. The loss function may depend on the parameters of the distribution.
In some embodiments, the probability distribution may be related to a Gaussian distribution parameterized by the mean μ and the variance σ², or to a logistic distribution parameterized by the mean μ and the scale a.
In some embodiments, the representation of the audio signal and the representation of the reference audio signal may be related to gammatone spectrograms.
In some embodiments, the predefined listening test may be a MUSHRA listening test.
Another aspect of the disclosure relates to an apparatus that includes a processor and a memory coupled to the processor and storing instructions for the processor. The processor may be adapted to perform the methods according to the foregoing aspects and their embodiments.
Another aspect of the disclosure relates to an apparatus that includes a processor and a memory coupled to the processor and storing instructions for the processor. The processor may be adapted to execute DNNs according to the foregoing aspects and their embodiments.
Another aspect of the disclosure relates to a program comprising instructions which, when executed by a processor, cause the processor to perform a method according to the preceding aspects and their embodiments.
Another aspect of the disclosure relates to a program comprising instructions which, when executed by a processor, cause the processor to implement DNNs according to the foregoing aspects and embodiments thereof.
Another aspect of the disclosure relates to a computer-readable storage medium storing any one of the aforementioned programs.
Another aspect of the disclosure is directed to a method of evaluating playout performance in an adaptive streaming environment. The playout performance may be related to the (subjective) playout quality. The method may include obtaining playout related information from a streaming client. The playout related information may for example comprise, correspond to, or be in the form of metadata. The method may further comprise estimating a representation of a test audio signal based on the playout related information. The test audio signal may be an audio signal played out by the streaming client. Estimating the representation of the test audio signal may relate to or correspond to reconstructing the test audio signal or a representation thereof. The representation may for example relate to a set of features or spectrograms of the test audio signal. The method may further include determining an estimate of the audio quality of the test audio signal based on the estimated representation of the test audio signal using an audio quality assessment algorithm. The audio quality assessment algorithm may be an objective audio quality assessment algorithm. In addition, the audio quality assessment algorithm may simulate, for example, an intrusive audio quality test (e.g., a listening test), such as a MUSHRA test. It is understood that the estimated representation of the test audio signal is in a form suitable for input to the audio quality assessment algorithm. For example, the representation of the test audio signal may be associated with a predefined number of segments to accommodate requirements of the audio quality assessment algorithm in relation to the time span covered by the input audio signal or its representation.
Thus, the proposed method allows estimating the result of an intrusive listening test without requiring knowledge of the reference signal at the streaming client. With this configuration, the intrusive listening test may be performed at a network node remote from the streaming client. The streaming client need only provide lightweight metadata to the network node performing the test. Using the metadata, both a version of the played out audio and a reference for the played out audio may be derived at the network node. As a result, the proposed method can produce meaningful and easily interpretable estimates of the audio quality of audio content played out by a streaming client without significant additional signaling overhead to or from the streaming client.
In some embodiments, the method may further comprise generating a representation of a reference audio signal of the test audio signal. This may include obtaining (e.g., receiving) audio content or a representation thereof from a content repository. It is understood that the representation of the reference audio signal is in a form suitable for input to the audio quality assessment algorithm. For example, the representation of the reference audio signal may be associated with a predefined number of segments to accommodate requirements of the audio quality assessment algorithm in relation to the time span covered by the input audio signal or its representation.
In some embodiments, the method may further include obtaining an indication of the audio content processed by the streaming client from the streaming client. The indication of the audio content may include an identifier of the audio content (such as a file name, etc.), a bit rate level of the audio content, and/or information about segments of the audio content. The indication of the audio content may be used to determine a series of audio clips received by the streaming client. The audio content processed by the streaming client may be audio content received by the streaming client, for example, from a content delivery network.
In some embodiments, estimating the representation of the test audio signal may also be based on an indication of the audio content.
In some embodiments, generating the representation of the reference audio signal may be based on an indication of the audio content. This may include obtaining (e.g., receiving) audio content or a representation thereof from a content repository based on an indication of the audio content played out by the streaming client.
In some embodiments, the playout-related information may include bit rate information indicating a bit rate of an audio signal played out by the streaming client. The representation of the test audio signal may then be estimated based on the bit rate information. The bit rate information may be provided for each of a plurality of segments of the audio signal played out by the streaming client.
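By way of illustration, per-segment bit rate metadata may be mapped back to the corresponding encoded versions roughly as follows (illustrative sketch; the metadata field names and the repository layout are assumptions, not part of the disclosure):

```python
def reassemble_test_signal(metadata, repository):
    # metadata: per played-out segment, a dict with 'content_id', 'segment'
    #           (index), and 'bitrate' (the level chosen by the client).
    # repository: maps (content_id, segment, bitrate) to the stored
    #             representation of that encoded segment.
    # Returns the sequence of segment representations the client played out,
    # from which the test signal representation can be estimated.
    return [repository[(m["content_id"], m["segment"], m["bitrate"])]
            for m in metadata]
```

Because only identifiers and bit rate levels travel upstream, the test node can re-create the played-out signal from its own copy of the encoded content instead of receiving audio from the client.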
In some embodiments, the audio quality assessment algorithm may use a set of pre-trained models for audio quality assessment. Generating an estimate of audio quality may include selecting a pre-trained model among the set of pre-trained models based on playout-related information.
Thereby, it may be ensured that an optimal model (e.g. a model specifically trained for each relevant situation) is used for each situation, thereby improving the reliability of the estimation of playout performance.
In some embodiments, the playout related information may include information related to a playout device associated with the streaming client. The pre-trained model may then be selected based on information related to the playout device. The information related to the playout device may include an indication of the playout device (e.g., headphones, a sound bar, a discrete speaker, etc.) and/or an indication of a characteristic of the playout condition (e.g., SNR, etc.).
Thus, an appropriate model for audio quality assessment may be used for each of a plurality of different playout device configurations, thereby improving the reliability of the estimation of playout performance.
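By way of illustration, such a model selection may be sketched as follows (illustrative only; the device labels and field names are assumptions):

```python
def select_model(models, playout_info, default="generic"):
    # models: maps a playout device type (e.g., "headphones", "soundbar")
    #         to a pre-trained audio quality assessment model.
    # playout_info: playout-related information from the client, possibly
    #               carrying a "device" field.
    # Falls back to a generic model when the device is unknown or unreported.
    device = playout_info.get("device", default)
    return models.get(device, models[default])
```

Keeping a generic fallback model ensures that an assessment is still produced for clients that do not report their playout device.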
In some embodiments, the audio quality assessment algorithm may be implemented by a deep neural network DNN for estimating an indication of the subjective listening score for a representation of the test audio signal as an estimate of the audio quality. The listening score may be a score according to a predefined listening test. The DNN may comprise an input stage for receiving a representation of the test audio signal and a representation of a reference audio signal of the test audio signal. The DNN may also include a plurality of layers for performing processing based on the representation of the test audio signal and the representation of the reference audio signal. The DNN may also include an output stage for generating an indication of the listening score.
In some embodiments, the DNN may have been configured by inputting one or more training data items in a training round among a plurality of training rounds, each training data item indicating a respective value of the listening score. The DNN may also have been configured by determining a respective indication of a listening score based on one or more training data items in a training round. The DNN may also have been configured by determining respective loss values for one or more training data items in a training round by evaluating a loss function, wherein the loss function depends on the indication of the listening score. The DNN may also have been configured by adjusting one or more internal parameters of the DNN based on the determined loss values in a training round.
In some embodiments, the method may be implemented at a different network node than the streaming client.
In some embodiments, the estimated representation of the test audio signal may be related to one or more gammatone spectrograms.
In some embodiments, the representation of the reference audio signal may be related to one or more gammatone spectrograms.
In some embodiments, the method may further include outputting an estimate of the audio quality of the test audio signal to a network node different from the network node associated with the streaming client. The network nodes may be, for example, network nodes in a cloud-based framework.
In some embodiments, an estimate of the audio quality of the test audio signal may be output to a network node for performing encoding and/or packaging of the audio content. The method may then further comprise optimizing the encoding and/or packaging based on the estimation of the audio quality of the test audio signal.
In some embodiments, the method may further comprise determining an optimal number of quality levels in the bitrate ladder for distribution over the content delivery network based on the estimate of the audio quality of the test audio signal. This may, for example, involve inputting the estimate of the audio quality of the test audio signal to a utility function for determining the optimal number of quality levels.
In some embodiments, the method may further comprise determining a configuration of the encoding tools and/or a set of encoding tools based on an estimate of the audio quality of the test audio signal.
In some embodiments, the method may further include determining an estimate of the audio quality of the test audio signal for streaming clients in each of a plurality of groups of streaming clients. The method may further include comparing the estimates of the audio quality determined for the plurality of groups of streaming clients.
This may allow comparing different content delivery methods and/or playout methods for determining an optimal delivery method and/or playout method.
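By way of illustration, such a group comparison (e.g., for an A/B test of two delivery methods) may be sketched as follows (illustrative only; the data layout is an assumption):

```python
def compare_groups(group_scores):
    # group_scores: maps a group name (e.g., an A/B test arm) to the list of
    #               estimated audio quality scores of the clients in that group.
    # Returns the per-group mean quality and the best-scoring group.
    means = {group: sum(scores) / len(scores)
             for group, scores in group_scores.items()}
    best = max(means, key=means.get)
    return means, best
```

In practice, the comparison would typically also take the confidence intervals of the estimates into account before declaring one delivery or playout method superior.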
Another aspect of the disclosure relates to a method of providing playout related information at a streaming client processing audio content in an adaptive streaming environment. The method may include generating playout-related information by one or more of: analyzing a playout buffer associated with the streaming client to determine bit rate information indicative of bit rates of segments of the audio signal played out by the streaming client; analyzing manifest information associated with the audio content; and analyzing characteristics of a playout device associated with the streaming client. The method may further include outputting the playout related information to a network node different from a network node associated with the streaming client.
Another aspect of the disclosure relates to an apparatus that includes a processor and a memory coupled to the processor and storing instructions for the processor. The processor may be adapted to perform the method according to any of the two preceding aspects and their embodiments.
Another aspect of the disclosure relates to a program comprising instructions which, when executed by a processor, cause the processor to perform a method according to any of the two preceding aspects and their embodiments.
Another aspect of the disclosure relates to a computer-readable storage medium storing the program of the preceding aspect.
It should be noted that the methods and systems as outlined in the present disclosure, including the preferred embodiments thereof, may be used alone or in combination with other methods and systems disclosed in this document. Furthermore, all aspects of the methods, apparatus, and systems outlined in the present disclosure may be arbitrarily combined.
In particular, the features of the claims may be combined with each other in any way.
It will be appreciated that the apparatus features and method steps may be interchanged in many ways. In particular, as will be recognized by a person skilled in the art, the details of the disclosed method may be realized by corresponding means, and vice versa. Moreover, any of the above statements made with respect to methods (and e.g. their steps) are understood to apply equally to the corresponding devices (and e.g. their blocks, stages, units) and vice versa.
Detailed Description
The present disclosure relates to techniques for estimating the audio quality of streamed content in an adaptive streaming environment, and to techniques for configuring (e.g., training) and using DNNs for estimating audio quality, which will be described in turn.
Machine listener based audio streaming quality
Broadly, a portion of the present disclosure relates to techniques (e.g., methods, apparatuses, and systems) for performing objective quality testing of audio in the context of content streaming (e.g., adaptive streaming, and in particular adaptive audio streaming) over, for example, the internet. A system implementing such techniques may include a cloud-based service that receives playout-related information (e.g., playout-related parameters) from a client (streaming client) or a group of clients (i.e., a community of clients) and then calculates an objective quality score by simulating a subjective quality assessment test, e.g., according to the MUSHRA methodology (e.g., as standardized in ITU-R Recommendation BS.1534).
The proposed techniques facilitate the evaluation of audio quality from encoding to actual client playout. They also allow groups of clients to be rated in terms of the audio quality delivered to these clients.
In addition, the proposed techniques may be used for monitoring of audio experiences. These techniques may also be used to perform A/B testing (e.g., bucket testing or split testing) on a population of real-world streaming clients (e.g., experiments to compare the performance of bit rate ladders and/or codecs). The system may perform quality analysis in an online mode or in an offline setting. In online mode, audio quality may be estimated in real time as streaming progresses, based on a feedback channel between the client and the service. In an offline setting, the service may first collect all parameter data from clients participating in the experiment and then perform an analysis based on the data.
Note that the proposed techniques can be applied to the so-called generative machine listener described later in this disclosure. As will be explained in more detail later, the generative machine listener is a neural network (e.g., a DNN) trained to evaluate audio by comparing the audio with a reference signal and providing an evaluation result (e.g., as a probability distribution with a mean value corresponding to a MUSHRA score and a confidence value corresponding to the confidence interval that would be expected in a listening test performed on such material).
Definitions
Intrusive quality assessment requires access to both the reference signal and the test signal. Established subjective test methodologies use this approach. There are non-intrusive quality methods that only require access to the test signal. However, they are less reliable, and their results may be difficult to interpret.
Objective quality assessment algorithms facilitate estimation of the quality of experience of human observers without actually employing human observers. For example, a generative machine listener performs objective quality assessment by predicting the quality scores that would be obtained in subjective tests with human observers. In particular, the machine listener facilitates, for example, estimation of the mean performance score along with an associated confidence interval.
Adaptive streaming is a content delivery method in which content is available in multiple quality versions associated with different bit rates; the higher the bit rate, the higher the quality. The content player includes a policy that attempts to determine the highest possible bit rate at which content segments can still be delivered in time, i.e., before they are due to be played out (content segments are downloaded, inserted into the playout buffer, and scheduled for playout). In other words, the adaptive streaming strategy attempts to maximize the quality of experience while keeping the probability of playout buffer exhaustion below some reasonably low threshold (i.e., the probability of rebuffering remains small).
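By way of illustration, such a policy may be sketched as follows (illustrative only; a real player policy would additionally involve throughput estimation, buffer dynamics, and hysteresis):

```python
def pick_bitrate(ladder_bps, throughput_bps, segment_sec, buffer_sec, safety=0.5):
    # Choose the highest bit rate on the ladder whose segment can be downloaded
    # well before the playout buffer (holding buffer_sec of audio) runs dry.
    # 'safety' reserves part of the buffer as headroom against throughput drops.
    for bps in sorted(ladder_bps, reverse=True):
        download_time = bps * segment_sec / throughput_bps
        if download_time <= safety * buffer_sec:
            return bps
    # No level fits the budget: fall back to the lowest quality to keep playing.
    return min(ladder_bps)
```

For example, with 4-second segments, an 8-second buffer, and 100 kbit/s of throughput, the sketch skips a 128 kbit/s level (download would take 5.12 s) and settles on 64 kbit/s.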
A bitrate ladder is a set of versions of the content at different quality levels and hence different bit rates. The bit rates in the ladder are designed to facilitate streaming under a wide variety of throughput scenarios (e.g., very low to very high throughput). The adaptive streaming strategy selects an appropriate quality level from the bitrate ladder per segment. Information about the bitrate ladder is typically supplied to the content player in a so-called manifest.
Description of example embodiments
Fig. 1 depicts an example of a quality assessment service that incorporates a machine listener in a framework for adaptive streaming (e.g., as part of the quality assessment service). The machine listener facilitates an intrusive quality assessment of the audio experience at the client without providing the client with an unencoded reference.
As mentioned above, quality assessment using intrusive algorithms in an (adaptive) streaming environment is hampered by the fact that streaming clients typically cannot access reference signals. Providing a streaming client with a reference signal would typically require out-of-band delivery of the reference signal, which is strongly discouraged by bandwidth limitations.
Embodiments of the present disclosure facilitate performance of intrusive tests, e.g., within a cloud or network service, at nodes (test nodes, network nodes) that can supply reference signals. Instead of sending the client's playout signal upstream to the test node, the test signal is reassembled at the test node based on lightweight playout-related metadata that can be collected by an instrumented client and sent upstream to the service.
In the example of fig. 1, the streaming client 10 receives encoded audio content 5 (e.g., audio content or video content with associated audio content) from a Content Delivery Network (CDN) 105, for example, via the internet. The CDN 105 may provide different versions of a given content, e.g., at different bitrates (e.g., using different settings within a predefined bitrate ladder), depending on streaming client configuration and/or network conditions, etc. On the other hand, streaming client 10 may be configured to employ adaptive bitrate control to request content at different bitrates to maximize playout quality and/or user experience.
After appropriate decoding, the streaming client 10 plays out the audio content via a play-out buffer, e.g. in a segment-by-segment manner. At the same time, the streaming client 10 performs a playout analysis, e.g., via a playout analysis block 120, to generate playout related information 20 (e.g., playout related metadata or playout metadata). In this sense, the streaming client 10 acts as, implements, or includes an instrumented client that collects and forwards playout related information 20. The playout related information 20 is provided to a quality assessment service 150 (e.g., a machine listener service) or retrieved by the quality assessment service 150 (e.g., a machine listener service). In general, it can be said that the playout related information 20 is provided to or retrieved by the test node. In addition, an indication of the audio content processed by the streaming client 10 is provided to or retrieved by the quality assessment service 150 (or test node) to enable the quality assessment service 150 to generate a reference signal related to the content processed by the streaming client 10 for intrusive quality assessment. Here, the indication of the audio content may comprise an identifier of the audio content, such as a file name (e.g. a file name of an audio clip), etc., a bit rate level of the audio content and/or information about the (current) segment of the audio content. The indication of the audio content may be used to determine a series of audio clips that are received and played out by the streaming client 10. The audio content processed by the streaming client 10 may be audio content received by the streaming client 10, e.g., from the CDN 105. For example, the indication of the audio content may be obtained by blocking requests from the streaming media client 10 to the CDN 105.

The quality assessment service 150 may be in the form of a web service or a cloud service. In addition, the quality assessment service 150 may be configured, for example, to perform a method of evaluating playout performance in an adaptive streaming environment, such as the method 200 described below. To this end, the quality assessment service 150 may include a trained network 40 and a model selector 145 for selecting an appropriately trained model among a set of models based on the playout-related information 20. The quality assessment service 150 may also include a recreated test signal block 130 for estimating the test signal 30, and a reference lookup block 160 for estimating the reference signal 60 of the test signal 30.
An example of a method 200 of evaluating playout performance in an adaptive streaming environment (e.g., by the quality assessment service 150 of fig. 1) is illustrated in the flow chart of fig. 2. The playout performance may be related to, for example, the (subjective) playout quality. The method 200 comprises steps S210 to S230 and an optional step S240. The method may be implemented at a different network node (e.g., test node) than the streaming client 10. In addition, it may be implemented at a different network node than the CDN 105. Here, the network node may be a network node in a cloud-based framework, for example.
At step S210, playout related information is obtained from the streaming client. The playout related information may for example comprise metadata, correspond to metadata or be in the form of metadata.
At step S220, a representation of the test audio signal is estimated based on the playout related information. Here, the test audio signal is the audio signal played out by the streaming client. Estimating the representation of the test audio signal may relate to or correspond to reconstructing the test audio signal or a representation thereof. The representation may relate to a set of features or to spectrograms (e.g., Gammatone spectrograms) of the test audio signal, for example.
At step S230, an estimate of the audio quality of the test audio signal is determined based on the estimated representation of the test audio signal, using an audio quality assessment algorithm. The audio quality assessment algorithm may be an objective audio quality assessment algorithm. In addition, the audio quality assessment algorithm may simulate, for example, an intrusive audio quality test (e.g., a listening test), such as a MUSHRA test. It is understood that the estimated representation of the test audio signal is in a form suitable for input to the audio quality assessment algorithm. For example, the representation of the test audio signal may be associated with a predefined number of segments to accommodate requirements of the audio quality assessment algorithm in relation to the time span covered by the input audio signal or its representation.
At step S240, which may be optional, an estimate of the audio quality of the test audio signal is output to a network node different from the network node associated with the streaming client. Non-limiting examples of using an estimate of the audio quality of the test audio signal will be described below with reference to fig. 6, 7, 8, 9, 10 and 11.
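By way of non-limiting illustration, the flow of steps S210 to S240 may be sketched as follows. All function and type names are hypothetical and not part of the disclosure; the "quality assessment" is a trivial placeholder standing in for an intrusive audio quality assessment algorithm such as a machine listener.

```python
from dataclasses import dataclass

@dataclass
class PlayoutInfo:
    """Hypothetical container for the playout-related information of step S210."""
    content_id: str
    segment_bitrates: list  # bitrate (kbps) per played-out segment
    playout_device: str     # e.g. "headphones" or "speakers"

def estimate_test_representation(info: PlayoutInfo) -> list:
    # Step S220: reconstruct a representation of the test audio signal (here,
    # just per-segment labels standing in for per-segment spectrograms).
    return [("seg%d" % i, b) for i, b in enumerate(info.segment_bitrates)]

def assess_quality(representation: list) -> float:
    # Step S230: placeholder "objective" score; a real system would run the
    # assessment algorithm against a reference signal.
    return sum(b for _, b in representation) / len(representation)

info = PlayoutInfo("clip42", [64, 96, 128], "headphones")  # obtained at step S210
rep = estimate_test_representation(info)                   # step S220
score = assess_quality(rep)                                # step S230
print(score)                                               # step S240: output the estimate
```

The placeholder score is simply the mean segment bitrate; it serves only to show how the output of step S230 is derived from the reconstructed representation of step S220.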
Configured as described above, the technology according to the present disclosure performs streaming media quality evaluation using a quality assessment service (including, for example, a machine listener) provided as a cloud service. The quality assessment service (e.g., machine listener) is a component that is independent of the content delivery system and independent of the streaming client. The quality assessment service (e.g., machine listener) has the following properties:
It allows the client device to perform an intrusive quality test of the playout without the need to supply the reference signal to the client device.
It allows to perform objective quality assessment, which can for example simulate mature subjective test methodologies such as MUSHRA.
By appropriate selection of the audio quality assessment algorithm, it can provide results in the form of a probability distribution (e.g., a MUSHRA score plus confidence interval).
Potential applications and advantages of techniques according to embodiments of the present disclosure may include the following:
Techniques according to the disclosed embodiments facilitate delivery with a generic CDN and allow decoupling of the quality assessment system from the CDN infrastructure, which may be advantageous. Thus, CDNs generally do not need to involve operating machine listener services, and there is no need to store reference signals in CDNs. The proposed technology also facilitates multi-CDN delivery.
Techniques according to the disclosed embodiments facilitate experiments such as evaluation of scenarios where quality estimates cannot be pre-computed. One example of such a scenario is a scenario where the number of ABR ladder level combinations in the playout buffer and the number of different ways of playout (e.g., speakers in a handheld device, headphones, discrete speakers) may be excessively large.
An example application of the system shown in the example of fig. 1 is A/B testing of bit rate ladders or codecs in a real-world content delivery scenario (e.g., a real population of clients in a real content delivery scenario).
Configured as described above, systems and methods according to the disclosed embodiments include components/steps that allow for simulating an intrusive listening test (e.g., a MUSHRA test) by operating a (generative) machine listener in the cloud, where the test signal is reconstructed (or partially reconstructed) within the service by sending playout related metadata from an instrumented client using a feedback channel. In addition, these systems and methods involve selection of an appropriate model for use by the (generative) machine listener from a collection of pre-trained models, based on the playout-related metadata.
In addition to the above, the method 200 may further comprise the step of generating a representation of a reference audio signal for the test audio signal, for use by the audio quality assessment algorithm (not shown in fig. 2). This may include obtaining (e.g., receiving) audio content or a representation thereof from a content repository (or content source in general). It is understood that the representation of the reference audio signal should be in a form suitable for input to the audio quality assessment algorithm. For example, the representation of the reference audio signal may be associated with a predefined number of segments to accommodate requirements of the audio quality assessment algorithm in relation to the time span covered by the input audio signal or its representation.
In addition to playout related information 20, the quality assessment service 150 may also require an indication (e.g., identification) of the audio content processed by the streaming client 10. Thus, the method 200 may further comprise the step of obtaining an indication of the audio content processed by the streaming client 10 from the streaming client 10 (not shown in fig. 2). As mentioned above, the indication of the audio content may include an identifier of the audio content (such as a file name, etc.), a bit rate level of the audio content, and/or information about the segments of the audio content. The indication of the audio content may be used to determine a series of audio clips received by the streaming client 10. The audio content processed by the streaming client 10 may be audio content received by the streaming client 10, e.g., from the CDN 105. Then, in case an indication of audio content is available, estimating the representation of the test audio signal at step S220 may also be based on the indication of audio content. In addition, the generation of the representation of the reference audio signal may also be based on an indication of the audio content. This may include obtaining (e.g., receiving) audio content or a representation thereof from a content repository based on an indication of the audio content played out by the streaming client.
If the playout related information 20 comprises bit rate information indicating the bit rate of the audio signal played out by the streaming client 10, estimating the representation of the test audio signal at step S220 may be based on the bit rate information. This bit rate information may be provided for each of a plurality of segments of the audio signal played out by the streaming client 10.
An example process for estimating representations of test signals and reference signals will be described below with reference to fig. 5.
Fig. 3A illustrates an example of a streaming client 10 that collects playout-related information (e.g., playout-related metadata) that is then aggregated and sent upstream to facilitate operation of a quality assessment service 150 (e.g., a machine listener service). Thus, FIG. 3A relates to the operation of an instrumented client implemented or included by streaming client 10.
An example of a corresponding method 400 of providing playout-related information at a streaming client processing audio content in an adaptive streaming environment is shown in the flow chart of fig. 4A. The method 400 performed at the streaming client comprises steps S410 and S420. The method 450 shown in the flowchart of fig. 4B is related to the details of step S410. The method 450 includes steps S460 to S480.
At step S410, playout related information is generated.
At step S420, playout-related information is output to a network node (e.g., test node) that is different from the network node associated with the streaming client. For example, as explained above, playout related information may be output to the quality assessment service 150.
As described above in the context of fig. 1, the method 400 may further include the step of providing an indication of audio content processed by the streaming client (not shown in the figures).
Steps S460 through S480 of method 450 in fig. 4B are related to the details and potential implementation of step S410 in method 400. It is understood that step S410 may include one or more, potentially all, of steps S460 through S480.
At step S460, a playout buffer associated with the streaming client is analyzed to determine bit rate information indicative of the bit rate of the segments of the audio signal played out by the streaming client. Thus, the bit rate information may indicate a respective bit rate for each of a plurality of sequential segments of the playout content. In addition to the bit rate information, the analysis of the playout buffer may also yield additional information about the composition of the playout buffer.
Here, it is understood that the playout buffer typically contains a series of segments. The change in bit rate may occur per segment, for example, due to the action of an ABR policy operating on the streaming client.
At step S470, manifest information associated with the audio content is analyzed. Analyzing the manifest information may yield information about the currently used bit rate ladder.
At step S480, properties of the playout device associated with the streaming client are analyzed. The properties of the playout device may relate to the type of reproduction system used and/or the type of device (e.g., headphone playout or speaker playout).
Thus, returning to fig. 3A, the playout analysis (e.g., by playout analysis block 120) may include one or more of a playout buffer analysis (e.g., at playout buffer analysis block 310), a manifest analysis (e.g., at manifest analysis block 320), and a playout device analysis (e.g., at playout device analysis block 330).
In addition, according to the above, the playout related information (e.g., metadata) may include information about the composition of the playout buffer (e.g., determined by playout buffer analysis), the composition of the currently used bit rate ladder (e.g., determined by manifest analysis), and/or the identity of the playout device (e.g., determined by playout device analysis).
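By way of non-limiting illustration, the playout analysis combining steps S460 to S480 may be sketched as follows (function names and the metadata layout are hypothetical):

```python
def analyze_playout(buffer_segments, manifest, device):
    """Hypothetical playout analysis combining steps S460-S480.

    buffer_segments: list of (segment_index, bitrate) pairs in the playout buffer
    manifest: dict describing the manifest of the audio content
    device: string identifying the playout device
    """
    return {
        # S460: composition of the playout buffer (per-segment bitrates)
        "buffer": [{"segment": i, "bitrate": b} for i, b in buffer_segments],
        # S470: currently used bitrate ladder, extracted from the manifest
        "ladder": manifest.get("bitrate_ladder", []),
        # S480: identity/properties of the playout device
        "device": device,
    }

meta = analyze_playout([(0, 64), (1, 128)],
                       {"bitrate_ladder": [32, 64, 96, 128]},
                       "headphones")
print(meta["ladder"], meta["device"])
```

The resulting dictionary corresponds to the playout related information (metadata) that the instrumented client sends upstream at step S420.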
The operation and nature of an instrumented client associated with streaming client 10 may be briefly summarized as follows.
The technology according to the present disclosure requires an instrumented client (which can easily be deployed to the client device), but does not require any other provisions (such as, for example, placing a reference signal on the client).
The playout analysis extracts playout related information (e.g. a series of segments in the playout buffer, information about the playout device, information about the content of the manifest) and sends it upstream as metadata.
The metadata may be used to recreate the test signal or features of the test signal (e.g., spectrograms, such as Gammatone spectrograms), and may be used to perform a lookup of the relevant reference signal. Given the two signals (i.e., the test signal and the reference signal), and after selecting an appropriately trained model, an indication of the quality score (e.g., a probability distribution representing the quality score) may be calculated by the quality assessment service 150.
Additionally, potential applications and advantages of an instrumented client and/or its operation may include the following:
The system can distinguish between different playout scenarios. For example, in some cases, headphone playout may be more critical than speaker playout. This may be reflected in the quality score generated for the client's playout. To achieve this, the cloud service performing the quality assessment may include a set of pre-trained models. For example, there may be a model trained on listening test data from tests performed through headphones. There may be another model trained on listening tests performed through discrete speakers. Since the two playout scenarios generally differ in terms of their criticality, it may be beneficial to use dedicated models for these scenarios and then use the appropriate model to perform the evaluation of the playout quality.
Fig. 3B is a block diagram schematically illustrating an example process of selecting a model (from a set of pre-trained models) that may be used by a quality assessment service 150 (e.g., a machine listener). Model selection is based on playout related information 20 sent upstream by instrumented client 10.
For example, different models may have been trained based on different training data associated with respective different use cases. In one embodiment, different models may have been trained for different device characteristics, such as different device types and/or different rendering systems (e.g., headphones or speakers).
Model selection may be performed at the model selection block 145. The selection may be made from a set of models 340 that includes individual models 350-1, 350-2, 350-3. Each of these models may relate to a machine listener or DNN trained to assess audio quality for a particular situation. For example, as described above, the models in the set of models 340 may have been trained for different device characteristics.
Returning to the method 200 of fig. 2, the audio quality assessment algorithm employed by this method may use a set of pre-trained models for audio quality assessment. Then, at step S230, determining (e.g., generating) an estimate of the audio quality may include selecting a pre-trained model among the set of pre-trained models based on the playout-related information.
For example, as explained above, the playout related information 20 may include information related to a playout device associated with the streaming client. The pre-trained model may then be selected based on information related to the playout device. The information related to the playout device may include an indication of the playout device (e.g., headphones, a sound bar, a discrete speaker, etc.) and/or an indication of a characteristic of the playout condition (e.g., SNR, etc.).
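By way of non-limiting illustration, the model selection based on the playout device may be sketched as follows (the registry, its keys, and the model identifiers are hypothetical):

```python
# Hypothetical registry of pre-trained models, keyed by playout scenario.
MODELS = {
    "headphones": "model_trained_on_headphone_listening_tests",
    "speakers": "model_trained_on_discrete_speaker_listening_tests",
}

def select_model(playout_info: dict, default="speakers"):
    """Pick a pre-trained model based on the playout device indicated
    in the playout-related information (metadata)."""
    device = playout_info.get("device", default)
    return MODELS.get(device, MODELS[default])

# A client reporting headphone playout is served by the headphone-trained model.
assert select_model({"device": "headphones"}) == MODELS["headphones"]
```

A deployed service could additionally key the selection on playout conditions (e.g., SNR), falling back to a default model when no matching entry exists, as the `default` argument sketches.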
In some embodiments, the quality assessment service may be associated with a generative machine listener (e.g., a stereo generative machine listener) implemented by a DNN (e.g., as the aforementioned audio quality assessment algorithm), using the algorithm described below in the section "Machine listener". This algorithm may operate on spectrograms (e.g., Gammatone spectrograms) calculated from the test and reference signals, rather than directly on the waveforms. This means that both the test and reference signals can be assembled from pre-computed blocks (e.g., blocks of pre-computed spectrograms), which can reduce cloud storage requirements and computational burden. This is particularly advantageous from the standpoint of reducing the cost of running the machine listener in the cloud.
The spectrogram (or other audio feature) may be calculated per segment based on segments introduced by the transport mechanism used by the content delivery system (e.g., by the CDN 105).
Thus, the (stereo) generative machine listener may operate on the Gammatone spectrograms of the left, right, mid, and side signals of the reference and encoded stereo signals (e.g., as described in [5]). Gammatone filters are a popular approximation of the filtering performed by the human ear. Thus, Gammatone-based spectrograms may be regarded as a more perceptually driven representation than traditional spectrograms. For example, a Gammatone spectrogram of the audio signal may be calculated using a window size of 80 ms, a hop size of 20 ms, and 32 frequency bands ranging from 50 Hz up to 24 kHz. The resulting Gammatone spectrograms may be pre-computed, paired, and stacked along the channel dimension for short segments of the reference and encoded signals (e.g., 1-second signals in the ABR ladder), resulting in, for example, an input size of 8 x 32 x 50 (channel x band x time frame) to the neural network.
Thus, in some embodiments, the previously mentioned estimated representation of the test audio signal and the previously mentioned representation of the reference audio signal may each relate to one or more Gammatone spectrograms (e.g., left (L), right (R), mid (M), and side (S) spectrograms).
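By way of non-limiting illustration, the bookkeeping implied by the example parameters above (80 ms window, 20 ms hop, 32 bands, L/R/M/S signals paired for reference and coded versions) may be sketched as follows. The function name is hypothetical, the Gammatone filtering itself is not implemented, and the frame count assumes padding such that the number of frames equals duration divided by hop size.

```python
def gammatone_input_shape(duration_ms=1000, hop_ms=20, n_bands=32,
                          signals=("L", "R", "M", "S"), pair=True):
    """Shape of the stacked network input for one short segment.

    Assumes (as a simplification) that the analysis is padded so that the
    number of time frames equals duration / hop; the 80 ms window length
    then affects only the analysis, not the frame count.
    """
    n_frames = duration_ms // hop_ms                # 1000 / 20 = 50 frames
    n_channels = len(signals) * (2 if pair else 1)  # reference + coded pairs
    return (n_channels, n_bands, n_frames)

print(gammatone_input_shape())  # (8, 32, 50), matching the example in the text
```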
However, it is understood that techniques (e.g., methods and apparatus) according to the present disclosure are not limited to use with spectrograms (e.g., Gammatone spectrograms), but that these techniques may likewise operate on waveforms or other audio features. Also, spectrograms other than Gammatone spectrograms may be used for this purpose, such as other perceptually driven spectrograms. However, for simplicity of presentation and without intended limitation, the following will refer to spectrograms (in particular Gammatone spectrograms) as a stand-in for generic spectrograms or waveforms.
Fig. 5 schematically illustrates a process of assembling a reference signal and reconstructing a test signal based on playout related information 20 (e.g., metadata), which information 20 is sent upstream by an instrumented client 10 to a quality assessment service 150. In some embodiments, the data flow process for feeding the quality assessment model (e.g., a generative machine listener model, such as the trained network 40 in fig. 1) may include several pre-computation steps.
The pre-computed spectrograms (e.g., Gammatone spectrograms) are stored per segment in a reference repository 580. Based on the indication of (segments of) the audio content played out by the streaming client 10 (e.g., IDs and segments of the played-out content items), the reference repository 580 is queried by the reference lookup block 585 for the segment-by-segment assembly of the reference signal at the assembly reference block 590. The assembled reference signal 560 is provided to an audio quality assessment algorithm (such as a machine listener) for audio quality assessment at the machine listener analysis block 540.
In addition, playout related information 20 is provided to the assembly test signal block 575 along with an indication of (segments of) audio content (e.g., IDs and segments of the playout content item) being played out by the streaming client 10. Based on the playout related information 20 (e.g. bit rate, codec configuration) the content repository 570 is queried to assemble the test signal again segment by segment. This produces the aforementioned representation of the test audio signal 530 for input to an audio quality assessment algorithm (such as a machine listener) for audio quality assessment at the machine listener analysis block 540. The assembly of the representation of the test audio signal 530 may for example correspond to step S220 of the method 200.
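By way of non-limiting illustration, the segment-wise assembly of reference and test representations from pre-computed blocks may be sketched as follows (the repositories, their key layout, and the placeholder spectrogram strings are hypothetical):

```python
# Hypothetical repositories mapping segment keys to pre-computed spectrogram blocks.
REFERENCE_REPO = {("clip42", 0): "ref_spec_0", ("clip42", 1): "ref_spec_1"}
CONTENT_REPO = {("clip42", 0, 64): "test_spec_0@64", ("clip42", 1, 128): "test_spec_1@128"}

def assemble_reference(content_id, segments):
    # Reference lookup/assembly (cf. blocks 585 and 590): query the
    # reference repository per segment, in playout order.
    return [REFERENCE_REPO[(content_id, s)] for s in segments]

def assemble_test(content_id, segment_bitrates):
    # Test-signal assembly (cf. block 575): query the content repository
    # using the per-segment bitrates from the playout related information.
    return [CONTENT_REPO[(content_id, s, b)] for s, b in segment_bitrates]

ref = assemble_reference("clip42", [0, 1])
test = assemble_test("clip42", [(0, 64), (1, 128)])
```

Both assembled sequences would then be fed to the machine listener analysis (block 540) for quality assessment.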
Although not shown in fig. 5, it is understood that playout-related information 20 may also be provided to the machine listener analysis block 540 for model selection as described above.
Based on the assembled representation of the test audio signal 530 and the assembled reference signal 560, an audio quality assessment algorithm may generate an estimate of the audio quality of the test audio signal, as explained above with reference to step S230 of the method 200.
FIG. 6 is a non-limiting example of a graphical user interface showing results provided by techniques according to the disclosed embodiments. The GUI includes indicators/selectors 610 through 650 for the available levels of the bit rate ladder, and an indication 660 of the audio quality score of the test signal. This indication 660 may include, for example, a mean and confidence interval of subjective listening scores (e.g., MUSHRA scores).
Downstream application for estimation of audio quality
Techniques according to the present disclosure may operate in both online and offline settings.
In an online setting, quality analysis may be performed dynamically, and performance scores (e.g., estimates of audio quality) may be distributed to wherever they are needed in the delivery system.
The offline setting includes aggregating playout related information (e.g., playout metadata) from multiple streaming clients (e.g., two groups of clients for A/B testing). The test signals may be constructed from the collected playout-related information in an offline manner (e.g., after the experiment is completed). The quality assessment service (e.g., a machine listener) may then perform the quality analysis offline, providing performance statistics for the client (or clients).
In general, whether an online setting or an offline setting is applied, techniques according to this disclosure may be used to compare streaming clients in different groups of streaming clients, or to compare different groups of streaming clients.
An example of a corresponding method 700 is schematically illustrated in the flow chart of fig. 7. Method 700 includes steps S710 and S720 and may be performed after method 200 described above or in conjunction with method 200 described above.
At step S710, an estimate of the audio quality of the test audio signal for the streaming client in each of the plurality of populations of streaming clients is determined. This may be done as described above in the context of method 200.
At step S720, estimates of audio quality determined for a plurality of groups of streaming clients are compared to each other. Typically, the determined estimate is analyzed.
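By way of non-limiting illustration, the comparison of step S720 may be sketched as follows (function name hypothetical; a real analysis would typically add a statistical significance test):

```python
def compare_populations(scores_a, scores_b):
    """Compare mean estimated quality scores of two populations of
    streaming clients (e.g., for an A/B test).

    scores_a / scores_b: per-client quality estimates (e.g., on a MUSHRA-like
    scale), determined as in step S710. Returns the difference of the
    population means.
    """
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return mean_a - mean_b

diff = compare_populations([82.0, 78.0, 80.0], [74.0, 76.0, 72.0])
print(diff)  # positive: population A outperforms population B
```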
Examples, applications and use cases of such comparisons are schematically illustrated in the block diagrams of fig. 8 and 9.
In the example of fig. 8, the content server 850 (content source) provides respective (audio) content 852, 854 to first and second content delivery networks CDN1 830 and CDN2 840. The first CDN 830 provides content 835 to a first population 810 of streaming clients 815. The second CDN 840 provides content 845 to a second population 820 of streaming clients 825. The streaming clients of both populations 810, 820 provide respective sets of playout related information 870, 880 to a quality assessment service 860, which quality assessment service 860 determines respective estimates 865 of the audio quality (or estimates of playout performance in general) for the populations 810, 820 by means of the techniques set forth above, based on the respective sets of playout related information 870, 880. Determining the respective estimates 865 of the audio quality may require receiving the downloaded segments 855 (which are provided to the streaming clients), or representations thereof, from the content server 850. Comparing the estimates 865 of the audio quality for the two populations 810, 820 allows, for example, to infer information about the different capabilities of the different CDNs 830, 840. This information may be used to optimize content delivery to a population of streaming clients.
In the example of fig. 9, a content server 950 (content source) provides (audio) content 952 to a CDN 930, which CDN 930 provides content 932 to a first population 910 of streaming clients 915 and content 934 to a second population 920 of streaming clients 925. The streaming clients of both populations 910, 920 provide respective sets of playout related information 970, 980 to a quality assessment service 960, which quality assessment service 960 determines respective estimates 965 of the audio quality (or estimates of playout performance in general) for the populations 910, 920 by means of the techniques set forth above, based on the respective sets of playout related information 970, 980. Determining the respective estimates 965 of the audio quality may require receiving the downloaded segments 955 (which are provided to the streaming clients), or representations thereof, from the content server 950. Comparing the estimates 965 of the audio quality for the two populations 910, 920 of streaming clients allows to infer information about the different capabilities of the different populations, e.g., in case different delivery methods and/or playout methods are employed for/by the different populations. This information may be used to optimize content delivery to, and/or playout by, a population of streaming clients.
As a further application or use case, the proposed technique may be used to provide a feedback loop for edge processing of content. For example, if the content encoder is located at an edge of the network, the estimate of the audio quality (e.g., an estimated MUSHRA score) may be used to fine-tune the bitrate ladder for delivering audio content to the clients.
An example of a framework involving such a feedback loop is schematically illustrated in fig. 10. The content server 1050 (content source) provides (audio) content to an encoding and packaging coordination engine 1090 (encoding and packaging engine), which engine 1090 encodes and packages the content according to a set of one or more rules and provides the encoded and packaged content to points of presence (PoPs) in the CDN 1030 (or CDNs). The rules employed by the encoding and packaging engine 1090 may relate, for example, to maximizing a utility function and/or minimizing a cost function. For example, one or more rules may relate to minimizing the number of levels in a bit rate ladder while optimizing average (or worst-case) performance, and/or determining an optimal bit rate for maximizing average (or worst-case) performance.
The CDN 1030 provides (audio) content to a population 1010 of streaming clients 1015, which streaming clients 1015 in turn provide playout-related information 1070 to the quality assessment service 1060. The quality assessment service 1060 determines playout performance (e.g., average playout performance, worst-case playout performance, etc.) for the population 1010 of streaming clients 1015 by the techniques set forth above and provides an indication 1065 of the determined playout performance to the encoding and packaging engine 1090.
According to the techniques described above, playout performance may be related to the aforementioned estimation of audio quality or quantities derived therefrom.
Thus, in general, step S240 of the above-described method 200 may include outputting the determined estimate of the audio quality of the test audio signal (e.g., as playout performance) to a network node (e.g., the aforementioned encoding and packaging engine 1090) that performs encoding and/or packaging of the audio content, or to a network node associated therewith.
Based on the playout performance 1065, the encoding and packaging engine 1090 in the example of fig. 10 may then optimize the encoding and/or packaging of the content.
For example, in a framework as shown in the example of fig. 10, one or more of steps S1110 to S1130 of method 1100 shown in the flowchart of fig. 11 may be performed.
Step S1110 includes, or relates to, optimizing encoding and/or packaging based on an estimate of the audio quality of the test audio signal.
Step S1120 includes, or relates to, determining an optimal number of quality levels in a bitrate ladder for distribution over a content delivery network, based on an estimate of the audio quality of the test audio signal. This may involve, for example, inputting the estimate of the audio quality of the test audio signal to a utility function (e.g., a cost function) for determining the optimal number of quality levels.
Step S1130 includes, or relates to, determining a configuration of encoding tools and/or a set of encoding tools based on an estimate of the audio quality of the test audio signal.
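By way of non-limiting illustration, a ladder-pruning rule in the spirit of step S1120 may be sketched as follows. The function name, the threshold, and the "quality gain per added level" criterion are hypothetical stand-ins for the utility function mentioned above.

```python
def prune_ladder(levels, scores, min_gain=2.0):
    """Hypothetical illustration of step S1120: drop ladder levels whose
    estimated quality gain over the previous kept level is below min_gain.

    levels: bitrates sorted ascending; scores: estimated quality per level
    (e.g., estimated mean MUSHRA scores from the quality assessment service).
    """
    kept = [(levels[0], scores[0])]
    for lv, sc in zip(levels[1:], scores[1:]):
        if sc - kept[-1][1] >= min_gain:  # utility: quality gain of the added level
            kept.append((lv, sc))
    return [lv for lv, _ in kept]

# The 96 kbps level adds < 2 quality points over 64 kbps, so it is pruned.
print(prune_ladder([32, 64, 96, 128], [60.0, 75.0, 76.0, 85.0]))
```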
Apparatus for implementing a method according to the disclosure
Finally, while reference is primarily made above to methods in accordance with the present disclosure, the present disclosure is equally relevant to apparatus (e.g., computer-implemented apparatus) for performing the methods and techniques described throughout the present disclosure. An example of such an apparatus 2100, described in more detail below, is schematically illustrated in fig. 21. Such apparatus may implement, for example, the quality assessment service 150 or the (instrumented) streaming client 10. Depending on the use case and/or implementation, the apparatus (e.g., its processor 2110) may receive appropriate input data (e.g., among other things, indications of audio content processed by a streaming client, or playout related information). The apparatus 2100 (e.g., the processor 2110 thereof) may be adapted to perform the methods/techniques described throughout this disclosure (e.g., the method 200 of fig. 2, the method 400 of fig. 4A, the method 450 of fig. 4B, the method 700 of fig. 7, and/or the method 1100 of fig. 11), and to generate corresponding output data 1240 (e.g., an estimate of audio quality) depending on the use case and/or implementation.
The present disclosure is equally relevant to corresponding computer programs and computer-readable storage media.
Machine listener
One non-limiting example of an algorithm for implementing or employed by the quality assessment service described above is a so-called machine listener (e.g., a generative machine listener) as described below.
Broadly speaking, a machine listener (e.g., a generative machine listener) in accordance with an embodiment of this disclosure is a neural network (e.g., a deep neural network) trained to evaluate audio by comparing the audio to a related reference signal and providing an evaluation result, e.g., as a probability distribution of predicted listener scores, in a manner consistent with subjective listening tests such as the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) test.
In general, the listener scores, as obtained in a MUSHRA test, can be predicted by a system with the signal under test and the reference signal as inputs. An example of such a system comprising a neural network is given in [5], which is hereby incorporated by reference in its entirety.
There will typically be some degree of variability in the scores assigned by different listeners for a given pair of input signals. It has been found that there is value in capturing this aspect of the data for automatic estimation of the quality of experience in entertainment delivery systems. In an actual subjective MUSHRA test, the mean and standard deviation of the listener scores obtained from different listeners may be calculated, and the standard deviation may then be converted into a confidence interval given the number of listeners and a statistical model.
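By way of non-limiting illustration, the conversion from individual listener scores to a mean and confidence interval may be sketched as follows. The fixed critical value is a simplifying stand-in for the value prescribed by the chosen statistical model (e.g., Student's t for the given number of listeners); the function name and example scores are hypothetical.

```python
import math

def mushra_ci(scores, t_value=2.0):
    """Mean and (approximate) confidence interval from individual listener scores.

    t_value is a stand-in for the critical value of the chosen statistical
    model for the given number of listeners.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = t_value * math.sqrt(var) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)

mean, ci = mushra_ci([78, 82, 80, 76, 84])  # five hypothetical listener scores
```

Note how the interval shrinks with the square root of the number of listeners, which is why the number of listeners matters when converting a standard deviation into a confidence interval.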
The inventors have found that training a neural network to directly output both the mean and the confidence interval is challenging. One possible alternative for quantifying the prediction variability may be to use a bootstrap approach that trains multiple models targeting mean scores on subsets of randomly sampled data. Each of the trained models will then produce slightly different predictions. The variability of these predictions is then quantified into confidence levels. However, the drawbacks of the bootstrap method are twofold. First, the bootstrap approach entails high complexity, requiring as many models as the number of listeners to be simulated. Second, there is a risk of modeling the prediction variability rather than the listener score distribution.
Another possible alternative for quantifying the prediction variability may be to consider separate modeling of bias (see [1], [2]) or of bias and inconsistency (see [3], [4]) across the individual listeners of the signal under test. The main application of the latter approach is to pre-screen listeners with outlier behavior. A common inconvenience of these methods is the need to track listener identity in the dataset.
The present disclosure seeks to provide improved techniques for quantifying the predictive variability of subjective listening tests. At the application level (i.e., at the time of inference), a trained model (e.g., a generative model) according to the present disclosure provides a distribution of scores from which the mean score as well as standard deviation and/or Confidence Interval (CI) are easily extracted for any number of listeners.
At the training stage, the model according to the present disclosure utilizes individual listener scores, unlike prior methods that use mean subjective scores as targets for training. This has been found to simplify preprocessing and, for listening test data where the number of listeners varies, to allow each dataset to influence training in proportion to the human effort invested. In addition, the maximum likelihood principle can be used for parameter estimation.
The technique according to the invention has been found to have the following advantages. The trained model (e.g., generative model) achieves similar performance on the predicted mean as a typical non-generative model, and also predicts confidence intervals. In addition, the trained model is more robust to conditions that are not seen in typical listening tests.
Fig. 12 illustrates a comparison between a conventional model (e.g., a DNN) for predicting mean listening scores (e.g., MUSHRA scores) and a model (e.g., a DNN implementing a generative machine listener) according to the disclosed embodiments. Given reference and encoded audio (or features thereof, such as a gammatone spectrogram), the conventional model (non-generative method) shown on the left-hand side predicts a mean subjective listening score (e.g., a MUSHRA score). The generative machine listener model shown on the right-hand side, on the other hand, provides a distribution of listening scores (e.g., MUSHRA scores). In training, the present disclosure proposes to utilize individual listening scores. Without intended limitation, the architecture of the DNN implementing a generative machine listener may be the architecture described in [5], except that the output stage is configured to provide more than one output, e.g., suitable for representing a distribution of listening scores (e.g., MUSHRA scores).
Description of example embodiments
Given the original signal x and the signal under test y, a generative machine listener model (or a DNN implementing the same) gives an indication of the probability distribution of the (subjective) listening score (e.g., MUSHRA score) s for y, e.g., as a parameterized probability density
p_θ(s | x, y). (1)
The parameters θ of the model are trained by the principle of maximum likelihood. Then, a listening test with N listeners can be simulated by sampling the model N times. The Negative Log Likelihood (NLL) loss for training of the model parameters θ for a listener score value s in the dataset may be −log p_θ(s | x, y).
Typically, at the training stage, the inputs are given by training data items, each indicating a respective value of the listening score s. The loss function depends on an indication of the listening score s. The training data items may each also indicate a representation of the audio signal (signal under test y) and a representation of a reference audio signal (original signal x) of the audio signal. Here, the representation of the audio signal y and the representation of the reference audio signal x may relate to gammatone spectrograms, for example. Each training data item may be obtained by performing a standardized listening test (e.g., a MUSHRA test) on the test signal y and the corresponding reference signal x to produce a listening score s. Performing such a test multiple times (e.g., with different listeners) will produce multiple training data items, with scores denoted, say, s₁, …, s_N. The test signal y and the corresponding reference signal x may, for example, be obtained from a suitable audio library. Here and in the rest of the disclosure, it is understood that the reference signal corresponds to an uncoded signal, whereas the test audio signal corresponds to an encoded audio signal (e.g., a signal obtained after encoding, decoding, and, if necessary, time alignment with the reference signal to compensate for encoding delays).
If the representations of the test signal y and the reference signal x relate to gammatone spectrograms (typically L, R, M, and S spectrograms), then four training data items (e.g., one for each of the L, R, M, and S spectrograms) may be generated for each actual sound signal.
Fig. 13 is a flow chart illustrating an example of a method 1300 of configuring (e.g., training) a DNN for estimating an indication of subjective listening scores of an audio signal. The method 1300 includes steps S1310 and S1320.
It is understood that DNN implements a model under consideration (e.g., a generative model). For example, DNN may implement the aforementioned generative machine listener.
Additionally, the listening score to be estimated or predicted by the DNN may be a score according to a predefined (e.g., standardized) listening test. The listening test may apply predefined test metrics and/or test scenarios. One example of such a listening test is a MUSHRA listening test.
At step S1310, an output stage of DNN is provided that generates an indication of the listening score.
At step S1320, the DNN is trained in (at least) a training round among a plurality of training rounds.
The method 1400 as illustrated by the flowchart of fig. 14 is an example of a possible implementation of training DNNs in a training round among a plurality of training rounds at step S1320. The method 1400 includes steps S1410 to S1440.
At step S1410, one or more training data items are input. For example, a small batch of training data items (e.g., 8 training data items) may be entered. As described above, each training data item indicates a respective value of the listening score s. Also as described above, each training data item may also indicate a representation of an audio signal (signal y under test) and a representation of a reference audio signal (original signal x) of the audio signal.
At step S1420, respective indications of the listening scores are determined based on the one or more training data items. These indications may be determined, for example, based on a representation of the audio signal and a representation of the reference audio signal.
In some embodiments, the indication of the listening score may relate to a probability distribution (e.g., a probability density function) of the listening score, wherein the output stage is adapted to generate the probability distribution of the listening score. This probability distribution (e.g., according to equation (1)) may emulate the listening scores obtained by multiple (independent) listening tests on the audio signal. In addition, the probability distribution may be parameterized by two or more parameters. Examples of possible parameterizations of the probability distribution will be described below.
If the indication of the listening score relates to a probability distribution of the listening score, determining the respective indications of the listening scores based on the one or more training data items may comprise determining respective parameters of the probability distribution based on the one or more training data items. Here, determining the parameters of the probability distribution may be based at least in part on the values of the subjective listening scores included in the training data items. In addition, determining the parameters of the probability distribution may be based on the current state of the DNN, e.g., the current values of the internal parameters of the DNN.
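As a toy illustration of a two-parameter output stage (all weights and the feature vector are hypothetical, standing in for the pooled features of the last DNN layer):

```python
import math

def output_stage(features, w_mu, w_sigma):
    """Toy two-head output stage: one linear head for mu, one for sigma,
    with a softplus ensuring sigma > 0. Weights are hypothetical."""
    mu = sum(w * f for w, f in zip(w_mu, features))
    raw = sum(w * f for w, f in zip(w_sigma, features))
    sigma = math.log1p(math.exp(raw))  # softplus, strictly positive
    return mu, sigma

mu, sigma = output_stage([1.0, 2.0], w_mu=[3.0, 1.0], w_sigma=[0.0, 0.0])
```

The softplus (or a similar positivity constraint) is a common design choice so that the scale/variance parameter of the predicted distribution stays valid during training.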
At step S1430, respective loss values for the one or more training data items are determined by evaluating the loss function. This loss function depends on an indication of the listening score.
In some embodiments, if the indication of the listening score relates to a probability distribution of listening scores, the loss function may depend on a parameter of the distribution. An example of a loss function will be described below.
At step S1440, one or more internal parameters of the DNN are adjusted based on the determined loss values, for example, by using well-known regression and back propagation techniques. The internal parameters of the DNN may be model parameters, such as coefficients (e.g., filter coefficients) of the multiple layers of the DNN.
If multiple training data items are entered per training round, the adjustment of the internal parameters may be based on, for example, an aggregation of loss values of the training data items, such as a mean or average thereof.
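To make the loss aggregation and parameter adjustment concrete, here is a deliberately tiny maximum-likelihood sketch: a single parameter μ with fixed unit variance stands in for the DNN's internal parameters, and the listener scores are hypothetical.

```python
def fit_mu_by_nll(scores, lr=0.1, rounds=200):
    """Gradient descent on the batch-mean Gaussian NLL (sigma fixed at 1).

    Mirrors steps S1410-S1440 in miniature: per-item loss values are
    aggregated by their mean, and the parameter is adjusted based on the
    aggregated loss. Converges to the mean listener score.
    """
    mu = 0.0
    for _ in range(rounds):
        # Gradient d/dmu of mean_i 0.5 * (s_i - mu)^2 over the batch
        grad = sum(mu - s for s in scores) / len(scores)
        mu -= lr * grad
    return mu

mu_hat = fit_mu_by_nll([70.0, 80.0, 90.0])
```

With a fixed variance, minimizing the aggregated NLL over the batch drives μ to the sample mean of the scores, which is the maximum likelihood estimate in this degenerate case.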
As mentioned above, training the DNN, for example via method 1400, may be based on the maximum likelihood principle. Thus, the loss function used for training (e.g., the loss function evaluated at step S1430 of method 1400) may relate to a Negative Log Likelihood (NLL) loss. Also as mentioned above, the negative log likelihood loss can be given by:
NLL(θ) = −log p_θ(s | x, y), (2)
wherein p_θ(s | x, y) is a probability density function (as an example of a probability distribution) of the test score s given a representation of the audio signal y and a representation of a reference audio signal x of the audio signal y, and θ indicates the internal parameters of the DNN.
A first non-limiting example of the probability density function p_θ(s | x, y) is a Gaussian distribution parameterized by mean μ and variance σ². The NLL loss can then be given by, for example:
NLL(θ) = (s − μ)² / (2σ²) + log σ + (1/2) log(2π), (3)
wherein training proceeds via the model outputs for μ and σ.
Thus, the probability distribution at step S1420 of method 1400 may relate to a Gaussian distribution parameterized by mean μ and variance σ². The loss function L(μ, σ) can then be given by L(μ, σ) = (s − μ)² / (2σ²) + log σ + c, where c is a constant and s is the subjective listening score. The constant c may, for example, be given by c = (1/2) log(2π), in accordance with equation (3).
A second non-limiting example of the probability density function p_θ(s | x, y) is a logistic distribution parameterized by mean μ and scale a. The NLL loss in this case can be given by, for example:
NLL(θ) = (s − μ)/a + 2 log(1 + e^(−(s − μ)/a)) + log a, (4)
wherein training proceeds via the model outputs for μ and a.
Thus, the probability distribution at step S1420 of method 1400 may relate to a logistic distribution parameterized by mean μ and scale a. The loss function L(μ, a) can then be given by L(μ, a) = (s − μ)/a + 2 log(1 + e^(−(s − μ)/a)) + log a + c, where c is a constant and s is the subjective listening score. The constant c may, for example, be zero, in accordance with equation (4).
Models with more than two parameters, such as mixtures of Gaussian or logistic distributions, or even a categorical distribution, may have the ability to model a multi-modal listener score distribution. On the other hand, they have the potential disadvantage of requiring more data for successful training.
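For illustration, a mixture NLL can be sketched as follows (a two-component Gaussian mixture; the weights and component parameters are hypothetical, and this is our sketch rather than a parameterization prescribed by the disclosure):

```python
import math

def gaussian_pdf(s, mu, sigma):
    """Gaussian density, the per-component building block."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_nll(s, components):
    """NLL under a Gaussian mixture; components = [(weight, mu, sigma), ...],
    with the weights summing to one."""
    density = sum(w * gaussian_pdf(s, mu, sd) for w, mu, sd in components)
    return -math.log(density)

# A bimodal listener-score model: one mode near 60, another near 90
loss = mixture_nll(75.0, [(0.5, 60.0, 5.0), (0.5, 90.0, 5.0)])
```

A single-component mixture reduces exactly to the Gaussian NLL of equation (3), so the mixture form strictly generalizes the two-parameter case.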
Upon inference, an estimate of an indication of subjective listening scores of the audio signal may be determined using an appropriately trained DNN (e.g., a DNN trained as described above). As above, the listening score is assumed to be a score according to a predefined listening test. In addition, DNN is generally assumed to include an input stage for receiving a representation of an audio signal and a representation of a reference audio signal of the audio signal, a plurality of layers for performing processing based on the representation of the audio signal and the representation of the reference audio signal, and an output stage coupled to a last layer of the plurality of layers for generating an indication of a listening score. Here, the processing by the plurality of layers may also be based on the current state of the DNN, for example, the current value of the internal parameter of the DNN.
An example of a corresponding method 1500 using this DNN is illustrated in the flow chart of fig. 15. The method 1500 includes steps S1510 and S1520.
At step S1510, a representation of the audio signal and a representation of the reference audio signal are input to an input stage of the DNN.
At step S1520, a representation of an indication of the listening score is determined based on the output of the output stage of the DNN.
As above, the indication of the listening score may be related to a probability distribution of the listening score, wherein the output stage of the DNN is adapted to generate the probability distribution of the listening score. This probability distribution can be seen as emulating a listening score obtained by a plurality of listening tests to the audio signal. In addition, the probability distribution may be parameterized by two or more parameters of the probability distribution. Thus, the representation of the indication of the listening score determined at step S1520 may be related to a parameter of the probability distribution, for example.
Moreover, where the output of the output stage of the DNN is available, a representation of the probability distribution may be determined, for example, by determining at least one of a mean, a standard deviation, and a confidence interval from the output of the output stage.
For example, the confidence interval may be determined based on the output of the output stage and the number of listeners whose listening test is to be emulated. Referring to the above example of a Gaussian parameterization of the probability distribution, once the parameter σ has been determined, the 95% confidence interval (CI) for the mean score of N emulated listeners can be calculated as
CI = μ ± 1.96 σ/√N. (5)
It is understood that a similar determination may be applied in the case of a parameterization of the probability distribution as a logistic distribution with scale a.
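Both cases can be sketched together in Python (the 1.96 factor follows the normal approximation of equation (5); converting the logistic scale a to a standard deviation uses the known identity σ = aπ/√3):

```python
import math

def ci95_halfwidth(sigma, n_listeners):
    """95% CI half-width for the mean over n emulated listeners (eq. (5))."""
    return 1.96 * sigma / math.sqrt(n_listeners)

def logistic_std(a):
    """Standard deviation of a logistic distribution with scale a."""
    return a * math.pi / math.sqrt(3.0)

hw_gauss = ci95_halfwidth(sigma=10.0, n_listeners=25)         # Gaussian case
hw_logis = ci95_halfwidth(logistic_std(3.0), n_listeners=25)  # logistic case
```

The CI is then simply the predicted mean plus/minus the returned half-width.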
In addition to the above-described methods, the present disclosure also relates to DNNs for estimating an indication of subjective listening scores of an audio signal. Again, the listening score may be, for example, a score according to a predefined listening test (such as a MUSHRA test). Such a DNN may include an input stage for receiving a representation of the audio signal (e.g., one or more gammatone spectrograms) and a representation of a reference audio signal of the audio signal (e.g., one or more gammatone spectrograms), multiple layers for performing processing based on the representation of the audio signal and the representation of the reference audio signal, and an output stage for generating an indication of the listening score. It is understood that a first one of the multiple layers is coupled to the input stage and a last one of the multiple layers is coupled to the output stage. The processing by the multiple layers may also be based on the current state of the DNN, e.g., the current values of internal parameters of the DNN.
It is understood that the DNN may be implemented, for example, by any suitable computing system (such as the apparatus shown in fig. 21).
In addition, the DNN may have been configured (e.g., trained) by training the DNN according to method 1400 described above.
In particular, the DNN may have been trained by inputting one or more training data items in a training round among a plurality of training rounds, each training data item indicating a respective value of the listening score, determining respective indications of the listening score based on the one or more training data items, determining respective loss values of the one or more training data items by evaluating a loss function, wherein the loss function depends on the indication of the listening score, and adjusting one or more internal parameters of the DNN based on the determined loss values.
As above, the indication of the listening score may relate to a probability distribution of the listening score, wherein the output stage of the DNN is adapted to generate the probability distribution of the listening score. This probability distribution may emulate the listening scores obtained by a plurality of listening tests on the audio signal and may be parameterized by two or more parameters. For example, as described above, the probability distribution may relate to a Gaussian distribution parameterized by mean μ and variance σ², or to a logistic distribution parameterized by mean μ and scale a.
When the indication of the listening score is related to a probability distribution of the listening score, determining a respective indication of the listening score based on the one or more training data items may include determining a respective parameter of the probability distribution based on the one or more training data items (e.g., based at least in part on a value of the subjective listening score). Also in this case the loss function will depend on the parameters of the distribution. It is also understood that the parameter determining the probability distribution may be based on the current state of the DNN, e.g. the current value of an internal parameter of the DNN.
Simulation results on example test set
Two stereo listening tests are considered as the test set. One listening test evaluates a low bit rate codec and the other a high bit rate codec. A strategy has been devised for selecting the best model among several models from the training rounds.
The following factors may be considered during model selection:
Stability of the training process. For example, models trained with a logistic distribution generally exhibit smoother training and validation loss decay than models trained with a Gaussian distribution under the same settings. However, if the training process is fine-tuned using techniques such as gradient clipping, the Gaussian distribution may still be a promising option.
Pearson Correlation Coefficient (PCC) between the predicted mean MUSHRA scores and the actual mean MUSHRA listening test scores. A higher PCC (near 1) is preferred.
NLL loss for the training and validation sets. For example, models trained with Gaussian distributions exhibit significantly lower NLL loss than models trained with logistic distributions.
Considering the aforementioned aspects, several models (from different rounds, i.e., from different stages of training; for Gaussian or logistic distributions) were chosen that have the highest PCC on the validation set, lower training NLL loss, and moderately lower validation loss. Among models with similar PCC scores, those with lower training NLL loss (typically models generated at later rounds) but only moderately lower validation NLL loss are retained. The reason for the latter is that the model with the smallest validation loss has been found not to necessarily exhibit the best performance when predicting confidence intervals on the test set.
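The PCC criterion used above can be computed as sketched below (the predicted and actual mean scores in the example are hypothetical):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical predicted vs. actual mean MUSHRA scores
pcc = pearson([82.0, 60.0, 35.0, 91.0], [80.0, 63.0, 30.0, 95.0])
```

A value near 1 indicates that predicted means track the listening-test means almost linearly.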
Fig. 16 is a plot showing an example of mean NLL loss over the two listening tests for Gaussian and logistic distributions. Overall, the logistic model showed a higher NLL loss on the test set than the Gaussian model. However, a lower loss does not necessarily mean that the model is better (e.g., the logistic model visually exhibits a closer fit to the true values).
For the plots shown in figs. 17-20, each section has a reference, a 3.5 kHz anchor, and a 7 kHz anchor, followed by different coded representations.
Fig. 17 is a plot showing example results of the stereo low bit rate test. In particular, the plot shows the accuracy of the generative machine listener's predicted mean MUSHRA scores for different categories of audio (speech, music, and mixed content) when trained with logistic and Gaussian distributions.
Fig. 18 is another plot showing example results of the stereo low bit rate test. Specifically, the plot shows the accuracy of the generative machine listener's predicted CIs with 44 listeners for different categories of audio (speech, music, and mixed content) when trained with logistic and Gaussian distributions.
Fig. 19 is a plot showing example results of the stereo high bit rate test. In particular, the plot shows the accuracy of the generative machine listener's predicted mean MUSHRA scores for different categories of audio (speech, music, and mixed content) when trained with logistic and Gaussian distributions.
Finally, fig. 20 is a plot showing example results of the stereo high bit rate test. In particular, the plot shows the accuracy of the generative machine listener's predicted CIs with 28 listeners for different categories of audio (speech, music, and mixed content) when trained with logistic and Gaussian distributions.
The Pearson Correlation (PC) evaluates the linear relationship between two continuous variables. For the examples shown in the above plots of figs. 17-20, in the low bit rate test, the trend of the CIs predicted by the model trained with the logistic distribution (PC = 0.8190) is closer to the true values (i.e., the CIs from the listening test) than that of the model trained with the Gaussian distribution (PC = 0.6733). For high bit rates, the logistic distribution (PC = 0.6461) is comparable to the Gaussian distribution (PC = 0.6700).
For the predicted mean, the two models perform similarly well in the above example. For the low bit rate test, the model trained with the logistic distribution has PC = 0.9172 and the Gaussian distribution has PC = 0.9183. For the high bit rate test, the model trained with the logistic distribution has PC = 0.9316 and the Gaussian distribution has PC = 0.9375.
The Spearman Correlation (SC) evaluates monotonic relationships. The Spearman correlation coefficient is based on the ranked values of each variable, rather than the raw data; it measures rank preservation. For the low bit rate test, in the above example, the model trained with the logistic distribution has SC = 0.8991 and the Gaussian distribution has SC = 0.8716. For the high bit rate test, the model trained with the logistic distribution has SC = 0.9297 and the Gaussian distribution has SC = 0.9360.
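The rank-based computation can be sketched as follows (assuming no tied values, for simplicity; real data with ties would need averaged ranks):

```python
import math

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0.0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    m = (n + 1) / 2.0  # mean of the ranks 1..n
    cov = sum((a - m) * (b - m) for a, b in zip(rx, ry))
    vx = sum((a - m) ** 2 for a in rx)
    vy = sum((b - m) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# A perfectly monotonic but non-linear relation still yields SC = 1
sc = spearman([1.0, 2.0, 3.0, 4.0], [1.0, 8.0, 27.0, 64.0])
```

This illustrates the difference from Pearson correlation: only the ordering of the values matters, not their spacing.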
It may be noted that in the subjective scores, the reference, the low-pass anchors, and bit rates generally associated with high quality are rated with low CIs, while bit rates associated with lower quality are rated with higher CIs. Unlike the previously mentioned bootstrap approach, in which higher CIs are associated with denser data points in the training set (and vice versa), the techniques of the present disclosure model the diversity of listener scores.
Apparatus for implementing a method according to the disclosure
Finally, while reference is primarily made above to methods in accordance with the present disclosure, the present disclosure equally relates to an apparatus (e.g., a computer-implemented apparatus) for performing the methods and techniques described throughout the present disclosure. An example of such an apparatus 2100 is schematically illustrated in fig. 21. Such an apparatus 2100 may implement, for example, the deep neural network (machine listener) described above. The apparatus 2100 includes a processor 2110 and a memory 2120 coupled to the processor 2110. The memory 2120 may store instructions for the processor 2110. The processor 2110 may, depending on the use case and/or implementation, receive appropriate input data (e.g., appropriate training data during the training phase, or appropriate test and reference audio signals at the time of inference). The processor 2110 may be adapted to perform the methods/techniques described throughout this disclosure (e.g., the method 1300 of fig. 13, the method 1400 of fig. 14, or the method 1500 of fig. 15), and to generate corresponding output data 1240 (e.g., an indication of a listening score) depending on the use case and/or implementation. For example, apparatus 2100 may implement the method of training DNNs described above, or it may implement the (trained) DNN described above.
The present disclosure is equally relevant to corresponding computer programs and computer-readable storage media.
Interpretation of the drawings
Aspects of the systems described herein may be implemented in a suitable computer-based audio processing network environment (e.g., a server or cloud environment) for processing digital or digitized audio files. Portions of the adaptive audio system may include one or more networks including any desired number of individual machines, including one or more routers (not shown) for buffering and routing data communicated between computers. Such networks may be constructed on a variety of different network protocols and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
One or more of the components, blocks, processes, or other functional components may be implemented by a computer program that controls the execution of a processor-based computing device of the system. It should also be noted that the various functions disclosed herein may be described using hardware, firmware, and/or as any number of combinations of data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media that may contain such formatted data and/or instructions include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic, or semiconductor storage media.
In particular, it should be understood that embodiments may include hardware, software, and electronic components or modules, which, for purposes of discussion, may be illustrated and described as if most of the components were implemented solely in hardware. However, those skilled in the art will appreciate, based on a reading of this detailed description, that in at least one embodiment, the electronic-based aspects may be implemented in software (e.g., stored on a non-transitory computer-readable medium) executable by one or more electronic processors, such as microprocessors and/or application specific integrated circuits ("ASICs"). It should be noted, therefore, that embodiments may be implemented using a number of hardware and software based devices, as well as a number of different structural components. For example, the systems, services, clients, nodes, etc. described in the context of fig. 1, 3A, 3B, 5, 8, 9, 10, 12, and/or 21 above may include or be implemented by one or more electronic processors, one or more computer-readable medium modules, one or more input/output interfaces, and various connections (e.g., a system bus) connecting the various components.
While one or more implementations have been described by way of example and with respect to particular embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements, as will be apparent to those skilled in the art. The scope of the appended claims is therefore to be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," or "having" and variations thereof is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Unless specified or limited otherwise, the terms "mounted," "connected," "supported," and "coupled" and variations thereof are used broadly and encompass both direct and indirect mountings, connections, supports, and couplings.
Enumerated example embodiments
Various aspects and implementations of the present disclosure may also be appreciated from the example embodiments (EEEs) listed below, which are not the claims.
EEE1. A method for evaluating playout performance of an adaptive streaming of a streaming client, wherein the method uses an objective quality assessment algorithm simulating an intrusive quality test, wherein the method is implemented on a different network node than the client, wherein the method comprises receiving playout related metadata from the client, using the metadata to reconstruct a description of test signals required by the objective quality assessment algorithm, reconstructing a description of reference signals required by the objective quality assessment algorithm, and calculating performance estimates and distributing them to network nodes other than the node where the method operates.
EEE2. The method of EEE1, wherein the objective quality assessment algorithm uses a set of pre-trained models and performs selection of the model based on the metadata received from the client.
EEE3. The method of EEE1 or EEE2, wherein the metadata received from the client includes information that facilitates reconstruction of the description of the test signal required by the objective quality assessment tool at the network node implementing the method, e.g., segment names and bit rates, and/or a representation of a set of spectrograms calculated based on the segment content (as required by the objective quality assessment algorithm).
EEE4. The method of EEE3, wherein the metadata received from the client includes information about the playout device and/or playout conditions, such as an indication of the playout device (headphones, a sound bar, or discrete speakers), characteristics of the playout conditions (e.g., signal-to-noise ratio), and/or others.
EEE5. The method of any of the foregoing EEEs, wherein the estimate provided by the method is used to compare at least two different populations of clients.
EEE6. The method of any of the preceding EEEs, wherein the performance estimate calculated by the method is provided to a service that optimizes encoding/packaging of content.
EEE7. The method of EEE6, wherein the performance estimates are used as inputs to a utility function that is used to determine an optimal number of quality levels in a bit rate ladder to be distributed over the points of presence (PoPs) of a CDN.
EEE8. The method of EEE6, wherein the performance estimates are used to determine the tuning of a content encoder (e.g., configuration, set of encoding tools).
EEE-A1. A method of configuring a deep neural network DNN for estimating an indication of subjective listening scores of an audio signal, wherein the listening scores are scores according to a predefined listening test, the method comprising:
providing an output stage of the DNN that generates an indication of the listening score; and
training the DNN by performing the following operations in a training round among a plurality of training rounds:
inputting one or more training data items, each training data item indicating a respective value of the listening score;
determining respective indications of the listening scores based on the one or more training data items;
determining respective loss values for the one or more training data items by evaluating a loss function, wherein the loss function depends on the indication of the listening score; and
adjusting one or more internal parameters of the DNN based on the determined loss values.
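The training procedure of EEE-A1 can be illustrated with a deliberately minimal sketch. Here the DNN is collapsed to just two trainable internal parameters, the mean and variance of a Gaussian over the listening score, and each training round evaluates a Gaussian NLL loss (cf. EEE-A6) and then adjusts the parameters with a damped moment-matching step, whose fixed point coincides with the NLL minimizer; the score values, learning rate, and round count are illustrative assumptions, not from the source.

```python
import math

def gaussian_nll(s, mu, sigma):
    # Negative log-likelihood of score s under N(mu, sigma^2), constant dropped.
    return math.log(sigma) + (s - mu) ** 2 / (2 * sigma ** 2)

def training_round(scores, mu, var, lr=0.2):
    # One training round: evaluate the loss for each training data item,
    # then adjust the internal parameters (a damped moment-matching step;
    # its fixed point is the minimizer of the average NLL).
    losses = [gaussian_nll(s, mu, math.sqrt(var)) for s in scores]
    mu += lr * (sum(scores) / len(scores) - mu)
    var += lr * (sum((s - mu) ** 2 for s in scores) / len(scores) - var)
    return mu, var, sum(losses) / len(losses)

scores = [78.0, 82.0, 75.0, 80.0, 85.0]  # hypothetical listening-score values
mu, var = 50.0, 100.0                    # initial internal parameters
for _ in range(100):                     # a plurality of training rounds
    mu, var, avg_loss = training_round(scores, mu, var)
```

After the rounds, mu approaches the sample mean of the scores and var their variance, i.e., the maximum-likelihood fit to the training data items.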
EEE-A2. The method according to EEE-A1, wherein the indication of the listening score relates to a probability distribution of the listening score, the output stage being adapted to generate the probability distribution of the listening score, wherein the probability distribution simulates listening scores obtained by a plurality of listening tests on the audio signal, and wherein the probability distribution is parameterized by two or more parameters of the probability distribution;
wherein determining the respective indications of the listening scores based on the one or more training data items comprises determining respective parameters of the probability distribution based on the one or more training data items; and
wherein the loss function depends on the parameters of the probability distribution.
EEE-A3. The method according to EEE-A1 or EEE-A2, wherein training the DNN is based on the maximum likelihood principle.
EEE-A4. The method according to any one of the preceding EEE-A, wherein the loss function relates to a negative log-likelihood (NLL) loss.
EEE-A5. The method according to EEE-A4 when dependent on EEE-A2, wherein the negative log-likelihood loss is given by NLL = −log p(s | x, y; θ), where p(s | x, y; θ) is the probability distribution of the test score s given the representation of the audio signal y and the representation of the reference audio signal x of the audio signal y, and θ indicates the internal parameters of the DNN.
EEE-A6. The method according to EEE-A2 or any one of EEE-A3 to EEE-A5 when dependent on EEE-A2, wherein the probability distribution relates to a Gaussian distribution parameterized by mean μ and variance σ², and wherein the loss function L is given by L = log σ + (s − μ)²/(2σ²) + c, where c is a constant and s is the subjective listening score.
EEE-A7. The method according to EEE-A2 or any one of EEE-A3 to EEE-A5 when dependent on EEE-A2, wherein the probability distribution relates to a logistic distribution parameterized by mean μ and scale a, and wherein the loss function L is given by L = log a + (s − μ)/a + 2 log(1 + exp(−(s − μ)/a)) + c, where c is a constant and s is the subjective listening score.
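Written out as code, the two losses of EEE-A6 and EEE-A7 are short one-liners. In this sketch the Gaussian constant is made explicit as c = ½·log(2π), and for the logistic loss the constant is taken as zero, so each expression equals the exact negative log-density of the respective distribution:

```python
import math

def gaussian_nll(s, mu, sigma):
    # L = log(sigma) + (s - mu)^2 / (2*sigma^2) + c, with c = 0.5*log(2*pi)
    c = 0.5 * math.log(2 * math.pi)
    return math.log(sigma) + (s - mu) ** 2 / (2 * sigma ** 2) + c

def logistic_nll(s, mu, a):
    # L = log(a) + (s - mu)/a + 2*log(1 + exp(-(s - mu)/a))
    z = (s - mu) / a
    return math.log(a) + z + 2 * math.log1p(math.exp(-z))
```

For a fixed observed score s, both losses are smallest when the predicted mean μ equals s, so minimizing either loss pulls the predicted distribution toward the subjective scores.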
EEE-A8. The method according to any one of the preceding EEE-A, wherein each training data item further indicates a representation of the audio signal and a representation of a reference audio signal of the audio signal.
EEE-A9. The method of EEE-A8, wherein the representation of the audio signal and the representation of the reference audio signal relate to a gammatone spectrogram.
EEE-A10. The method according to any one of the preceding EEE-A, wherein the predefined listening test is a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening test.
EEE-A11. The method according to any of the preceding EEE-A, wherein said DNN implements a generative model.
EEE-A12. A method of estimating an indication of a subjective listening score of an audio signal using a deep neural network (DNN), wherein the listening score is a score according to a predefined listening test,
wherein the DNN comprises:
an input stage for receiving a representation of the audio signal and a representation of a reference audio signal of the audio signal;
a plurality of layers for performing processing based on the representation of the audio signal and the representation of the reference audio signal; and
an output stage for generating an indication of the listening score; and
wherein the method comprises:
inputting the representation of the audio signal and the representation of the reference audio signal to the input stage; and
determining the indication of the listening score based on an output of the output stage.
EEE-A13. The method of EEE-A12, wherein the indication of the listening score relates to a probability distribution of the listening score, the output stage of the DNN being adapted to generate the probability distribution of the listening score, wherein the probability distribution simulates listening scores obtained by a plurality of listening tests on the audio signal, and wherein the probability distribution is parameterized by two or more parameters of the probability distribution.
EEE-A14. The method of EEE-A13 wherein determining a representation of the probability distribution includes determining at least one of a mean, a standard deviation, and a confidence interval from the output of the output stage.
EEE-A15. The method according to EEE-A14, wherein the confidence interval is determined based on the output of the output stage and the number of listeners of the listening test to be simulated.
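One plausible reading of EEE-A15 (an assumption for illustration, not mandated by the text): if the output stage predicts a per-listener distribution with mean μ and standard deviation σ, and the reported score is the mean over N simulated listeners, a 95% confidence interval follows from the standard error σ/√N:

```python
import math

def confidence_interval(mu, sigma, n_listeners, z=1.96):
    # 95% confidence interval for the mean listening score when the mean of
    # n_listeners independent simulated listener scores (each with standard
    # deviation sigma) is reported; z = 1.96 for the 95% level.
    half_width = z * sigma / math.sqrt(n_listeners)
    return mu - half_width, mu + half_width

lo, hi = confidence_interval(mu=80.0, sigma=6.0, n_listeners=16)
```

The interval tightens as the number of simulated listeners grows, which is exactly why the listener count enters the determination of the confidence interval.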
EEE-A16. The method according to any one of EEE-A13 to EEE-A15, wherein the probability distribution relates to a Gaussian distribution parameterized by mean μ and variance σ², or to a logistic distribution parameterized by mean μ and scale a.
EEE-A17. The method according to any one of EEE-A13 to EEE-A16, wherein the representation of the audio signal and the representation of the reference audio signal relate to a gammatone spectrogram.
EEE-A18. The method according to any one of EEE-A13 to EEE-A17, wherein the predefined listening test is a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening test.
EEE-A19. A deep neural network (DNN) for estimating an indication of a subjective listening score of an audio signal, wherein the listening score is a score according to a predefined listening test, the DNN comprising:
an input stage for receiving a representation of the audio signal and a representation of a reference audio signal for the audio signal;
a plurality of layers for performing processing based on the representation of the audio signal and the representation of the reference audio signal; and
an output stage for generating an indication of the listening score.
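A toy sketch of the EEE-A19 structure in plain Python. All sizes, activations, and the (μ, σ) output parameterization are illustrative assumptions; a real implementation would use a deep-learning framework and many layers:

```python
import math
import random

random.seed(0)

def layer(x, w, b):
    # Fully connected layer with tanh activation.
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class ListeningScoreDNN:
    # Stand-in for the EEE-A19 structure: an input stage taking the two
    # representations, a hidden layer (a real DNN would have several), and
    # an output stage emitting the distribution parameters (mu, sigma).
    def __init__(self, feat_dim=8, hidden=16):
        self.w1 = rand_matrix(hidden, 2 * feat_dim)
        self.b1 = [0.0] * hidden
        self.w2 = rand_matrix(2, hidden)
        self.b2 = [0.0, 0.0]

    def forward(self, signal_repr, reference_repr):
        x = list(signal_repr) + list(reference_repr)   # input stage
        h = layer(x, self.w1, self.b1)                 # plurality of layers
        out = [sum(wi * hi for wi, hi in zip(row, h)) + bi
               for row, bi in zip(self.w2, self.b2)]   # output stage
        mu = 100.0 / (1.0 + math.exp(-out[0]))         # score mapped to (0, 100)
        sigma = math.log1p(math.exp(out[1]))           # softplus keeps sigma > 0
        return mu, sigma

dnn = ListeningScoreDNN()
mu, sigma = dnn.forward([0.1] * 8, [0.2] * 8)
```

The squashing of μ into (0, 100) reflects a MUSHRA-style score range, and the softplus on σ guarantees a valid distribution parameter whatever the raw output is.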
EEE-A20. The DNN according to EEE-A19, wherein the DNN has been configured by training the DNN by performing the following operations in a training round among a plurality of training rounds:
inputting one or more training data items, each training data item indicating a respective value of the listening score;
determining respective indications of the listening scores based on the one or more training data items;
determining respective loss values for the one or more training data items by evaluating a loss function, wherein the loss function depends on the indication of the listening score; and
adjusting one or more internal parameters of the DNN based on the determined loss values.
EEE-A21. The DNN according to EEE-A19 or EEE-A20, wherein the indication of the listening score relates to a probability distribution of the listening score, the output stage of the DNN being adapted to generate the probability distribution of the listening score, wherein the probability distribution simulates listening scores obtained by a plurality of listening tests on the audio signal, and wherein the probability distribution is parameterized by two or more parameters of the probability distribution.
EEE-A22. The DNN according to EEE-A21 when dependent on EEE-A20,
Wherein determining the respective indications of the listening scores based on the one or more training data items comprises determining respective parameters of the probability distribution based on the one or more training data items, and
wherein the loss function depends on the parameters of the probability distribution.
EEE-A23. The DNN according to EEE-A21 or EEE-A22, wherein the probability distribution relates to a Gaussian distribution parameterized by mean μ and variance σ², or to a logistic distribution parameterized by mean μ and scale a.
EEE-A24. The DNN according to any one of EEE-A19 to EEE-A23, wherein the representation of the audio signal and the representation of the reference audio signal relate to a gammatone spectrogram.
EEE-A25. The DNN according to any one of EEE-A19 to EEE-A24, wherein the predefined listening test is a Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening test.
EEE-A26. An apparatus comprising a processor and a memory coupled to the processor and storing instructions for the processor, wherein the processor is adapted to perform a method according to any one of EEE-A1 to EEE-A18.
EEE-A27. An apparatus comprising a processor and a memory coupled to the processor and storing instructions for the processor, wherein the processor is adapted to implement the DNN according to any one of EEE-A19 to EEE-A25.
EEE-A28. A program comprising instructions which, when executed by a processor, cause the processor to perform a method according to any one of EEE-A1 to EEE-A18.
EEE-A29. A program comprising instructions which, when executed by a processor, cause the processor to implement the DNN according to any one of EEE-A19 to EEE-A25.
EEE-A30. A computer-readable storage medium storing the program of EEE-A28 or EEE-A29.
EEE-B1. A method of evaluating playout performance in an adaptive streaming environment, the method comprising:
obtaining playout-related information from a streaming client;
estimating a representation of a test audio signal based on the playout-related information, wherein the test audio signal is an audio signal played out by the streaming client; and
determining, using an audio quality assessment algorithm, an estimate of the audio quality of the test audio signal based on the estimated representation of the test audio signal.
EEE-B2. The method according to EEE-B1, further comprising:
generating a representation of a reference audio signal of the test audio signal.
EEE-B3. The method according to any one of the preceding EEE-B, further comprising:
obtaining, from the streaming client, an indication of the audio content processed by the streaming client.
EEE-B4. The method of EEE-B3, wherein estimating the representation of the test audio signal is further based on the indication of the audio content.
EEE-B5. The method according to EEE-B3 or EEE-B4 when dependent on EEE-B2, wherein generating the representation of the reference audio signal is based on the indication of the audio content.
EEE-B6. The method according to any one of the preceding EEE-B, wherein the playout-related information comprises bit rate information indicating the bit rate of an audio signal played out by the streaming client, and
Wherein estimating the representation of the test audio signal is based on the bit rate information.
EEE-B7. Method according to any of the preceding EEE-B, wherein said audio quality assessment algorithm uses a set of pre-trained models for audio quality assessment, and
Wherein generating the estimate of the audio quality comprises selecting a pre-trained model among the set of pre-trained models based on the playout-related information.
EEE-B8. The method according to EEE-B7, wherein the playout-related information comprises information related to a playout device associated with the streaming client, and
Wherein the pre-trained model is selected based on information related to the playout device.
EEE-B9. The method according to any one of the preceding EEE-B, wherein the audio quality assessment algorithm is implemented by a deep neural network (DNN) for estimating, as the estimate of the audio quality, an indication of a subjective listening score for the representation of the test audio signal, wherein the listening score is a score according to a predefined listening test, the DNN comprising:
an input stage for receiving the representation of the test audio signal and a representation of a reference audio signal of the test audio signal;
a plurality of layers for performing processing based on the representation of the test audio signal and the representation of the reference audio signal; and
an output stage for generating an indication of the listening score.
EEE-B10. The method according to EEE-B9, wherein the DNN has been configured by training the DNN in a training round among a plurality of training rounds by:
inputting one or more training data items, each training data item indicating a respective value of the listening score;
determining respective indications of the listening scores based on the one or more training data items;
determining respective loss values for the one or more training data items by evaluating a loss function, wherein the loss function depends on the indication of the listening score; and
adjusting one or more internal parameters of the DNN based on the determined loss values.
EEE-B11. The method according to any one of the preceding EEE-B, wherein the method is implemented at a network node different from the network node associated with the streaming client.
EEE-B12. The method according to any one of the preceding EEE-B, wherein the estimated representation of the test audio signal relates to one or more gammatone spectrograms.
EEE-B13. The method according to EEE-B2 or any EEE-B when dependent on EEE-B2, wherein the representation of the reference audio signal relates to one or more gammatone spectrograms.
EEE-B14. The method according to any one of the preceding EEE-B, further comprising outputting the estimate of the audio quality of the test audio signal to a network node different from the network node associated with the streaming client.
EEE-B15. The method according to any one of the preceding EEE-B, wherein the estimate of the audio quality of the test audio signal is output to a network node for performing encoding and/or packaging of the audio content, and
wherein the method further comprises optimizing the encoding and/or packaging based on the estimate of the audio quality of the test audio signal.
EEE-B16. The method according to EEE-B15, further comprising determining, based on the estimate of the audio quality of the test audio signal, an optimal number of quality levels in a bit rate ladder to be distributed over a content delivery network.
EEE-B17. The method according to EEE-B15 or EEE-B16, further comprising determining a configuration of an encoding tool and/or a set of encoding tools based on the estimate of the audio quality of the test audio signal.
EEE-B18. The method according to any one of the preceding EEE-B, further comprising:
determining an estimate of the audio quality of the test audio signal for streaming clients in each of a plurality of groups of streaming clients; and
comparing the estimates of the audio quality determined for the plurality of groups of streaming clients.
EEE-B19. A method of providing playout-related information at a streaming client processing audio content in an adaptive streaming environment, the method comprising:
generating the playout-related information by one or more of:
analyzing a playout buffer associated with the streaming client to determine bit rate information indicative of a bit rate of a segment of an audio signal played out by the streaming client;
analyzing manifest information associated with the audio content; and
analyzing characteristics of a playout device associated with the streaming client; and
outputting the playout-related information to a network node different from the network node associated with the streaming client.
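A minimal client-side sketch of EEE-B19, gathering playout-related information from the playout buffer, the manifest, and the playout device. All field names and the buffer/manifest/device structures are illustrative assumptions, not part of any streaming standard:

```python
def gather_playout_info(playout_buffer, manifest, device):
    # Derive playout-related information from the three sources named in
    # EEE-B19; the resulting dict would be sent to a separate network node.
    return {
        # per-segment names and bit rates, as analyzed from the playout buffer
        "segments": [
            {"name": seg["name"], "bitrate_kbps": seg["bitrate_kbps"]}
            for seg in playout_buffer
        ],
        # available quality levels, as analyzed from the manifest
        "available_bitrates_kbps": manifest["bitrates_kbps"],
        # playout-device characteristics (cf. EEE4: device type, SNR)
        "device": {"type": device["type"], "snr_db": device.get("snr_db")},
    }

info = gather_playout_info(
    playout_buffer=[{"name": "seg-001.m4s", "bitrate_kbps": 128}],
    manifest={"bitrates_kbps": [64, 96, 128, 192]},
    device={"type": "headphones", "snr_db": 35},
)
```

Segment names plus bit rates are exactly the kind of metadata that EEE3 says lets the network node reconstruct a description of the test signal for the objective quality assessment tool.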
EEE-B20. An apparatus comprising a processor and a memory coupled to the processor and storing instructions for the processor, wherein the processor is adapted to perform a method according to any one of EEE-B1 to EEE-B19.
EEE-B21. A program comprising instructions which, when executed by a processor, cause the processor to perform a method according to any one of EEE-B1 to EEE-B19.
EEE-B22. A computer-readable storage medium storing the program of EEE-B21.
References
[1] Y. Leng, X. Tan, S. Zhao, F. Soong, X.-Y. Li and T. Qin, "MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network", ICASSP 2021, pp. 391-395.
[2] W.-C. Huang, E. Cooper, J. Yamagishi and T. Toda, "LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech", ICASSP 2022, pp. 896-900.
[3] G. Mittag, S. Zadtootaghaj, T. Michael, B. Naderi and S. Möller, "Bias-Aware Loss for Training Image and Speech Quality Prediction Models from Multiple Datasets", 2021 13th International Conference on Quality of Multimedia Experience (QoMEX), pp. 97-102.
[4] Zhi Li, Christos G. Bampis, Lucjan Janowski, Ioannis Katsavounidis, "A Simple Model for Subject Behavior in Subjective Experiments", Proc. Int'l Symp. on Electronic Imaging: Human Vision and Electronic Imaging, 2020, pp. 131-1-131-14, https://doi.org/10.2352/ISSN.2470-1173.2020.11.HVEI-131.
[5] Co-pending patent application "Robust Intrusive Perceptual Audio Quality Assessment based on Convolutional Neural Networks", applicant's docket No. D20118, filed as U.S. provisional patent application 63/119,318 and international patent application PCT/EP2021/083531, as disclosed in WO/2022/112594.