
GB2487795A - Indexing media files based on frequency content - Google Patents

Indexing media files based on frequency content

Info

Publication number
GB2487795A
GB2487795A GB1102075.7A GB201102075A
Authority
GB
United Kingdom
Prior art keywords
data
audio signal
frequency
frequency content
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1102075.7A
Other versions
GB201102075D0 (en)
Inventor
Fadil Channer
Hussan Choudhry
Amin Ur Rehman
Juan Cabrera
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SLOWINK Ltd
Original Assignee
SLOWINK Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SLOWINK Ltd filed Critical SLOWINK Ltd
Priority to GB1102075.7A priority Critical patent/GB2487795A/en
Publication of GB201102075D0 publication Critical patent/GB201102075D0/en
Publication of GB2487795A publication Critical patent/GB2487795A/en
Withdrawn legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F17/3074
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer implemented method provides a plurality of audio data files 26, each comprising a header 30 and audio signal data 28, and configures a database by providing a plurality of associations, each indicating an association between a respective one of the plurality of data files and an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon frequency content of the audio signal. In some cases the frequency content of the audio signal comprises a plurality of frequency data windows. The processor 20 applies a transform such as a Fourier transform to each data window (100, 110, fig. 2) to determine spectral or frequency content (100', 110', fig. 2) and writes the spectral content data to a second memory area 18 of memory 14. The index may be based on a centre of gravity of a weighted average of frequency content.

Description

Data Indexing

The present invention relates to methods, apparatus and systems for indexing and retrieving media files, such as audio data files, for example music data files.
Collections of media data files, such as music, are typically arranged according to indices such as the name of the artist/author, title, album name, genre, or another index assigned by the owner or creator of a collection. Music collections often comprise tens or even hundreds of gigabytes of data and include tens of thousands of individual audio tracks.
Makers of audio player devices have devised user interfaces which provide visual indications of the content of groups of files based on the album artwork. These enable users to browse a collection without relying on text based indicators. Devices have also been provided which enable users to assemble playlists based on subjective criteria. Some devices enable a user to assign a score or preference indicator to a file or group of files, for example by providing a star rating for a particular track. Other methods of subjective categorisation, for example by genre, have also been proposed.
The inventors in the present case have recognised that systems based on a subjective (i.e. human) judgment of some 'abstract' qualitative feature provide 'divorced' tags with no predictable relationship between them. For example, the same human listener may apply different tags to the same music on different days, depending on their mood or other factors. Tags applied by a number of different listeners are likely to be still less predictable. The inventors in the present case have further recognised that the label or tag applied to a particular piece of music in such a subjective system does not describe a reliable relative categorisation of music, and they have appreciated that, in practical modern music databases, this presents a technical challenge. Absent specialist prior knowledge, a user wishing to select a track from a collection must browse through it. This means that the supporting computer hardware, for example a hand held device or internet music streaming system, must retrieve at least the description of each file that the user wishes to review before making their selection. If a user is trying to select music which is suitable for a particular mood or circumstance they may also wish to review some of the content, making it necessary to retrieve and play back the content of music files which may be totally unsuited to the user's requirements. This browsing activity comes with an associated overhead in processing power, power consumption and, in distributed systems, bandwidth.
The inventors in the present case have recognised that there exists a need in the art for a method and apparatus to facilitate retrieval of media data files. Aspects and examples of the present invention are set out in the claims.
Unlike existing methods of configuring music databases, in which pieces of music are tagged with a category, "star-rating" or other subjective score, examples of the invention provide a quantitative relative scale against which different types of music can be judged.
This has the advantage that a user can retrieve a piece of music based on it being, for example more energetic than one music track and, for example, less instrumental than another track. The provision of a quantitative (objectively measured) relationship between music tracks enables users to locate music efficiently to suit a particular mood or requirement without needing to trawl through a large music library and without requiring the user to have specific knowledge of the music tracks from which they are selecting.
In addition, by providing quantitative metrics of qualitative (perceived) features of music tracks, embodiments of the invention enable reliable relative categorisation of music. In geographically distributed systems, such as internet systems this has the particular advantage of reducing network traffic associated with data retrieval because many fewer options need to be presented to a user before a selection is made. In the context of portable user equipment such as a handheld media playback device, cellular telephone, tablet computer, or other media storage device the power consumption associated with selecting media can be reduced because fewer choices need to be presented to a user.
In an aspect there is provided a computer implemented method comprising: providing a plurality of audio data files, each comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, configuring a database by providing a plurality of associations, each indicating an association between a respective one of the plurality of data files and an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon frequency content of the audio signal. This has the advantage that music data files can be efficiently selected from the database.
In one possibility frequency content of the audio signal comprises a plurality of frequency data windows and each frequency data window is based on frequency content of the audio signal at a time interval associated with a corresponding one of a plurality of portions of the audio signal. In one possibility having a dependence upon frequency content of the audio signal comprises having a dependence upon a centre of gravity of frequency content of the audio signal. In one possibility the centre of gravity is determined based on a weighted average of the frequency content. In one possibility the weighting of a frequency band in the weighted average is based on the amplitude or energy of the audio signal at that band. In some possibilities the weighting of the weighted average comprises a weighting function selected to scale the amplitude or energy of the frequency content based on frequency band.
In one example, being based on the total amplitude or energy comprises being based on a measure of loudness, preferably in which the loudness of a frequency band comprises a power law function of the energy or amplitude of the audio signal in that band. In a particularly advantageous example the power law function comprises raising the energy to a non-integer power, preferably wherein the non-integer power is less than 1, preferably less than 0.5, preferably less than 0.3, preferably less than 0.25, preferably greater than 0.01, preferably greater than 0.1, preferably greater than 0.2, still more preferably substantially equal to 0.23.
Preferably the weighted average comprises the mean of the weighted averages of a plurality of the frequency data windows. Preferably the frequency data windows are approximately 40ms to 80ms in length.
In some possibilities the frequency content comprises a plurality of data values each based on the energy or amplitude of the audio signal in a corresponding one of a plurality of frequency bands, for example the frequency bands may be selected to correspond to the frequency response of human hearing. Preferably the frequency bands have band edges which correspond to band edges of critical bands of human hearing.
Preferably the frequency bands comprise at least some (for example 1, 2 or more, or 6 or more) of the Bark bands. As will be appreciated, the Bark bands correspond to samplings of a continuous variation in the frequency response of the ear to a sinusoidal process. Preferably twenty-six Bark bands are used to provide higher frequency content for an evaluation of sharpness.
In some possibilities each of the plurality of portions of the audio signal is selected to overlap the preceding portion by half of the duration of the portion or less.
In some possibilities the audio signal in each of the plurality of portions is scaled using a data taper, such as a Hanning window. As will be appreciated by the skilled reader, data tapers are functions which reduce the amplitude of time series data towards the limits of a time series data window to reduce spectral leakage during spectral estimation processes. The inventors in the present case have found that a Hanning window is particularly advantageous for the methods described herein because of the side-lobe roll-off exhibited by this data taper. Other examples may be used such as, for example, a Hamming window or a Blackman window.
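By way of illustration, the following minimal sketch (Python with NumPy, not taken from the patent) shows a Hanning taper being applied to one portion of audio signal data before spectral estimation; the 60 ms portion length and the 44.1 kHz sample rate are assumptions chosen for the example.

```python
import numpy as np

fs = 44100                      # assumed sample rate
n = int(0.060 * fs)             # one 60 ms portion
portion = np.random.randn(n)    # stand-in for real audio samples

taper = np.hanning(n)           # amplitude falls towards the window edges
tapered = portion * taper       # reduces spectral leakage in the transform

# As noted above, other tapers could be substituted:
# taper = np.hamming(n)  or  taper = np.blackman(n)
```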
In one possibility the index data value is based on the sum of the weighted average taken across the plurality of portions of the audio signal.
In one possibility the frequency content comprises frequency transform data of the audio signal in which each frequency band comprises a plurality of frequency components of the frequency transform data. A frequency transform may comprise a data operation that decomposes a signal into constituent frequencies, preferably wherein the operation comprises one of: a Fourier transform; a fast Fourier transform; a discrete cosine transform; a wavelet transform; a Laplace transform.
In one possibility, having a dependence upon frequency content of the audio signal comprises being correlated with frequency content of the audio signal. In one possibility a plurality of data files are provided, each data file having an associated index data value and in which at least 10% of the variance of the index data values is associated with frequency content of the audio signal.
In an aspect there is provided a computer implemented method comprising: providing an audio data file comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, providing an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon frequency content of the audio signal. In some possibilities providing the audio data file comprises transmitting the audio data file over a network in response to a request, in which the request comprises an index data value indicating frequency content of a requested audio data file.
In one possibility providing the audio data file comprises retrieving the audio data file from a memory of an audio player in response to a request comprising an index data value indicating frequency content of a requested audio data file. Preferably the audio player is battery powered.
In an aspect there is provided an audio data file comprising: a header and audio signal data, the header comprising data to enable music playing software to decode the audio signal data to provide an audio signal; and, at least one index data value in addition to the audio signal data and configured to enable the audio data file to be indexed in a database, wherein the index data value has a dependence upon frequency content of the audio signal. In some examples the audio data file comprises an MPEG encoded audio data file such as that described in ISO/IEC 13818-3:1998, the entirety of which is incorporated herein by reference for all purposes. In some examples the audio data file comprises an Advanced Audio Coding (AAC) file such as that described in ISO/IEC 13818-7 and ISO/IEC 14496-3, the entire contents of which are hereby incorporated by reference for all purposes.
As will be appreciated, these are merely examples and other audio file formats are within the scope of the invention.
In one possibility there is provided a data structure configured to provide an audio data file according to the preceding aspect. In one possibility there is provided a network message comprising this data structure.
In an aspect there is provided a music player configured to perform any of the methods described herein. In another aspect there is provided a computer program product comprising program instructions operable to program a programmable processor to perform any of the methods described herein.
In an aspect there is provided a data file selection method for facilitating retrieval of a data file from a plurality of stored data files, the method comprising: storing a plurality of data file identifiers, each data file comprising signal data for reading by an output device to provide time-series output to a user and wherein each identifier identifies one of the plurality of stored data files; storing a plurality of measurement values, wherein each data file identifier is associated with at least one measurement value and each measurement value is derived from the signal data of the data file identified by the associated data file identifier; receiving at least one input data value and selecting, based on the measurement values and the input data value, a subset of the data file identifiers; and, outputting at least some of the subset of the data file identifiers to enable a data file identified by the subset to be retrieved from the plurality of stored data files.
In one possibility the signal data comprises audio time series data. In one possibility the signal data comprises visual time series data, such as a series of images, for example video data.
In one possibility, one of the at least one measurement values is derived from analysis of the spectral content of the signal data. In one possibility, one of the at least one measurement values is derived from temporal analysis of the signal data. In one possibility, one of the at least one measurement values is derived from temporal analysis of the spectral content of the signal data.
Examples of the invention may be implemented in software, middleware, firmware or hardware or any combination thereof. Embodiments of the invention comprise computer program products comprising program instructions to program a processor to perform one or more of the methods described herein, such products may be provided on computer readable storage media or in the form of a computer readable signal for transmission over a network. Embodiments of the invention provide computer readable storage media and computer readable signals carrying data structures, media data files or databases according to any of those described herein.
Apparatus aspects may be applied to method aspects and vice versa. The skilled reader will appreciate that apparatus embodiments may be adapted to implement features of method embodiments and that one or more features of any of the embodiments described herein, whether defined in the body of the description or in the claims, may be independently combined with any of the other embodiments described herein.
Particular embodiments of the invention will now be described in greater detail, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 shows an apparatus for media indexing and retrieval according to an embodiment of the invention;
Figure 2A shows a flow chart indicating a method of categorising music data files for later retrieval;
Figure 2B shows a very schematic diagram of time series analysis of a media file for application in some embodiments;
Figure 3 shows a flow chart indicating a method of categorising music data files for later retrieval;
Figure 4A shows a flow chart indicating a method of assigning a characteristic fingerprint to an audio data file;
Figure 4B shows a flow chart indicating a method of categorising music data files for later retrieval based on a fingerprint produced according to Figure 4A; and,
Figure 5 shows a flow chart indicating a method of configuring a database.
Before describing embodiments of the invention in detail it is useful to set out an overview. The inventors in the present case have developed quantitative metrics derivable from music audio data which measure qualitative features of that music as perceived by a human listener. One particularly advantageous application of these metrics is to increase the efficiency and speed with which a user can select music tracks having particular qualitative features (e.g. the degree to which music is energetic or calming, instrumental or vocal, minimalistic or busy).
The example of Figure 1 shows a very schematic diagram of an apparatus according to the invention in which a processor 20 is coupled to a file store interface 12 and to a memory 14. In Figure 1, memory 14 comprises at least two memory areas 16, 18 and processor 20 is loaded with program instructions 22 which are configured to program the processor to perform a method such as one or more of those methods described below with reference to Figures 2A and 2B. A user interface 24 is coupled to the processor to provide input/output functions for interaction with a user.
The file store 10 stores a plurality of music data files 26. Each music data file 26 comprises audio signal data 28 which can be played out by an output device to provide music. Associated with each data file 26 is a file identifier 30, such as a file name or other addressable index to enable the file to be retrieved from the file store 10. The file store 10 comprises a non volatile memory such as hard disc drive, or solid state drive. The file store interface 12 is operable to read data files 26 from the file store and to retrieve a particular file 26 based on the corresponding data file identifier 30.
The processor 20 is configured by program instructions 22 to be operable to control the file store interface 12 to retrieve data files 26 from the file store 10 and to load the audio signal data 28 of the data files into memory area 16 of the memory 14 so that the processor 20 can perform data operations and transformations on the loaded data to produce transformed data and analysis data for storage in a different memory area 18.
External interface 32 is operable to provide data, such as additional music data files 34, to the apparatus for storage in the file store 10. The user interface 24 is controllable by the processor to provide output to a user and to receive user input data to be passed to the processor 20.
Although a specifically adapted hardware apparatus has been described, the apparatus of Figure 1 may be provided by a general purpose computer or by a handheld portable music player such as an MP3 player and/or a mobile telephone or other electronic device. In some examples the system of Figure 1 may be implemented in a geographically distributed apparatus, for example the user input/output functions may be provided by user equipment such as a portable hand held device, or by a web page, whilst the file storage and retrieval functions may be provided by a web server or other network device coupled to the user equipment by a wired and/or wireless communication interface, e.g. WLAN, LAN, or other wide area network, such as the internet and/or combinations of such networks. As will be appreciated, examples of the invention provide particular advantage in distributed systems and in systems having battery powered components.
Referring now to Figure 2A and Figure 2B, Figure 2A shows a flow chart illustrating one method of determining the degree to which music is perceived as 'minimalistic' by a listener. At step 50 a music file 26, 34 is loaded into a memory area 16 of memory 14. The audio signal data 28 of the music file 26 is identified at step 52 and segmented into a plurality of data windows 100, 110 (shown in Figure 2B) of 60 ms in length. Each window 110 overlaps the previous window 100 by an overlap time 120. In Figure 2B the overlap time 120 is half of the total window length 130. At step 56 the processor may apply a data taper such as a Hamming window to the audio signal data in each window. At step 58 the processor 20 then applies a frequency transform, such as a Fourier transform, to each data window to determine the spectral content 100', 110' of each window 100, 110. Processor 20 writes the spectral content data 100', 110' to a second memory area 18 of memory 14. This spectral content data 100', 110' can be thought of as a two dimensional table for the song which indicates the frequency content of the music as a function of time. For example the table could comprise one column for each time interval and one row for each frequency or frequency band.
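A minimal sketch of steps 50 to 58 is given below, assuming a mono signal of uncompressed samples; the function name and parameters are illustrative only and do not appear in the patent.

```python
import numpy as np

def spectral_table(audio, fs, window_s=0.060, overlap=0.5):
    """Sketch of steps 50 to 58: segment the audio signal data into
    overlapping windows, taper each window, and apply a Fourier transform
    to build the two dimensional time/frequency table described above.
    Rows correspond to frequency bins, columns to time intervals."""
    n = int(window_s * fs)          # 60 ms windows, as in Figure 2B
    hop = int(n * (1 - overlap))    # 50% overlap (overlap time 120)
    taper = np.hanning(n)           # data taper applied at step 56
    columns = []
    for start in range(0, len(audio) - n + 1, hop):
        window = audio[start:start + n] * taper
        columns.append(np.abs(np.fft.rfft(window)))  # step 58: transform
    return np.array(columns).T
```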
At step 60 the processor subdivides each item of spectral content data 100', 110' into twenty-six frequency bands 200, 220 selected to correspond to Bark bands. In the description that follows, the index of these frequency bands is denoted by z (where z is an integer between 1 and 26). The edges of these frequency bands (Bark bands) are 0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000 and 15500 Hertz. These values are merely useful approximations and the bands may be selected to be contiguous or non-contiguous and may comprise less than all of the Bark bands.
At step 62, the processor determines the total energy, e_z, in each band by summing the energy of each frequency component in that band. The processor then determines the perceived loudness, L_z, of each band according to the relationship L_z = e_z^0.23. The processor 20 then writes the determined loudness values of each band to the memory 14.
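The band energy and loudness computation of steps 60 and 62 might be sketched as follows, using the Bark band edges listed above and consuming one spectral column produced by the windowing sketch earlier; this is an illustrative reading, not the patent's code.

```python
import numpy as np

# Bark band edges in Hertz, as listed in the description above.
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]

def band_loudness(spectrum, fs):
    """Sketch of steps 60 and 62 for one spectral column: sum the energy of
    the frequency components falling in each Bark band to obtain e_z, then
    map each band energy to a perceived loudness L_z = e_z ** 0.23."""
    freqs = np.linspace(0.0, fs / 2.0, len(spectrum))  # bin frequencies
    energy = spectrum ** 2
    loudness = []
    for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:]):
        e_z = energy[(freqs >= lo) & (freqs < hi)].sum()  # band energy e_z
        loudness.append(e_z ** 0.23)                      # power-law loudness
    return np.array(loudness)
```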
To determine the sharpness of each data window 100, 110 the processor 20 calculates a sharpness metric S, according to equation 1, below for each window.
S = Σ_z (z · g_z · L_z) / Σ_z L_z (1)

wherein z indicates the index number of the frequency band and g is a function that modifies the contribution of the loudness in each of the frequency bands to the sharpness measurement in that band. The weighting function, g, can be calculated according to a relationship such as g_z = 1, where z is less than fifteen, and g_z = A·exp[k·z], where z is greater than 15.
As will be appreciated this is merely a particularly advantageous method of estimating the perceived spectral sharpness of signal data in a media file. Other measures may be used. For example, the quantity, S, in equation 1 is related to a weighted average of the frequency transform frequencies wherein the weighting of a particular frequency comprises the loudness at that frequency. In equation 1 the frequencies are banded into a series of bands, z, but the sharpness measure may be calculated from frequency transform data without grouping the frequency data into bands.
At step 66 the processor calculates the sum of the sharpness values, S of all the windows. The processor 20 then normalises the total sharpness by dividing it by the total number of windows.
At step 68 an association between the data file identifier 30 and the sharpness value, S, is stored into the file store to enable the file to be indexed based on the sharpness of the audio data.
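Putting steps 64 to 68 together, a sketch of the sharpness index follows, assuming the per-window loudness vectors from the previous sketch assembled into a (bands × windows) array. It relies on the reconstructed form of equation 1, and the constants A and k of the weighting function g_z (and the treatment of the boundary band z = 15) are assumptions drawn from common sharpness models rather than values given in the patent.

```python
import numpy as np

def sharpness_index(loudness_table, A=0.066, k=0.171):
    """Sketch of steps 64 to 68: evaluate the reconstructed equation 1,
    S = sum_z(z * g_z * L_z) / sum_z(L_z), for every window, sum over
    the windows and normalise by the window count (step 66)."""
    n_bands, n_windows = loudness_table.shape
    z = np.arange(1, n_bands + 1)                 # band index, 1-based
    g = np.where(z < 15, 1.0, A * np.exp(k * z))  # weighting function g_z
    total = 0.0
    for w in range(n_windows):
        L = loudness_table[:, w]
        if L.sum() > 0:
            total += (z * g * L).sum() / L.sum()  # equation 1 per window
    return total / n_windows                      # step 66: normalise
```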
Figure 3 illustrates another method of indexing music data files. At step 216, steps 50 to 58 proceed as described above with reference to Figure 2A. At step 218 the processor 20 subdivides each item of spectral content data 100', 110' into nine frequency bands and sums the energy or amplitude of the spectral content in each band for each item of spectral content data 100', 110' so that the spectral content of each window 100, 110 is represented by nine data values corresponding to the energy or amplitude in each band. At step 220, based on these sets of nine data values, the processor determines the difference between the energy or amplitude in each band and the energy or amplitude in that band during the preceding time window. This provides a set of difference values, one difference value for the second and each subsequent window 100, 110. The processor 20 stores these difference values into memory 14. At step 222, in each band the processor 20 identifies local maxima in the difference values by comparing each difference value with the difference values in the preceding five windows and the difference values in the subsequent five windows. If a difference value is greater than the difference values in the preceding five windows and greater than the difference values in the subsequent five windows it is labelled as a local maximum. A matrix of label values is assembled from this process which includes one value for each difference value.
In this matrix, the label values corresponding to local maxima are set to a value of one and other label values (non-maxima) are set to zero. These two process steps may be represented in mathematical notation, thus:

D(i,j) = P(i,j) − P(i−1,j)

Dmax(i,j) = 1 if D(i,j) = max[D(i−5 : i+5, j)], 0 otherwise

In which: P(i,j) represents the energy or amplitude in the band j during time window i, D represents the difference values and Dmax is the matrix of label values which identify the local maxima.
At step 224, the processor 20 calculates a measure of the difference data, Da(i), based on the number of local maxima in each time window:

Da(i) = max[Σ_j D1(i,j), Σ_j D2(i,j)]

in which

D1(i,j) = 1 if [Dmax(i−1,j) + Dmax(i,j)] > 0, 0 otherwise

D2(i,j) = 1 if [Dmax(i,j) + Dmax(i+1,j)] > 0, 0 otherwise

The quantity Da provides the number of locally significant increases in signal amplitude in a given time window 100, 110; D1 and D2 accommodate frame peak misalignment.
At step 224 the processor 20 compares the number of locally significant increases in signal amplitude in a given time window with a threshold number. If the number is greater than the threshold then the window is labelled as an onset. An onset can be used to indicate the beginning of a musical note or other sound, in which the amplitude rises to an initial peak.
A frame is identified as containing an onset if there is a local maximum in the difference data for more than the threshold number of bands. In the example of Figure 3 the threshold number of local maxima is set to 4 but another threshold may be used depending upon the width and number of the frequency bands selected.
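A sketch of the onset labelling of steps 220 to 224 follows, using the reconstructed difference equations above. P is the (windows × bands) energy table from step 218; the handling of ties at local maxima and the array layout are assumptions made for this example.

```python
import numpy as np

def onset_windows(P, span=5, threshold=4):
    """Sketch of steps 220 to 224. P[i, j] is the energy or amplitude in
    band j during window i (nine bands in the Figure 3 example)."""
    D = np.diff(P, axis=0)              # D(i,j) = P(i,j) - P(i-1,j)
    n = D.shape[0]
    Dmax = np.zeros(D.shape, dtype=int)
    for i in range(n):
        lo, hi = max(0, i - span), min(n, i + span + 1)
        # label a local maximum where D(i,j) tops the surrounding +/-5 windows
        Dmax[i] = (D[i] == D[lo:hi].max(axis=0)).astype(int)
    # D1 and D2 accommodate frame peak misalignment (reconstructed equations)
    D1 = np.zeros(D.shape, dtype=int)
    D1[1:] = ((Dmax[:-1] + Dmax[1:]) > 0).astype(int)
    D2 = np.zeros(D.shape, dtype=int)
    D2[:-1] = ((Dmax[:-1] + Dmax[1:]) > 0).astype(int)
    Da = np.maximum(D1.sum(axis=1), D2.sum(axis=1))
    return Da > threshold               # True where a window is an onset
```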
At step 226 the processor 20 extracts the spectral content data 100', 110' for each of the identified onsets. For each frequency band the onsets are assigned a rank according to the signal energy in that frequency band so that, in any given frequency band, the onset with the highest signal energy in that band is assigned a rank of one. Progressively lower ranks are assigned to onsets having progressively lower signal energy until the onset having the lowest energy in that band is assigned the last rank (equal to the total number of onsets). This process is repeated for each frequency band until every band in every onset has been assigned a rank.
At step 228, for each particular onset, the processor compares the rank of each band with a series of threshold rank values, n, the threshold rank values being integers between one and the total number of onsets, N. At each threshold rank value, n, the processor determines the number of bands, T(n), of the particular onset which are of higher rank than that threshold rank value. For each onset, the result of this process is a series, T(n), comprising N integer values, each labelled with a threshold rank value, n. For each onset, each integer in the series T(n) indicates the number of bands in that onset having a rank greater than or equal to n.
The final rank of each onset is determined at step 230 by evaluating the value of n at which T(n) is greater than or equal to a threshold rank value, K. To provide a measure of the total energy of the track the processor sums the final ranks of all the onsets and divides by the total number of onsets. This measure of energy of the track can be used to index the track in a database.
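The ranking procedure of steps 226 to 230 might be sketched as follows. Reading T(n) as the number of bands ranked n or better, and the choice K = 5 (with K no larger than the band count), are assumptions made for this example; the patent leaves K as a parameter.

```python
import numpy as np

def track_energy_index(onset_spectra, K=5):
    """Sketch of steps 226 to 230. onset_spectra[o, j] holds the signal
    energy of onset o in frequency band j."""
    n_onsets, n_bands = onset_spectra.shape
    # Rank the onsets within each band: rank 1 = highest energy in that band.
    order = np.argsort(-onset_spectra, axis=0)
    ranks = np.empty_like(order)
    for j in range(n_bands):
        ranks[order[:, j], j] = np.arange(1, n_onsets + 1)
    final_ranks = []
    for o in range(n_onsets):
        for n in range(1, n_onsets + 1):
            if np.count_nonzero(ranks[o] <= n) >= K:  # T(n) >= K
                final_ranks.append(n)                 # final rank of onset o
                break
    # Measure of track energy: sum of final ranks over the number of onsets.
    return sum(final_ranks) / n_onsets
```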
Figure 4A shows a flow diagram indicating a component of a method of determining the vocal or instrumental content of a track. At step 400, the processor determines the short-time Fourier transform table of the track according to steps 52 to 58 described above with reference to Figure 2A. At step 402, for each window 100, 110 the processor maps the power (e.g. the square or absolute value of the amplitude or energy) of the frequency content data 100', 110' onto the Mel scale. This is a frequency scale based on pitch comparisons to provide a "perceptual scale" of pitches which are judged by listeners to be equal in distance from one another. A conversion from linear frequency (Hz) to the Mel scale is provided, thus:

mel = 2595 · log10(1 + f/700)

BW = (mel_MAX − mel_MIN) / (M + 1)

In which mel_MAX and mel_MIN denote the highest and lowest mel frequencies respectively, obtained from converting the highest and lowest frequency in hertz (e.g. 0 Hz and about 10 kHz), and M denotes the number of bands.
The central frequencies, mel_c, on the Mel scale may be defined thus:

mel_c(m) = m · BW, m = 1 ... M

And the central frequencies converted back to Hz:

f_c(m) = 700 · (10^(mel_c(m)/2595) − 1)

f_c = [0, 66.6, 139.5, 219.4, 307.0, 402.8, 507.8, 622.7, 748.6, 886.5, 1037.6, 1202.9, 1384.1, 1582.4, 1799.6, 2037.6, 2298.1, 2583.5, 2896.0, 3238.2, 3613.1, 4023.6, 4473.1, 4965.5, 5504.7, 6095.3, 6742.0, 7450.3, 8226.1, 9075.6, 10000] Hz

The boundaries of each band will be the central frequencies of the previous and the next band, respectively.
At step 404, in each Mel frequency band the processor determines the logarithm, for example the natural logarithm, of the power values determined at step 402. At step 406 the processor determines the so-called Mel frequency cepstral coefficients, MFCC, by taking the discrete cosine transform of the logarithmic values determined at step 404. The output of this process is the MFCC values for each window of frequency content data. Typically the first 13 MFC coefficients are used, but in some examples as many as 30 coefficients are used.
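A sketch of steps 402 to 406 for one window follows. The construction of the Mel filterbank from the centre frequencies listed above is omitted for brevity, and the matrix form of the Mel mapping is an assumption for this example.

```python
import numpy as np
from scipy.fft import dct

def mfcc_for_window(power_spectrum, mel_filterbank, n_coeffs=13):
    """Sketch of steps 402 to 406 for one window. mel_filterbank is assumed
    to be an (M x bins) matrix of band weights built from the centre
    frequencies listed above."""
    mel_power = mel_filterbank @ power_spectrum     # step 402: Mel mapping
    log_power = np.log(mel_power + 1e-12)           # step 404: natural log
    return dct(log_power, norm='ortho')[:n_coeffs]  # step 406: first 13 MFCCs

def hz_to_mel(f):
    # The conversion given above: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)
```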
At step 408, for each sample of music (for example a one second sample comprising thirty time windows, 100, 110) the processor determines the mean, standard deviation, skewness and kurtosis of each MFCC, e.g. the first four statistical moments of the MFCC at each Mel frequency. This provides a fingerprint of four numeric values for each sample of music.
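Step 408 might be sketched as follows for one sample of music; the array layout is an assumption.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def fingerprint(mfcc_windows):
    """Sketch of step 408. mfcc_windows is an (n_windows x n_coeffs) array
    holding the MFCCs of one sample of music (e.g. about thirty windows
    covering one second). Returns the four statistical moments per
    coefficient."""
    return np.stack([
        mfcc_windows.mean(axis=0),       # mean
        mfcc_windows.std(axis=0),        # standard deviation
        skew(mfcc_windows, axis=0),      # skewness
        kurtosis(mfcc_windows, axis=0),  # kurtosis
    ])
```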
Figure 4B shows a method of indexing music tracks based on a method comprising the steps described above with reference to Figure 4A. At step 450 a learning database is assembled comprising a number of samples of music known to be either predominantly vocal or predominantly instrumental. At step 452 the method described above with reference to Figure 4A is performed on each sample of music in the training database. At step 453 the statistical moments of the MFCC coefficients produced by this method are indexed in a look-up table according to the degree to which the particular sample of the learning database is known to be a purely instrumental, or purely vocal sample of music, or a mixture of the two.
The learning database consists of a number of one second length samples labelled as either pure vocal or pure instrumental. Although some pure vocal samples may contain some instrumental components they may still be categorised as pure vocal.
At step 454 a music track to be categorised is loaded into memory and subdivided into a series of samples, each sample being one second in length. At step 456, each sample is treated according to the method described above with reference to Figure 4A to provide the fingerprint (four statistical moments of the MFCC) of that sample.
At step 458 the fingerprint of each sample is compared with the look-up table derived from the training data at step 453 to assign a vocal/instrumental categorisation to each sample of the track. This process is repeated until all the samples of the track have been categorised. The track is then given an overall score based on the categories of its constituent samples. For example a music track can be categorised according to the percentage of its duration that includes vocals and/or the percentage of its duration that includes purely instrumental music. This provides a metric by which an audio track can be indexed in a database according to its vocal/instrumental content.
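The comparison rule at step 458 is not specified in detail; the sketch below assumes a simple Euclidean nearest-neighbour match against the training fingerprints, which is one plausible reading rather than the patent's stated method.

```python
import numpy as np

def classify_sample(sample_fp, reference_fps, reference_labels):
    """Sketch of step 458: assign the label of the nearest training
    fingerprint (nearest-neighbour matching is an assumption)."""
    distances = [np.linalg.norm(sample_fp - ref) for ref in reference_fps]
    return reference_labels[int(np.argmin(distances))]  # 'vocal' or 'instrumental'

def vocal_percentage(sample_labels):
    # Overall track score: percentage of one second samples labelled vocal.
    return 100.0 * sample_labels.count('vocal') / len(sample_labels)
```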
Figure 5 shows a method of configuring a database of music tracks. At step 602 a music track is assigned a first index value having a dependence upon frequency content of the audio signal according to the method described with reference to Figure 2A and Figure 2B. At step 604 the music track is assigned a second index value based on the degree to which the audio track is calming/energetic as measured according to the method described above with reference to Figure 3. At step 606 the music track is assigned a third index value based on the degree to which the music track is vocal or instrumental in content.
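A sketch of the Figure 5 database configuration follows; the dictionary layout and key names are illustrative assumptions, and the three values are assumed to have been computed by the Figure 2A, Figure 3 and Figure 4B methods respectively.

```python
def configure_database(tracks):
    """Sketch of Figure 5: associate each track identifier with the three
    index values described above (steps 602, 604 and 606)."""
    database = {}
    for track_id, sharpness, energy, vocal_content in tracks:
        database[track_id] = {
            'sharpness': sharpness,          # step 602: first index value
            'energy': energy,                # step 604: second index value
            'vocal_content': vocal_content,  # step 606: third index value
        }
    return database

# Hypothetical usage: select calmer tracks without reviewing the whole library.
db = configure_database([('track_a', 12.4, 3.1, 40.0),
                         ('track_b', 9.8, 7.6, 85.0)])
calm = [tid for tid, idx in db.items() if idx['energy'] < 5.0]
```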
The resulting database provides a relative, quantitative scale which enables music tracks to be indexed and retrieved based on objective determination of their qualitative features. Importantly, unlike existing methods in which pieces of music are tagged with a category, "star-rating" or other subjective score, examples of the invention provide a quantitative relative scale against which different types of music can be judged. This has the advantage that a user can retrieve a piece of music based on it being, for example more energetic than one music track and, for example, less instrumental than another track. The provision of this quantitative relationship between music tracks enables users to efficiently locate music to suit a particular mood or requirement without needing to trawl through a large music library and without requiring the user to have specific knowledge of the music tracks from which they are selecting.
Of course, the drawings shown herein are merely schematic and should not be construed as limiting in any way. Although the functional elements of the apparatus of Figure 1 are shown as discrete units the function each unit provides may be integrated into one or more common units or components and/or distributed differently among separate components or units of a device or a geographically distributed system.

Claims (35)

  1. A computer implemented method comprising: providing a plurality of audio data files, each comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, configuring a database by providing a plurality of associations, each indicating an association between a respective one of the plurality of data files and an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon frequency content of the audio signal.
  2. The method of claim 1 in which frequency content of the audio signal comprises a plurality of frequency data windows and each frequency data window is based on frequency content of the audio signal at a time interval associated with a corresponding one of a plurality of portions of the audio signal.
  3. The method of claim 2 in which having a dependence upon frequency content of the audio signal comprises having a dependence upon a centre of gravity of frequency content of the audio signal.
  4. The method of claim 3 in which the centre of gravity is determined based on a weighted average of the frequency content.
  5. The method of claim 4 in which the weighting of a frequency band in the weighted average is based on the amplitude or energy of the audio signal at that band.
  6. The method of claim 4 or 5 in which the weighting of the weighted average comprises a weighting function selected to scale the amplitude or energy of the frequency content based on frequency band.
  7. The method of claim 6 in which being based on the total amplitude or energy comprises being based on a measure of loudness.
  8. The method of claim 7 in which the loudness of a band comprises a power law function of the energy or amplitude of the audio signal in that band.
  9. The method of claim 8 in which the weighted average is normalised by the total loudness.
  10. The method of any of claims 4 to 9 in which the weighted average comprises the mean of the weighted averages of a plurality of the frequency data windows.
  11. The method of any preceding claim in which the frequency content comprises a plurality of data values each based on the energy or amplitude of the audio signal in a corresponding one of a plurality of frequency bands.
  12. The method of claim 11 in which the frequency bands are selected to correspond to the frequency response of human hearing.
  13. The method of claim 12 in which the frequency bands have band edges which correspond to band edges of critical bands of human hearing.
  14. The method of claim 12 or 13 in which the frequency bands comprise at least some of the Bark bands.
  15. The method of any of claims 2 to 14 in which each of the plurality of portions of the audio signal is selected to overlap the preceding portion by half of the duration of the portion or less.
  16. The method of any of claims 2 to 15 further comprising scaling the audio signal in each of the plurality of portions using a data taper, such as a Hamming window.
  17. The method of any of claims 12 to 16 in which the index data value is based on the sum of the weighted average taken across the plurality of portions.
  18. The method of any of claims 1 to 10 in which the frequency content comprises frequency transform data of the audio signal in which each frequency band comprises a plurality of frequency components of the frequency transform data.
  19. The method of claim 2 in which having a dependence upon frequency content of the audio signal comprises having a dependence on the number of onsets in the audio signal data.
  20. The method of claim 19 in which the dependence comprises a dependence upon difference data based on frequency content differences between frequency data windows.
  21. The method of claim 20 in which the difference data is based on frequency content differences between frequency data windows in at least one selected frequency band, preferably in a plurality of selected frequency bands.
  22. The method of claim 20 or 21 in which the difference data are based on differences which exceed a threshold difference value.
  23. The method of claim 22 in which the threshold difference value is dependent on said differences.
  24. The method of claim 23 in which the threshold difference value is determined for each of said time intervals based on differences associated with at least one subsequent and/or preceding time interval.
  25. The method of claim 24 in which the at least one subsequent and/or preceding time interval comprises a plurality of subsequent and/or preceding time intervals.
  26. The method of claim 25 in which the subsequent and/or preceding time intervals are selected using a sliding window.
  27. The method of any of claims 22 to 26 in which the threshold difference value comprises a threshold difference value for each respective frequency band.
  28. The method of claim 27 in which the difference data for each time interval comprises a count value based on the number of frequency bands in which the difference between (a) frequency content of the frequency data window associated with that time interval and (b) the frequency content of an adjacent window, exceeds that frequency band's threshold difference value.
  29. The method of claim 27 in which the difference data for each time interval comprises a count value based on the greater of: (i) the number of frequency bands in which the difference between frequency content of the frequency data window associated with that time interval and the frequency content of the preceding window exceeds that frequency band's threshold difference value; and, (ii) the number of frequency bands in which the difference between frequency content of the frequency data window associated with that time interval and the frequency content of the subsequent window exceeds that frequency band's threshold difference value.
  30. The method of claim 28 or 29 in which the difference data are based on comparing the total count value with a selected count value threshold to identify onsets.
  31. The method of claim 30 in which having a dependence upon the number of onsets comprises having a dependence upon the energy or amplitude of the audio signal data during a time interval associated with each respective identified onset.
  32. 32. The method of claim 31 in which having a dependence upon the energy or amplitude of the audio signal data comprises being based on a ranking of the onsets based on the energy or amplitude audio signal data 33. The method of claim 32 in which onsets associated with a time interval having a relatively higher signal energy or amplitude are assigned a ranking indicating a greater significance and onsets associated with a time interval having a relatively lower signal energy or amplitude are assigned a ranking indicating a relatively lower significance.34. The method of claim 34 in which each onset has one ranking for each frequency band and each onset is assigned an overall rank based on the rank of a selected number of the most significantly ranked frequency bands.35. The method of claim 34 in which having a dependence on the number of onsets comprises having a dependence upon the sum of the overall ranks of the onsets in the audio signal data.36. The method of claim 1 in which having a dependence upon frequency content of the audio signal comprises being based on a comparison of a fingerprint of the frequency content of the audio signal with reference fingerprint values.37. The method of claim 36 in which the frequency content of the audio signal comprises content of at least one segment of the audio signal, wherein the segment comprises a plurality of said time intervals and the fingerprint comprises at least one statistical moment of frequency content data of said segment in a selected frequency band.38. The method of claim 37 in which the statistical moment is selected from a list comprising: the mean, standard deviation, skewness, kurtosis and a higher order statistical moment of the distribution of signal energy or amplitude in the selected frequency band across said plurality of time intervals.39. The method of claim 38 in which the list further comprises the median.40. The method of claim 37, 38 or 39 in which the fingerprint comprises at least one statistical moment of frequency content data of said segment in a plurality of said selected frequency bands.41. The method of claim 40 in which the selected frequency bands comprise Mel Frequency Bands and the frequency content data comprises logarithms of the signal energy, power or amplitude.42. The method of claim 41 in which the frequency content data comprises the Mel Frequency Cepstral Coefficients.43. The method of any of claims 37 to 42 in which each segment is categorised based on said comparison of its fingerprint and the index data value indicates the number of segments having a selected category 44. The method of claim 43 in which the category indicates the presence of vocal or instrumental audio content in the segment.45. The method of claim 1 in which each association indicates an association between a respective one of the plurality of data files and a respective plurality of index data values comprising: one index data value having a dependence upon frequency content of the audio signal as defined in any of claims 2 to 18; one index data value having a dependence upon frequency content of the audio signal as defined in any of claims 19 to 35; and, one index data value having a dependence upon frequency content of the audio signal as defined in any of claims 36 to 44 46. 
The method of any preceding claim in which the frequency content is determined using a frequency transform to decompose the audio signal into constituent frequencies, preferably wherein the transform operation comprises one of: a Fourier transform; a fast Fourier transform; a discrete cosine transform; a wavelet transform; a Laplace transform.47. A computer implemented method comprising: providing an audio data file comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, providing an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon frequency content of the audio signal.48. The method of claim 47 in which providing the audio data file comprises transmitting the audio data file over a network in response to a request, in which the request comprises an index data value indicating frequency content of a requested audio data file.49. The method of claim 47 in which providing the audio data file comprises retrieving the audio data file from a memory of an audio player in response to a request comprising an index data value indicating frequency content of a requested audio data file.50. The method of any of claims 47 to 49 in which the index data value has the features of any of claims 2 to 49.51. An audio data file comprising: a header and audio signal data, the header comprising data to enable music playing software to decode the audio signal data to provide an audio signal; and, at least one index data value in addition to the audio signal data configured to enable the audio data file to be indexed in a database, wherein the index data value has a dependence upon frequency content of the audio signal.52. The audio data file of claim 48 in which the index data value has the features defined in any of claims I to 46.53. A database of audio data files configured according to the method of any of claims I to 46.54. A computer readable storage medium comprising a database according to claim 53.55. A music player comprising a computer readable storage medium according to claim 54, preferably in which the music player is a handheld device.56. A computer program product comprising program instructions operable to program a programmable processor to perform a method according to any of claims 1 to 46.57. A data structure for providing an audio data file according to claim 51 or 52 or a database according to claim 53.58. A network message comprising a data structure according to claim 57.59. A computing apparatus comprising: a database comprising a plurality of audio data files, each comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal, and a plurality of associations, each indicating an association between a respective one of the plurality of data files and at least one index data value to enable the data file to be indexed in the database, wherein the at least one index data value has a dependence upon frequency content of the audio signal; a processor configured to receive a request comprising at least one request index data value and to retrieve a file indicator from the database based on the at least one request index data value.60. 
The apparatus of claim 59 in which the at least index value comprises a plurality of index data values each having a dependence upon frequency content of the audio signal selected from the following list: the dependence defined in any of claims 2 to 18; the dependence defined in any of claims 19 to 35; and the dependence defined in any of claims 36 to 44.61. The apparatus of claim 59 or 60 in which the file indicator comprises indicators which correspond to a selected plurality of the audio data files.62. The apparatus of claim 61 in which the selected plurality of audio data files are selected based on the closest available matches in the database to the request index data value.63. The apparatus of claim 62 in which the selected plurality comprises a selected number of audio data files.64. The apparatus of claim 63 in which the processor is configured to set the selected number based on a received network message.65. The apparatus of any of claims 59 to 64 in which the file indicator comprises the name of an audio track associated with the file and/or a graphical representation associated with the audio track.66. The apparatus of any of claims 59 to 65 in which the processor is configured to provide the file indicator to a network for retrieval by or transmission to communications device over said network.67. The apparatus of claim 66 in which the processor is further configured to retrieve an audio data file corresponding to the index data value and/or the file indicator for transmission over the network.68. A network server comprising an apparatus according to any of claims 59 to 67.69. A computing apparatus comprising storage means for storing audio data file comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; a processor configured to determine frequency content of the audio signal data, said frequency content comprising a plurality of frequency data windows each based on frequency content of the audio signal at a time interval associated with a corresponding one of a plurality of portions of the audio signal, wherein the processor is configured to determine an index data value based on said frequency content.70. The apparatus of claim 69 in which the processor is configured to determine an index data value having the features defined in any of claims 3 to 49.Amendments to the claims have been filed as follows Claims 1. A computer implemented method comprising: providing a plurality of audio data files, each comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, configuring a database by providing a plurality of associations, each indicating an association between a respective one of the plurality of data files and an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon a centre of gravity of frequency content of the audio signal, in which the centre of gravity is determined based on a weighted average of the frequency content and the weighting of the weighted average comprises a weighting function selected to scale the amplitude or energy of the frequency content based on frequency band. r2. 
The method of claim I in which frequency content of the audio signal comprises a plurality of frequency data windows and each frequency data window is based on frequency content of the audio signal at a time interval associated with a corresponding one of a plurality of portions of the audio signal.3. The method of claim 2 in which the frequency content comprises a measure of loudness.4. The method of claim 3 in which the loudness of a band comprises a power law function of the energy or amplitude of the audio signal in that band.5. The method of claim 4 in which the power law function comprises an exponential function.6. The method of claim 4 in which the weighted average is normalised by the total loudness.7. The method of any of claims 2 to 6 in which the weighted average comprises the mean of the weighted averages of a plurality of the frequency data windows.8. The method of any preceding claim in which the frequency content comprises a plurality of data values each based on the energy or amplitude of the audio signal in a corresponding one of a plurality of frequency bands.9. The method of claim 8 in which the frequency bands are selected to correspond to the frequency response of human hearing.10. The method of claim 9 in which the frequency bands have band edges which correspond to band edges of critical bands of human hearing.11. The method of claim 9 or 10 in which the frequency bands comprise at least some of the Bark bands. r12. The method of any of claims 2 to 11 in which each of the plurality of portions of the audio signal is selected to overlap the preceding portion by half of the duration of the portion or less.13. The method of any of claims 2 to 12 further comprising scaling the audio signal in each of the plurality of portions using a data taper, such as a Hamming window.14. The method of any of claims I to 7 in which the frequency content comprises frequency transform data of the audio signal in which each frequency band comprises a plurality of frequency components of the frequency transform data.15. 
A computer implemented method comprising: providing a plurality of audio data files, each comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, configuring a database by providing a plurality of associations, each indicating an association between a respective one of the plurality of data files and an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon the number of onsets in the frequency content of the audio signal, in which the frequency content of the audio signal comprises a plurality of frequency data windows and each frequency data window is based on frequency content of the audio signal at a time interval associated with a corresponding one of a plurality of portions of the audio signal, wherein onsets are identified from difference data based on differences in frequency content between frequency data windows which exceed a threshold difference value; and in which the threshold difference value comprises a threshold difference value for each of a plurality of frequency bands of the frequency content and the difference data for each time interval comprises a count value based on the number of frequency bands in which the difference between 1" (a) frequency content of the frequency data window associated with that time interval; and C\J (b) the frequency content of an adjacent frequency data window, exceed that frequency band's threshold difference value. c\J16. The method of claim 15 in which the threshold difference value is dependent on said differences.17. The method of claim l6in which the threshold difference value is determined for each of said time intervals based on differences associated with at least one subsequent and/or preceding time interval.18. The method of claim 17 in which the at least one subsequent and/or preceding time interval comprises a plurality of subsequent and/or preceding time intervals.19. The method of claim 18 in which the subsequent and/or preceding time intervals are selected using a sliding window.20. The method of claim 16 in which the adjacent window comprises a selected one of the subsequent and preceding windows wherein the selection is based on (i) the number of frequency bands in which the difference between frequency content of the frequency data window associated with that time interval and the frequency content of the preceding window, exceed that frequency band's threshold difference value; and, (ii) the number of frequency bands in which the difference between frequency content of the frequency data window associated with that time interval and the frequency content of the subsequent window, exceed that frequency band's threshold difference value.21. The method of claim 16 or 20 in which the difference data are based on comparing the total count value with a selected count value threshold to identify onsets.22. The method of claim 21 in which having a dependence upon the number of onsets comprises having a dependence upon the energy or amplitude of the audio signal data during a time interval associated with each respective identified onset. c\J23. The method of claim 22 in which having a dependence upon the energy or amplitude of the audio signal data comprises being based on a ranking of the onsets based on the energy or amplitude audio signal data.24. 
24. The method of claim 23 in which onsets associated with a time interval having a relatively higher signal energy or amplitude are assigned a ranking indicating a greater significance and onsets associated with a time interval having a relatively lower signal energy or amplitude are assigned a ranking indicating a relatively lower significance.
25. The method of claim 24 in which each onset has one ranking for each frequency band and each onset is assigned an overall rank based on the rank of a selected number of the most significantly ranked frequency bands.
26. The method of claim 25 in which having a dependence on the number of onsets comprises having a dependence upon the sum of the overall ranks of the onsets in the audio signal data.
27. A computer implemented method comprising: providing a plurality of audio data files, each comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, configuring a database by providing a plurality of associations, each indicating an association between a respective one of the plurality of data files and an index data value to enable the data file to be indexed in the database, wherein the index data value has a dependence upon frequency content of the audio signal, the dependence comprising a comparison of a fingerprint of the frequency content of the audio signal with reference fingerprint values, and the frequency content of the audio signal comprises content of at least one segment of the audio signal, wherein the segment comprises a plurality of said time intervals and the fingerprint comprises at least one statistical moment of frequency content data of said segment in a selected frequency band, wherein the statistical moment is selected from a list comprising: standard deviation, skewness, kurtosis and a higher order statistical moment of the distribution of signal energy or amplitude in the selected frequency band across said plurality of time intervals.
28. The method of claim 27 in which the list further comprises the median.
29. The method of claim 27 or 28 in which the fingerprint comprises at least one statistical moment of frequency content data of said segment in a plurality of said selected frequency bands.
30. The method of claim 29 in which the selected frequency bands comprise Mel Frequency Bands and the frequency content data comprises logarithms of the signal energy, power or amplitude.
31. The method of claim 30 in which the frequency content data comprises the Mel Frequency Cepstral Coefficients.
32. The method of any of claims 27 to 31 in which each segment is categorised based on said comparison of its fingerprint and the index data value indicates the number of segments having a selected category.
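One possible concrete reading of the onset ranking of claims 22 to 26 above, as a sketch only; top_k and the use of a mean over the top-ranked bands are assumptions, not values or rules taken from the patent:

    import numpy as np

    def onset_rank_index(onset_band_energies, top_k=3):
        # onset_band_energies: (n_onsets, n_bands) signal energy at each
        # identified onset, per frequency band.
        e = np.asarray(onset_band_energies, dtype=float)
        # Per band, rank onsets by energy: 1 = least significant (quietest),
        # n_onsets = most significant (loudest), per claim 24.
        ranks = e.argsort(axis=0).argsort(axis=0) + 1
        # Overall rank per onset from its top_k most significantly ranked
        # bands (claim 25); averaging them is a hypothetical choice.
        overall = np.sort(ranks, axis=1)[:, -top_k:].mean(axis=1)
        # Index value: sum of overall ranks of the onsets (claim 26).
        return float(overall.sum())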
33. The method of claim 32 in which the category indicates the presence of vocal or instrumental audio content in the segment.
34. The method of claim 1 in which the association also indicates an association between the respective one of the plurality of data files and an index data value having a dependence upon frequency content of the audio signal as defined in any of claims 15 to 26; and, one index data value having a dependence upon frequency content of the audio signal as defined in any of claims 27 to 33.
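To make the fingerprint of claims 27 to 33 concrete, one possible sketch in Python: take the MFCC tracks of a segment (satisfying claims 30-31) and summarise each coefficient by statistical moments across the segment's time intervals. This assumes the librosa library for MFCC extraction; the 'vocal' reference fingerprint and distance threshold in the second function are hypothetical calibration inputs, not disclosed by the patent:

    import numpy as np
    import librosa                       # assumed available for MFCC extraction
    from scipy.stats import skew, kurtosis

    def segment_fingerprint(samples, sample_rate, n_mfcc=13):
        # MFCC tracks for the segment: shape (n_mfcc, n_time_intervals).
        mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=n_mfcc)
        # Statistical moments of each coefficient across the segment's time
        # intervals (claims 27-29); the median is the optional extra of
        # claim 28.
        return np.concatenate([mfcc.std(axis=1), skew(mfcc, axis=1),
                               kurtosis(mfcc, axis=1), np.median(mfcc, axis=1)])

    def count_matching_segments(segments, sample_rate, reference, threshold):
        # Index value of claim 32: how many segments fall within `threshold`
        # (Euclidean distance) of a reference fingerprint, e.g. a 'vocal'
        # reference for the category of claim 33.
        return sum(np.linalg.norm(segment_fingerprint(s, sample_rate)
                                  - reference) < threshold
                   for s in segments)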
35. The method of any preceding claim in which the frequency content is determined using a frequency transform to decompose the audio signal into constituent frequencies, preferably wherein the transform operation comprises one of: a Fourier transform; a fast Fourier transform; a discrete cosine transform; a wavelet transform; a Laplace transform.
36. A computer implemented method comprising: providing an audio data file comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and, providing an index data value to enable the data file to be indexed in a database, wherein the index data value has a dependence upon frequency content of the audio signal as defined in any of claims 1 to 33.
37. The method of claim 36 in which providing the audio data file comprises transmitting the audio data file over a network in response to a request, in which the request comprises an index data value indicating frequency content of a requested audio data file.
38. The method of claim 36 in which providing the audio data file comprises retrieving the audio data file from a memory of an audio player in response to a request comprising an index data value indicating frequency content of a requested audio data file.
39. The method of any of claims xx to xx in which the index data value has the features of any of claims 2 to 49.
40. An audio data file comprising: audio signal data; and, at least one index data value in addition to the audio signal data, configured to enable the audio data file to be indexed in a database, wherein the index data value has a dependence upon frequency content of the audio signal as defined in any of claims 1 to 33.
41. A database of audio data files configured according to the method of any of claims 1 to 33.
42. A computer readable storage medium comprising a database according to claim 41.
43. A music player comprising a computer readable storage medium according to claim 42, preferably in which the music player is a hand held device.
44. A computer program product comprising program instructions operable to program a programmable processor to perform a method according to any of claims 1 to 33.
45. A data structure for providing an audio data file according to claim 40 or a database according to claim 41.
46. A network message comprising a data structure according to claim 45.
47. A computing apparatus comprising: a database comprising a plurality of audio data files, each comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal, and a plurality of associations, each indicating an association between a respective one of the plurality of data files and at least one index data value to enable the data file to be indexed in the database, wherein the at least one index data value has a dependence upon frequency content of the audio signal as defined in any of claims 1 to 34; and a processor configured to receive a request comprising at least one request index data value and to retrieve a file indicator from the database based on the at least one request index data value.
48. The apparatus of claim 47 in which the file indicator comprises indicators which correspond to a selected plurality of the audio data files.
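As an illustrative sketch only of the database configuration recited in claims 1, 41 and 47, using Python's standard sqlite3 module; the table layout and the index_fn callback (any of the index computations sketched above) are assumptions for illustration, not a schema from the patent:

    import sqlite3

    def build_index(db_path, tracks, index_fn):
        # tracks: iterable of (name, samples, sample_rate) tuples;
        # index_fn computes an index data value from the decoded audio.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS tracks "
                    "(name TEXT PRIMARY KEY, index_value REAL)")
        for name, samples, sample_rate in tracks:
            # one association per file between the file and its index value
            con.execute("INSERT OR REPLACE INTO tracks VALUES (?, ?)",
                        (name, index_fn(samples, sample_rate)))
        con.commit()
        con.close()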
49. The apparatus of claim 48 in which the selected plurality of audio data files are selected based on the closest available matches in the database to the request index data value.
50. The apparatus of claim 49 in which the selected plurality comprises a selected number of audio data files.
51. The apparatus of claim 50 in which the processor is configured to set the selected number based on a received network message.
52. The apparatus of any of claims 47 to 51 in which the file indicator comprises the name of an audio track associated with the file and/or a graphical representation associated with the audio track.
53. The apparatus of any of claims 47 to 52 in which the processor is configured to provide the file indicator to a network for retrieval by or transmission to a communications device over said network.
54. The apparatus of claim 53 in which the processor is further configured to retrieve an audio data file corresponding to the index data value and/or the file indicator for transmission over the network.
55. A network server comprising an apparatus according to any of claims 47 to 52.
56. A computing apparatus comprising: storage means for storing an audio data file comprising a header and audio signal data, the header comprising data to enable play-back software to decode the audio signal data to provide an audio signal; and a processor configured to determine frequency content of the audio signal data, said frequency content comprising a plurality of frequency data windows each based on frequency content of the audio signal at a time interval associated with a corresponding one of a plurality of portions of the audio signal, wherein the processor is configured to determine an index data value having the features defined in any of claims 1 to 33 based on said frequency content.
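And a matching sketch of the closest-match retrieval of claims 47 to 51, against the illustrative schema above; again an assumption about representation, not the patent's implementation:

    import sqlite3

    def closest_tracks(db_path, request_value, limit=5):
        # Return the names (file indicators) of the `limit` tracks whose
        # stored index value is closest to the requested value (claims 49-50);
        # `limit` could be set from a network message per claim 51.
        con = sqlite3.connect(db_path)
        rows = con.execute("SELECT name FROM tracks "
                           "ORDER BY ABS(index_value - ?) LIMIT ?",
                           (request_value, limit)).fetchall()
        con.close()
        return [name for (name,) in rows]

A request carrying more than one index data value (claim 47's "at least one") could extend the ORDER BY to a weighted distance over several index columns.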
GB1102075.7A 2011-02-07 2011-02-07 Indexing media files based on frequency content Withdrawn GB2487795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1102075.7A GB2487795A (en) 2011-02-07 2011-02-07 Indexing media files based on frequency content

Publications (2)

Publication Number Publication Date
GB201102075D0 GB201102075D0 (en) 2011-03-23
GB2487795A true GB2487795A (en) 2012-08-08

Family

ID=43836314

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1102075.7A Withdrawn GB2487795A (en) 2011-02-07 2011-02-07 Indexing media files based on frequency content

Country Status (1)

Country Link
GB (1) GB2487795A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000305578A (en) * 1999-04-26 2000-11-02 Nippon Telegr & Teleph Corp <Ntt> Music database creating device, creating method, and program recording medium thereof
WO2002051063A1 (en) * 2000-12-21 2002-06-27 Digimarc Corporation Methods, apparatus and programs for generating and utilizing content signatures
EP1244093A2 (en) * 2001-03-22 2002-09-25 Matsushita Electric Industrial Co., Ltd. Sound features extracting apparatus, sound data registering apparatus, sound data retrieving apparatus and methods and programs for implementing the same
WO2004049188A1 (en) * 2002-11-28 2004-06-10 Agency For Science, Technology And Research Summarizing digital audio data
EP1760693A1 (en) * 2005-09-01 2007-03-07 Seet Internet Ventures Inc. Extraction and matching of characteristic fingerprints from audio signals
WO2008127052A1 (en) * 2007-04-17 2008-10-23 Electronics And Telecommunications Research Institute System and method for searching audio fingerprint by index information

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11289070B2 (en) 2018-03-23 2022-03-29 Rankin Labs, Llc System and method for identifying a speaker's community of origin from a sound sample
WO2020014354A1 (en) * 2018-07-10 2020-01-16 John Rankin System and method for indexing sound fragments containing speech
US11341985B2 (en) 2018-07-10 2022-05-24 Rankin Labs, Llc System and method for indexing sound fragments containing speech
US11699037B2 (en) 2020-03-09 2023-07-11 Rankin Labs, Llc Systems and methods for morpheme reflective engagement response for revision and transmission of a recording to a target individual


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)