RU2662939C1

RU2662939C1 - Method for identification of musical works

Info

Publication number: RU2662939C1
Application number: RU2017116448A
Authority: RU
Inventors: Денис Павлович Кузнецов; Максим Андреевич Петров; Ваган Арменович Саруханов
Original assignee: Общество с ограниченной ответственностью "ИСКОНА ХОЛДИНГ"
Priority date: 2017-05-12
Filing date: 2017-05-12
Publication date: 2018-07-31

Abstract

FIELD: data processing. SUBSTANCE: the invention relates to techniques for analyzing reproduced musical works and can be used to identify musical works and to monitor their authorship. Input data in digital form characterizing the reproduced musical work is received. The digital audio stream is divided into a set of fragments of fixed duration. The set of fragments is converted into a set of frequency spectra using a fast Fourier transform. The set of frequency spectra is converted into a set of identification indicators. The identification indicators characterizing the reproduced musical work are compared with those characterizing original musical works, and on the basis of this comparison it is concluded that a given musical work is being reproduced. The conversion of the set of frequency spectra into the set of identification indicators is performed using an artificial convolutional neural network. EFFECT: the technical result consists in improving identification quality through the use of features that unambiguously characterize the reproduced musical work. 1 cl, 2 dwg

Description

The invention relates to techniques for analyzing reproduced musical works and can be used to identify musical works, to monitor the authorship of musical works, and to collect the corresponding statistical data.

A method of identifying musical works is known in which input data in digital form characterizing the reproduced musical work is received, the digital audio stream is divided into a set of fragments of fixed duration, the set of fragments is converted into a set of spectrograms or frequency spectra using a fast Fourier transform, the set of frequency spectra is converted into a set of audio fingerprints, the audio fingerprints characterizing the reproduced musical work are compared with the audio fingerprints characterizing original musical works, and on the basis of this comparison it is concluded that a given musical work is being reproduced (see RF utility model patent No. 81614, IPC H04N 7/173, publ. 2006). The disadvantages of the known method include insufficient identification quality, low noise immunity, and the use of a complex identification algorithm.

The closest in technical essence to the proposed method is a method of identifying musical works in which input data in digital form characterizing the reproduced musical work is received, the digital audio stream is divided into a set of fragments of fixed duration, the set of fragments is converted into a set of spectrograms or frequency spectra using a fast Fourier transform, the set of spectrograms is converted into a set of identification indicators (audio fingerprints), the identification indicators characterizing the reproduced musical work are compared with the identification indicators characterizing original musical works, and on the basis of this comparison it is concluded that a given musical work is being reproduced (see, for example, "Shazam: music recognition algorithms, signatures, data processing", https://habrahabr.ru, or patent US 6990453). The disadvantages of this known method likewise include insufficient identification quality, low noise immunity, and the use of a complex identification algorithm.

The proposed method is aimed at solving the problem and achieving a technical result consisting in improved identification quality and the possibility of using a simplified identification algorithm, through the use of features that unambiguously characterize the reproduced musical work; in addition, the effectiveness of the method can increase over the course of its use.

This technical result is achieved in that, in a method of identifying musical works in which input data in digital form characterizing the reproduced musical work is received, the digital audio stream is divided into a set of fragments of fixed duration, the set of fragments is converted into a set of frequency spectra using a fast Fourier transform, the set of frequency spectra is converted into a set of identification indicators, the identification indicators characterizing the reproduced musical work are compared with the identification indicators characterizing original musical works, and on the basis of this comparison it is concluded that a given musical work is being reproduced, the conversion of the set of frequency spectra into the set of identification indicators is performed using an artificial convolutional neural network whose output, serving as the identification indicators characterizing the reproduced musical work, is reference vector maps; the reference vector maps characterizing the reproduced musical work are compared pairwise with the reference vector maps characterizing original musical works, the distance between the reference vector maps is determined, and when these distances agree, for some original musical work, to at least a certain threshold value, it is concluded that this musical work is being reproduced.

Converting the set of frequency spectra into a set of identification indicators using an artificial convolutional neural network, with reference vector maps obtained at the output as the identification indicators characterizing the reproduced musical work, improves identification quality and makes it possible to use a simplified identification algorithm. This is because the features used (the reference vector maps) are arrays of numbers of fixed dimension that unambiguously characterize the reproduced musical work and are robust to distortion and noise. In addition, the effectiveness of the method can increase over the course of its use, since an artificial convolutional neural network is able to form itself during operation, for example by training the network with the classical error backpropagation method (see "Convolutional neural network", Wikipedia, https://wikipedia.org/wiki).

The pairwise comparison of the reference vector maps characterizing the reproduced musical work with the reference vector maps characterizing original musical works is performed by determining the distance between the reference vector maps, for example using the classical Euclidean metric formula for the distance between vectors; when these distances agree, for some original musical work, to at least a certain threshold value, it is concluded that this musical work is being reproduced. This also improves identification quality, since the comparison is based on reference vector maps, which are arrays of numbers of fixed dimension that unambiguously characterize the reproduced musical work and are robust to distortion and noise. It also allows a database search algorithm that is simpler than the one needed for audio fingerprints.

FIG. 1 shows an example image of the spectrograms fed to the input of the artificial convolutional neural network; FIG. 2 shows an example image of the reference vector map at the output of the artificial convolutional neural network.

The sound signal of the reproduced musical work is represented as input data, or an audio stream, in digital form, which is divided into a set of fragments of fixed duration, and the set of fragments is converted into a set of frequency spectra using a fast Fourier transform (see FIG. 1). Up to this stage, the operations coincide with those of the Shazam method of identifying musical works. The set of frequency spectra is then converted into a set of identification indicators using an artificial convolutional neural network, which outputs reference vector maps as the identification indicators characterizing the reproduced musical work.
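The fragmentation and Fourier transform steps can be illustrated with a minimal sketch. The patent does not give the fragment length, the window function, or the sampling rate, so `frame_len`, the Hann window, and the 8 kHz test tone below are assumptions chosen only for illustration, and `fragment_spectra` is a hypothetical helper name.

```python
import numpy as np

def fragment_spectra(samples, frame_len=256):
    """Split a 1-D signal into fixed-length fragments and return their magnitude spectra."""
    samples = np.asarray(samples, dtype=np.float64)
    n_frames = len(samples) // frame_len              # the incomplete tail is dropped
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                    # assumption: Hann window, not from the patent
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Usage: one second of a 1 kHz tone sampled at 8 kHz (illustrative values)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
spectra = fragment_spectra(np.sin(2 * np.pi * 1000 * t))
```

Each row of `spectra` is the frequency spectrum of one fragment; stacking rows over time gives a spectrogram-like matrix of the kind shown in FIG. 1.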

The array of frequency spectra, represented explicitly as matrices of a strictly defined size of 128×128 pixels, is fed to the input of the convolutional neural network (deep neural network, DNN). Accordingly, the size of the input layer of the neural network is 128×128×1.

The first hidden layer of the neural network consists of 32 different convolutional filters of size 3×3×1. The size of the convolutional layer is therefore 32×3×3×1. Its output is 32 maps of size 64×64. The second hidden layer performs max pooling over the outputs of the first layer. For each region of size 3×3 the maximum element is selected, and the region is moved with a step of 2. Thus, the size of this pooling layer is 32×3×3×32, and its output is maps of size 32×32.
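The map sizes quoted above can be checked with the standard output-size formula for a sliding window. The text states the pooling step of 2 but does not state the stride of the first convolution or any padding; a stride of 2 for the convolution and a padding of 1 for both layers are assumptions that happen to reproduce the stated 64×64 and 32×32 sizes, and `output_size` is a hypothetical helper.

```python
def output_size(input_size, kernel, stride, padding):
    """Spatial size after a convolution or pooling window slides over the input."""
    return (input_size + 2 * padding - kernel) // stride + 1

# Assumptions (not in the patent text): convolution stride 2, padding 1 in both layers.
conv1 = output_size(128, kernel=3, stride=2, padding=1)    # first hidden layer
pool1 = output_size(conv1, kernel=3, stride=2, padding=1)  # second hidden layer, step 2 as stated
conv1_weights = 32 * 3 * 3 * 1                             # 32 filters of size 3x3x1
print(conv1, pool1, conv1_weights)
```

Other stride and padding combinations could give the same sizes, so this only shows that the stated dimensions are mutually consistent.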

Next comes another convolutional layer, the third hidden layer, consisting of 64 filters of size 3×3. The physical meaning of this layer is the extraction of low-level features for each spatial region of the spectrogram. Features here means edges and textures.

The next 3 layers, of size 16×16×64, 8×8×64, and 4×4×32 respectively, successively reduce the dimensionality of the data (the primitive features), combining them into connected groups that characterize the shapes and features of the frequency spectra. The output of the last layer is 32 maps of size 4×4.

The output of this layer is treated as a raw representation of the unique features of the musical work: frequency characteristics, the presence of vocals, the set of instruments, and so on. However, these values cannot be directly and unambiguously linked to actual dimensions in the image. The layer is trained so that each feature correlates as little as possible with any other. The output vector serves as an identifier vector for the content represented in the frequency-spectrum image and is used to identify it. The reference vector map at the output of the artificial convolutional neural network is shown in FIG. 2. The database already contains reference vector maps characterizing original musical works, obtained in advance using the same artificial convolutional neural network. As our studies have shown, for all convolutional layers of this neural network it is most advisable to use the ELU function as the activation function (Exponential Linear Unit; the function itself is known, see http://datareview.info/article/obuchaem-). The reference vectors characterizing the reproduced and the original musical works are compared, for example, using the classical Euclidean metric formula for the distance between vectors. When these distances agree, for some original musical work, to at least a certain threshold value (usually at least 0.75), it is concluded that this musical work is being reproduced.
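The comparison step can be sketched as follows. The text names the Euclidean metric and a threshold of 0.75 but does not say how per-fragment distances are turned into a single agreement value, so the interpretation below is an assumption: the score is the fraction of aligned fragment pairs whose distance is within a hypothetical `tolerance`, and a match is declared when that fraction reaches the threshold. The vector length of 512 corresponds to 32 maps of size 4×4; the random data is synthetic.

```python
import numpy as np

def match_score(query_maps, reference_maps, tolerance=0.5):
    """Fraction of aligned fragment pairs whose Euclidean distance is within `tolerance`."""
    q = np.asarray(query_maps, dtype=np.float64).reshape(len(query_maps), -1)
    r = np.asarray(reference_maps, dtype=np.float64).reshape(len(reference_maps), -1)
    distances = np.linalg.norm(q - r, axis=1)         # one distance per fragment pair
    return float(np.mean(distances <= tolerance))

def is_same_work(query_maps, reference_maps, threshold=0.75, tolerance=0.5):
    """Assumed decision rule: enough fragment pairs must be close to each other."""
    return match_score(query_maps, reference_maps, tolerance) >= threshold

# Synthetic example: 8 fragments, each described by 32 * 4 * 4 = 512 values
rng = np.random.default_rng(0)
original = rng.normal(size=(8, 512))
noisy = original + 0.001 * rng.normal(size=original.shape)   # slightly distorted copy
other = rng.normal(size=(8, 512))                             # unrelated work
```

Here `is_same_work(noisy, original)` is true and `is_same_work(other, original)` is false. The sketch assumes both recordings yield the same number of fragments in the same order; a practical search over a database would also need to handle time offsets.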

Thus, the claimed method of identifying musical works improves the quality and accuracy of recognition by applying a neural network to the array of frequency spectra obtained from processing the musical work. The network uses all the information available in the array of spectra, relies on features that unambiguously characterize the reproduced musical work, and can be trained so that its effectiveness increases over the course of its use.

Claims

1. A method of identifying musical works in which input data in digital form characterizing the reproduced musical work is received, the digital audio stream is divided into a set of fragments of fixed duration, the set of fragments is converted into a set of frequency spectra using a fast Fourier transform, the set of frequency spectra is converted into a set of identification indicators, the identification indicators characterizing the reproduced musical work are compared with the identification indicators characterizing original musical works, and on the basis of this comparison it is concluded that a given musical work is being reproduced, characterized in that the conversion of the set of frequency spectra into the set of identification indicators is performed using an artificial convolutional neural network whose output, serving as the identification indicators characterizing the reproduced musical work, is reference vector maps; the reference vector maps characterizing the reproduced musical work are compared pairwise with the reference vector maps characterizing original musical works, the distance between the reference vector maps is determined, and when these distances agree, for some original musical work, to at least a certain threshold value, it is concluded that this musical work is being reproduced.

2. A method for identifying musical works according to claim 1, characterized in that all convolutional layers use the ELU function as an activation function.