CN112182281A - Audio recommendation method and device and storage medium
- Publication number
- CN112182281A (application number CN201910604904.8A)
- Authority
- CN
- China
- Prior art keywords
- image, information, text, audio, feature
- Legal status
- Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F16/635 — Information retrieval of audio data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/683 — Information retrieval of audio data; retrieval characterised by using metadata automatically derived from the content
Abstract
The application discloses an audio recommendation method, an audio recommendation device and a storage medium, relates to the field of computer technologies, and aims to improve the accuracy of audio recommendation. In the method, after image feature information that describes the semantic content of the image information to be scored is determined, the feature similarity between the image feature information and text feature information that describes the semantic features of each audio is calculated, and audio that can serve as background music for the image information to be scored is recommended according to the calculated feature similarity. In this way, by matching the semantics of the image information to be scored against the semantics of the audio, audio with the same or similar semantics is recommended to the user, making the audio recommendation more accurate; at the same time, the complexity of selecting music from many audios irrelevant to the image information to be scored is reduced, saving the user time when choosing a soundtrack.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to an audio recommendation method, an audio recommendation device, and a storage medium.
Background
With the rise of various instant messaging applications and information sharing platforms, users can share image information such as pictures and videos on these platforms. When sharing image information, a user usually adds background music to it.
However, in the prior art, when adding music to shared image information, the user needs to select the audio from an audio library. Most of the audio automatically recommended to the user in the audio library is currently popular music, which does not necessarily match the user's actual needs, so the accuracy of the recommended audio still needs to be improved.
Disclosure of Invention
The embodiments of the application provide an audio recommendation method, an audio recommendation device and a storage medium, which are used for improving the accuracy of audio recommendation.
In a first aspect, a method for audio recommendation is provided, including:
acquiring image information to be scored, wherein the image information to be scored includes at least one frame of image;
performing feature extraction on the image information to be scored to obtain image feature information capable of describing the semantic content of the image information to be scored;
determining the feature similarity between the image feature information and each of a plurality of pieces of text feature information, wherein the text feature information corresponds to audio and is used for describing semantic features of the audio;
and selecting a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
In one embodiment, if the image information to be scored includes multiple frames of images, performing feature extraction on the image information to be scored to obtain image feature information capable of describing its semantic content includes:
performing feature extraction on each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
determining an average feature vector of the image information to be scored according to the feature vector of each frame of image;
performing convolution calculation between the average feature vector and each of a preset number of first convolution kernels, and adding corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not all identical.
In one embodiment, a text corresponding to each audio is stored in advance, and the method further includes:
acquiring the text feature information of each audio according to the following method:
performing word segmentation on the text corresponding to an audio and then removing stop words, to obtain the segmented words of the text corresponding to the audio;
concatenating the segmented words according to their order in the text corresponding to the audio to obtain a word sequence;
performing word embedding on each segmented word in the word sequence to obtain, for each segmented word, a word vector reflecting its contextual relations, which together serve as the text vector of the audio;
performing convolution calculation between the text vector and each of a preset number of second convolution kernels, and adding corresponding bias parameters, to obtain a feature vector of a specified dimension as the text feature information of the audio; wherein the convolution parameters of different second convolution kernels are not all identical, and the number of first convolution kernels is the same as the number of second convolution kernels.
In one embodiment, determining the feature similarity between the image feature information and each of the plurality of pieces of text feature information includes:
determining the cosine similarity between the image feature information and each piece of text feature information.
In one embodiment, the convolution parameters and bias parameters of the first convolution kernels and the second convolution kernels are obtained by the following training:
obtaining training samples, wherein each training sample includes image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in the training sample;
determining the feature similarity between the image feature information and the text feature information in each training sample;
determining the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and propagating the error to the first convolution kernels and the second convolution kernels through a back propagation algorithm, so as to update the convolution parameters in the first convolution kernels and the second convolution kernels and the corresponding bias parameters, until convolution parameters and bias parameters meeting specified conditions are obtained.
In a second aspect, an apparatus for audio recommendation is provided, including:
a first acquisition module, configured to acquire image information to be scored, wherein the image information to be scored includes at least one frame of image;
a feature extraction module, configured to perform feature extraction on the image information to be scored to obtain image feature information capable of describing the semantic content of the image information to be scored;
a first similarity determining module, configured to determine the feature similarity between the image feature information and each of a plurality of pieces of text feature information, wherein the text feature information corresponds to audio;
and a recommending module, configured to select a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
In one embodiment, if the image information to be scored includes multiple frames of images, the feature extraction module includes:
a feature extraction unit, configured to perform feature extraction on each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
a feature averaging unit, configured to determine an average feature vector of the image information to be scored according to the feature vector of each frame of image;
and a calculation unit, configured to perform convolution calculation between the average feature vector and each of a preset number of first convolution kernels and add corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not all identical.
In one embodiment, a text corresponding to each audio is stored in advance, and the apparatus further acquires the text feature information of each audio through the following modules:
a processing module, configured to perform word segmentation on the text corresponding to the audio and then remove stop words, to obtain the segmented words of the text corresponding to the audio;
a concatenation module, configured to concatenate the segmented words according to their order in the text corresponding to the audio to obtain a word sequence;
a word embedding module, configured to perform word embedding on each segmented word in the word sequence to obtain, for each segmented word, a word vector reflecting its contextual relations, which together serve as the text vector of the audio;
and a calculation module, configured to perform convolution calculation between the text vector and each of a preset number of second convolution kernels and add corresponding bias parameters, to obtain a feature vector of a specified dimension as the text feature information of the audio; wherein the convolution parameters of different second convolution kernels are not all identical, and the number of first convolution kernels is the same as the number of second convolution kernels.
In one embodiment, the first similarity determining module is specifically configured to determine the cosine similarity between the image feature information and each piece of text feature information.
In one embodiment, the convolution parameters and bias parameters of the first convolution kernels and the second convolution kernels are obtained by training through the following modules:
a second acquisition module, configured to obtain training samples, wherein each training sample includes image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in the training sample;
a second similarity determining module, configured to determine the feature similarity between the image feature information and the text feature information in each training sample;
an error determining module, configured to determine the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and a parameter updating module, configured to propagate the error to the first convolution kernels and the second convolution kernels through a back propagation algorithm, so as to update the convolution parameters in the first convolution kernels and the second convolution kernels and the corresponding bias parameters, until convolution parameters and bias parameters meeting specified conditions are obtained.
In a third aspect, a computing device is provided, comprising at least one processing unit, and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of any of the above-described methods of audio recommendation.
In one embodiment, the computing device may be a server or a terminal device.
In a fourth aspect, a computer-readable medium is provided, storing a computer program executable by a terminal device; when the program runs on the terminal device, it causes the terminal device to perform the steps of any of the above methods of audio recommendation.
According to the audio recommendation method, the audio recommendation device and the storage medium provided above, after image feature information that describes the semantic content of the image information to be scored is determined, the feature similarity between the image feature information and text feature information that describes the semantic features of each audio is calculated, and audio that can serve as background music for the image information to be scored is recommended according to the calculated feature similarity. In this way, by matching the semantics of the image information to be scored against the semantics of the audio, audio with the same or similar semantics is recommended to the user, making the audio recommendation more accurate. At the same time, the recommended audio is already relevant to the image information the user intends to publish, so the user does not need to filter out irrelevant audio manually, which reduces the operation cost of adding music and saves the user's time.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic view of an application scenario of an audio recommendation method in an embodiment of the present application;
FIG. 2 is a flowchart of the training of a prediction model in an embodiment of the present application;
FIG. 3 is a schematic flowchart of acquiring image feature information in an embodiment of the present application;
FIG. 4 is a schematic flowchart of acquiring text feature information in an embodiment of the present application;
FIG. 5 is a schematic flowchart of prediction model training in an embodiment of the present application;
FIG. 6 is a schematic flowchart of audio recommendation in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for audio recommendation in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a terminal device in an embodiment of the present application.
Detailed Description
In order to solve the problems in the prior art that, when a user selects music for image information, manually selecting audio imposes a high operation cost and the narrow range of candidate audio makes the selected audio types monotonous, the embodiments of the application provide an audio recommendation method, an audio recommendation device and a storage medium. In order to better understand the technical solution provided by the embodiments of the present application, the basic principle of the solution is briefly described as follows:
Taking the selection of background music for short videos as an example, the currently pushed music is popular music with no relation to the short video the user intends to publish, and the user has to audition the music in the audio library one by one from a music list. Since the ordering of audio in the music list is unrelated to the short video to be published, the user may still not find suitable background music after auditioning many audios. In view of this, the embodiments of the present application provide an audio recommendation method, an apparatus, and a storage medium. In the method, after image feature information that describes the semantic content of the image information to be scored is determined, the feature similarity between the image feature information and text feature information that describes the semantic features of each audio is calculated, and audio that can serve as background music for the image information to be scored is recommended according to the calculated feature similarity. In this way, by matching the semantics of the image information to be scored against the semantics of the audio, audio with the same or similar semantics is recommended to the user, making the audio recommendation more accurate. At the same time, the complexity of the user's operations when adding music is reduced: the user no longer needs to filter irrelevant audio manually, which saves operation time and reduces the processing resources consumed in handling user operations on irrelevant audio.
For the convenience of understanding, the technical solutions provided in the present application are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an application scenario of an audio recommendation method in an embodiment of the present application. The scenario includes a terminal device 101 of a user 100 and a server 102.
The user 100 wants to add music to a short video on the terminal device 101, and the short video to be scored can be sent, through an instant messaging tool, an information sharing tool, or a similar application on the terminal device 101, to the server 102 that publishes short videos.
The server 102 performs feature extraction on the received short video to obtain the image feature information of the short video; performs feature similarity calculation between the obtained image feature information and the text feature information pre-stored in a database; selects a specified number of audios according to the calculated feature similarity; and sends the audio identifiers to the instant messaging tool, information sharing tool, or similar application on the terminal device 101.
The application on the terminal device 101 determines the recommended audios according to the audio identifiers and displays them on the operation interface of the terminal device 101.
In one embodiment, the terminal device 101 may be a mobile phone, a tablet, a personal computer, a virtual machine, or a terminal device simulated by an emulator.
In one embodiment, the communication between the terminal device 101 and the server 102 is wireless communication, Ethernet communication, or the like.
In one embodiment, the above operations may also be performed by an application such as an instant messaging tool or an information sharing tool on the terminal device 101. That is, the processing unit of such an application on the terminal device 101 performs feature extraction on the short video to obtain the image feature information of the short video, and sends a request for obtaining text feature information to the server 102. The server 102 returns the text feature information to the application on the terminal device 101 according to the received request.
According to the received text feature information, the application on the terminal device 101 calculates the feature similarity between the image feature information and the text feature information, selects a specified number of audios as recommended audios according to the calculated feature similarity, and displays the recommended audios on the operation interface of the terminal device 101.
In one embodiment, in order to improve the efficiency of audio recommendation, the feature extraction on the short video may also be completed by the application on the terminal device 101, with the extracted image feature information sent to the server 102, which then performs the feature similarity calculation.
Therefore, in the embodiment of the present application, the scheme may be applied to the terminal device 101, the server 102, or a combination of the terminal device 101 and the server 102, which is not limited in the present application.
In the embodiment of the application, the method for automatically recommending audio to the user can be divided into two parts: a training part and a prediction part. The training part trains the models that extract image feature information and text feature information, and the prediction part predicts which audios are related to the image information to be scored. The two parts are explained separately below.
I. The training part:
In this embodiment of the present application, in order to automatically recommend audio to the user, a prediction model needs to be established first and trained on training samples so that it can predict suitable audio for image information. As shown in fig. 2, the training specifically includes the following steps:
step 201: obtaining a training sample; each training sample comprises image characteristic information of sample image information and text characteristic information of a sample text, and the label of the training sample is the subjective similarity between the image characteristic information and the text characteristic information in the training sample.
Step 202: and determining the feature similarity of the image feature information and the text feature information in each training sample.
The image feature information is used to describe the semantic content of the image information. For example: if the image information shows a snowing scene, the image feature information extracted from it indicates snowing.
If the audio has lyrics, the lyrics are used as the text corresponding to the audio; if the audio has no lyrics, description information describing the semantic content the audio conveys is used as the text corresponding to the audio. The description information can be determined through human experience and musical knowledge. For example: if a certain audio has no lyrics and expresses homesickness, the description information of the audio is a sentence related to homesickness.
The text feature information is used to describe the semantic content of the audio, obtained from the lyrics of the audio; if the audio has no lyrics, it is obtained from the description information of the audio. For example: if the lyrics of the audio are related to snow (e.g. a lyric such as "I love you, snow of the north"), the text feature information indicates snowing.
In the embodiment of the present application, the subjective similarity may be determined through manual experience and knowledge. For example: if the semantic content described by the image feature information of the sample image information is a snow scene, and the semantic content described by the text feature information of the sample text is related to snow (such as snowing, snow watching and the like), determining the subjective similarity as 1; if the semantic content described by the image feature information of the sample image information is a scene of snowing and the semantic content described by the text feature information of the sample text is not related to snow (such as raining, thunder strike and the like), the subjective similarity is determined to be 0. That is, the value range of the subjective similarity can be between 0 and 1, and a larger value indicates a higher subjective similarity, that is, the text and the image information are more relevant. Certainly, for simplicity, the subjective similarity may also have two values, i.e., 0 and 1, where 0 represents no correlation and 1 represents correlation.
Of course, in the embodiment of the present application, in addition to dividing the subjective similarity into two categories (1 or 0), the subjective similarity may be further classified in more detail, for example, the subjective similarity is divided into 3 categories, which are respectively irrelevant (the subjective similarity is determined to be 0), relevant (the subjective similarity is determined to be 0.5), and very relevant (the subjective similarity is determined to be 1). Therefore, through more detailed classification, the correlation degree between the image information and the audio can be obtained, so that the audio which is more in line with the image information is recommended to the user, and the audio recommendation accuracy is improved.
In the embodiment of the application, the cosine value of the included angle between the image characteristic information and the text characteristic information can be calculated, so that the cosine similarity of the image characteristic information and the text characteristic information can be determined.
Step 203: determining the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity.
Step 204: propagating the error to the first convolution kernels and the second convolution kernels through a back propagation algorithm, so as to update the convolution parameters in the first convolution kernels and the second convolution kernels and the corresponding bias parameters, until convolution parameters and bias parameters meeting specified conditions are obtained.
The first convolution kernels are the convolution kernels used for obtaining the image feature information, and the second convolution kernels are the convolution kernels used for obtaining the text feature information.
The back propagation algorithm, also called the BP algorithm, serves to reduce the error: the convolution parameters in the first and second convolution kernels and the corresponding bias parameters are updated so that the error becomes smaller and the model meets the requirements.
Therefore, after the prediction model is trained, it can better interpret image information, find the audio that fits the image information, and recommend it to the user, making the audio recommendation more accurate. At the same time, the user's operation cost and time when adding music are reduced. In addition, since the audio selected through the prediction model is chosen from all the audios in the audio library, the selected audio types are more diverse than automatically recommended popular music, increasing the variety of the selected audio.
As described above, in the embodiment of the present application, the training samples include image feature information and text feature information, and how to obtain the image feature information and the text feature information is described in detail below.
(1) Image feature information:
In the embodiment of the application, to acquire image feature information, image information needs to be acquired first; the image information may be video information, picture information, or image information composed of multiple pictures. After the image information is acquired, feature extraction is performed on it to obtain the image feature information, which may be implemented as steps A1-A3:
Step A1: performing feature extraction on each frame of image in the sample image information based on a residual network model to obtain the feature vector of each frame of image.
Here, the sample image information includes multiple frames of images.
In the embodiment of the application, a residual network (ResNet) model is used to perform feature extraction on the images. The residual network model is a convolutional neural network (CNN) model that is easy to optimize and whose accuracy can be improved by considerably increasing its depth. Its core idea is to overcome the side effect (the degradation problem) caused by increasing depth in conventional convolutional neural network models, so that network performance can be improved simply by increasing network depth.
In addition, in the embodiment of the application, other convolutional neural network models may be used for feature extraction on the image information, as may network models such as a recurrent neural network (RNN) or a long short-term memory (LSTM) network. Any model that can extract the semantic features of image information is applicable to this embodiment.
Step A2: determining the average feature vector of the sample image information according to the feature vector of each frame of image.
If the sample image information only has one picture, the average feature vector of the sample image information can be obtained by directly extracting the features of the picture.
In one embodiment, the feature vectors for each frame of image may be added and averaged to obtain an average feature vector for the sample image information.
In one embodiment, the feature vectors of each frame of image may be weighted and summed, and then averaged to obtain the average feature vector of the sample image information.
Step A3: performing convolution calculation between the average feature vector and each of a preset number of first convolution kernels, and adding corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not all identical.
The convolution parameters and the bias parameters of the first convolution kernel can be corrected by the error obtained in step 203, so that the obtained image feature information more conforms to the semantic content described by the sample image.
The number of the first convolution kernels can be set according to actual conditions, such as 512 kernels, 1000 kernels, and the like. If the number of the first convolution kernels is 512, a 512-dimensional feature vector is obtained.
Therefore, the image information is subjected to feature extraction through the neural network model, the semantic content of the image information can be more accurately described by the obtained image feature information, and audio which meets requirements can be recommended for the image information.
Fig. 3 is a schematic flowchart of acquiring image feature information. After each frame of image of the image information is obtained (an image frame represents one frame in the image information), feature extraction is performed on each frame through the residual network model 301 to obtain the feature vector of each frame; the feature vectors of all frames are averaged to obtain the average feature vector of the image information; the average feature vector is then input into the fully-connected layer 302 for calculation (the fully-connected layer implements step A3: in it, the average feature vector of the image information is convolved with each first convolution kernel of the fully-connected layer and the corresponding bias parameters are added), yielding the image feature information of the image information.
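As a non-authoritative sketch of the image branch in fig. 3: the patent only specifies a residual network followed by averaging and a fully-connected projection, so the choice of ResNet-50, the torchvision weights, the 512-dimensional output (from the example above), and all names are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageBranch(nn.Module):
    """Fig. 3 sketch: per-frame ResNet features -> average -> fully-connected layer 302."""

    def __init__(self, out_dim: int = 512):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head, keeping the 2048-d pooled feature per frame.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # The "preset number of first convolution kernels" plus bias parameters,
        # realised here as one linear layer (equivalent to out_dim 1x1 convolutions).
        self.fc = nn.Linear(2048, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, H, W) for one piece of image information
        feats = self.backbone(frames).flatten(1)  # (num_frames, 2048)
        avg_feat = feats.mean(dim=0)              # average feature vector (step A2)
        return self.fc(avg_feat)                  # image feature information (step A3)

# Usage: eight 224x224 frames sampled from a video
image_feature = ImageBranch()(torch.rand(8, 3, 224, 224))  # 512-dimensional
```

A single linear layer is used because convolving a pooled feature with N 1×1 kernels and adding N biases is the same computation as one N-output linear projection.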
(2) Text feature information:
In the embodiment of the present application, to acquire text feature information, the text corresponding to the audio needs to be acquired first. After the text corresponding to the audio is obtained, feature extraction is performed on the text to obtain the text feature information of the audio, which may be implemented as steps B1-B4:
Step B1: performing word segmentation on the text corresponding to the audio, and then removing stop words, to obtain the segmented words of the text corresponding to the audio.
For example, if the text is "I would rather be an ordinary person", word segmentation yields "I / would rather / be / one / ordinary / person"; if "would rather" and "one" are stop words, removing them from the segmentation result yields the segmented words "I / be / ordinary / person".
Step B2: concatenating the segmented words according to their order in the text corresponding to the audio to obtain a word sequence.
The purpose of concatenating in textual order is to preserve the contextual relations within the text. Continuing the example above, the word sequence obtained by concatenation in textual order is "I be ordinary person".
Step B3: performing word embedding on each segmented word in the word sequence to obtain, for each segmented word, a word vector that reflects its contextual relations; these serve as the text vector of the audio.
Word embedding is a general term for language-modelling and representation-learning techniques in natural language processing (NLP). Conceptually, it embeds a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, mapping each word or phrase to a vector over the real numbers; word embedding may be performed with word2vec.
In the embodiment of the application, after word embedding is performed on each segmented word, the word vector of each segmented word is obtained, and feature extraction is then performed on these word vectors to obtain the text vector of the text.
Step B4: performing convolution calculation between the text vector and each of a preset number of second convolution kernels, and adding corresponding bias parameters, to obtain a feature vector of a specified dimension as the text feature information of the audio; wherein the convolution parameters of different second convolution kernels are not all identical, and the number of first convolution kernels is the same as the number of second convolution kernels.
The convolution parameters and bias parameters of the second convolution kernels can be corrected by the error obtained in step 203, so that the obtained text feature information better matches the semantic content of the audio's text. The number of second convolution kernels can also be set according to actual conditions, and the dimension of the obtained feature vector equals the number of second convolution kernels.
The text feature information corresponds to the audio and is used to describe the semantic features of the audio. The correspondence between text feature information and audio may be one-to-one or one-to-many. For example: if two audios without lyrics both express homesickness, their corresponding texts may be the same text, and therefore their text feature information is also the same.
Therefore, the text corresponding to the audio is subjected to feature extraction through the word vector model, so that the semantic content of the text described by the obtained text feature information is more accurate, and the audio which meets the requirements can be recommended for the image information.
In one embodiment, the text of an audio with lyrics covers richer content than the text of an audio without lyrics, so the text feature information extracted from lyrics is more accurate and describes the information expressed by the audio more comprehensively.
Fig. 4 is a schematic flowchart of acquiring text feature information. After the audio text is obtained, word segmentation and stop-word removal are performed on it to obtain the segmented words of the audio text; each segmented word is input into the word vector model 401 for calculation (the word vector model implements step B3, i.e. the word embedding of each segmented word), yielding the word vector of each segmented word; feature extraction and dimension reduction are applied to the word vectors to obtain the feature vector of the audio text; the feature vector of the audio text is then input into the fully-connected layer 402 (which implements step B4: in it, the feature vector of the text is convolved with each second convolution kernel of the fully-connected layer and the corresponding bias parameters are added), yielding the text feature information of the audio text.
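A corresponding sketch of the text branch in fig. 4, again with loudly labelled assumptions: jieba for word segmentation, a pretrained gensim word2vec model for word embedding, simple mean pooling as the "feature extraction and dimension reduction" step, and illustrative names throughout:

```python
import numpy as np
import torch
import torch.nn as nn
import jieba                           # word segmentation (step B1); an assumed tool
from gensim.models import KeyedVectors

STOP_WORDS = {"would rather", "one"}   # illustrative stop-word list

def text_vector(text: str, w2v: KeyedVectors) -> torch.Tensor:
    """Steps B1-B3: segment, remove stop words, embed, pool into one text vector."""
    words = [w for w in jieba.cut(text) if w not in STOP_WORDS]
    # Word vectors of the segmented words; assumes at least one word is in vocabulary.
    vecs = [w2v[w] for w in words if w in w2v]
    return torch.from_numpy(np.mean(vecs, axis=0))  # mean pooling is an assumption

class TextBranch(nn.Module):
    """Step B4 / fully-connected layer 402: project the text vector."""

    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        # "Second convolution kernels" plus bias, as a single linear projection.
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, text_vec: torch.Tensor) -> torch.Tensor:
        return self.fc(text_vec)  # text feature information of the audio
```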
Note that the fully-connected layer 302 in fig. 3 and the fully-connected layer 402 in fig. 4 are two separate fully-connected layers.
After the image feature information and the text feature information are obtained, the image feature information and the text feature information of the training samples are input into the prediction model for training, until a prediction model meeting the requirements is obtained. Fig. 5 is a schematic flowchart of the prediction model training. The image information is processed through the flow of fig. 3 to obtain image feature information, and the text corresponding to the audio is processed through the flow of fig. 4 to obtain text feature information. Similarity calculation is performed on the image feature information and the text feature information to obtain the feature similarity; the feature similarity is compared with the subjective similarity of the training sample to determine the error between them. Once the error is determined, it is checked against a preset error: if the error is not smaller than the preset error, the error is propagated into the two fully-connected layers (i.e. the fully-connected layer 302 and the fully-connected layer 402) through the back propagation algorithm to update their parameters, yielding new image feature information and new text feature information; if the error is smaller than the preset error, the training of the prediction model is considered complete and stops.
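The training loop of fig. 5 might then look like the sketch below. The patent specifies only a similarity error back-propagated into the two fully-connected layers; the squared-error loss, the SGD optimiser, the dimensions, and the function names are assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(image_proj, text_proj, avg_img_vec, text_vec, subjective_sim, optimizer):
    """One update of fully-connected layers 302 (image_proj) and 402 (text_proj)."""
    img_feat = image_proj(avg_img_vec)   # image feature information
    txt_feat = text_proj(text_vec)       # text feature information
    sim = F.cosine_similarity(img_feat.unsqueeze(0), txt_feat.unsqueeze(0)).squeeze()
    loss = (sim - subjective_sim) ** 2   # error vs. the human-labelled similarity
    optimizer.zero_grad()
    loss.backward()                      # back propagation into both layers
    optimizer.step()
    return loss.item()

# Usage with hypothetical dimensions (2048-d image vector, 300-d text vector):
image_proj = torch.nn.Linear(2048, 512)
text_proj = torch.nn.Linear(300, 512)
optimizer = torch.optim.SGD(
    list(image_proj.parameters()) + list(text_proj.parameters()), lr=0.01)
loss = train_step(image_proj, text_proj, torch.rand(2048), torch.rand(300),
                  torch.tensor(1.0), optimizer)
```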
II. The prediction part:
In the embodiment of the present application, after the prediction model is trained, audio recommendation may be performed for image information to be scored through the prediction model. As shown in fig. 6, this specifically includes the following steps:
Step 601: acquiring image information to be scored, wherein the image information to be scored includes at least one frame of image.
Step 602: performing feature extraction on the image information to be scored to obtain image feature information capable of describing the semantic content of the image information to be scored.
Step 603: determining the feature similarity between the image feature information and each of a plurality of pieces of text feature information, wherein the text feature information corresponds to audio and is used for describing semantic features of the audio.
The text feature information may be obtained by performing feature extraction on the texts of the audios in the audio library at the time audio is recommended for the image information to be scored. Alternatively, the features of the texts corresponding to the audios in the audio library may be extracted in advance and stored in a database, so that when text feature information is needed it can be retrieved directly from the database.
Step 604: selecting a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
According to the magnitude of the feature similarity, the specified number of audios may be selected in descending order of similarity as the recommended audios, or may be selected at random from the audios that meet the requirement, i.e. those whose feature similarity is greater than a preset similarity, as in the sketch below.
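Putting the prediction part (steps 601-604) together under the same assumptions: text feature information is precomputed and stored as a matrix, and the top-k audios by cosine similarity are returned (the helper names, and choosing top-k rather than random selection above a threshold, are illustrative):

```python
import torch
import torch.nn.functional as F

def recommend_audio(image_feature: torch.Tensor,
                    text_features: torch.Tensor,  # (num_audios, dim), precomputed
                    audio_ids: list,
                    k: int = 10) -> list:
    """Steps 603-604: rank all audios by feature similarity, return the top k."""
    sims = F.cosine_similarity(image_feature.unsqueeze(0), text_features, dim=1)
    top = torch.topk(sims, k=min(k, len(audio_ids))).indices
    return [audio_ids[i] for i in top.tolist()]

# Usage: 1000 stored audios with 512-d text feature information
recommended = recommend_audio(torch.rand(512), torch.rand(1000, 512),
                              [f"audio_{i}" for i in range(1000)], k=5)
```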
Therefore, through intelligent matching between the image feature information of the image information to be scored and the text feature information describing the semantic features of the audio, the operation cost of the user when adding music can be reduced, and the variety of the selected audio can be increased.
In the embodiment of the application, in actual operation the prediction model does not run the part that acquires text feature information; instead, feature extraction is performed on the texts of the audios in advance, and the resulting text feature information is stored. When the prediction model needs the text feature information, it can be obtained directly from storage.
Based on the same inventive concept, the embodiment of the application also provides an audio recommendation device. As shown in fig. 7, the apparatus includes:
a first obtaining module 701, configured to obtain image information to be scored, wherein the image information to be scored includes at least one frame of image;
a feature extraction module 702, configured to perform feature extraction on the image information to be scored to obtain image feature information capable of describing the semantic content of the image information to be scored;
a first similarity determining module 703, configured to determine the feature similarity between the image feature information and each of a plurality of pieces of text feature information, wherein the text feature information corresponds to audio;
and a recommending module 704, configured to select a specified number of audios from the plurality of audios as recommended audios for the image information to be scored according to the feature similarity.
Further, if the image information to be scored includes multiple frames of images, the feature extraction module 702 includes:
a feature extraction unit, configured to perform feature extraction on each frame of image in the image information to be scored based on a residual network model to obtain a feature vector of each frame of image;
a feature averaging unit, configured to determine an average feature vector of the image information to be scored according to the feature vector of each frame of image;
and a calculation unit, configured to perform convolution calculation between the average feature vector and each of a preset number of first convolution kernels and add corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not all identical.
Further, a text corresponding to each audio is stored in advance, and the apparatus further acquires the text feature information of each audio through the following modules:
a processing module, configured to perform word segmentation on the text corresponding to the audio and then remove stop words, to obtain the segmented words of the text corresponding to the audio;
a concatenation module, configured to concatenate the segmented words according to their order in the text corresponding to the audio to obtain a word sequence;
a word embedding module, configured to perform word embedding on each segmented word in the word sequence to obtain, for each segmented word, a word vector reflecting its contextual relations, which together serve as the text vector of the audio;
and a calculation module, configured to perform convolution calculation between the text vector and each of a preset number of second convolution kernels and add corresponding bias parameters, to obtain a feature vector of a specified dimension as the text feature information of the audio; wherein the convolution parameters of different second convolution kernels are not all identical, and the number of first convolution kernels is the same as the number of second convolution kernels.
Further, the first similarity determining module 703 is specifically configured to determine the cosine similarity between the image feature information and each piece of text feature information.
Further, the convolution parameters and bias parameters of the first convolution kernels and the second convolution kernels are obtained by training through the following modules:
a second acquisition module, configured to obtain training samples, wherein each training sample includes image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in the training sample;
a second similarity determining module, configured to determine the feature similarity between the image feature information and the text feature information in each training sample;
an error determining module, configured to determine the error between the feature similarity corresponding to each training sample and the corresponding subjective similarity;
and a parameter updating module, configured to propagate the error to the first convolution kernels and the second convolution kernels through a back propagation algorithm, so as to update the convolution parameters in the first convolution kernels and the second convolution kernels and the corresponding bias parameters, until convolution parameters and bias parameters meeting specified conditions are obtained.
Based on the same technical concept, the present application further provides a terminal device 800. Referring to fig. 8, the terminal device 800 is configured to implement the methods described in the above method embodiments, for example the embodiment shown in fig. 6, and may include a memory 801, a processor 802, an input unit 803, and a display panel 804.
The memory 801 is used for storing the computer programs executed by the processor 802. The memory 801 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, the application programs required for at least one function, and the like; the data storage area may store data created according to the use of the terminal device 800, and the like. The processor 802 may be a central processing unit (CPU), a digital processing unit, or the like. The input unit 803 may be used to acquire user instructions input by a user. The display panel 804 is configured to display information input by the user or information provided to the user; in this embodiment of the present application, it is mainly used to display the interface of each application program on the terminal device and the controls displayed in each interface. Optionally, the display panel 804 may be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like.
The embodiment of the present application does not limit the specific connection medium among the memory 801, the processor 802, the input unit 803, and the display panel 804. In fig. 8 they are connected by the bus 805, which is represented by a thick line; the connection manner between other components is merely illustrative and is not limiting. The bus 805 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus.
The memory 801 may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 801 may also be a combination of the above memories.
The processor 802 is configured to invoke the computer program stored in the memory 801 to perform the embodiment shown in fig. 6.
The embodiment of the present application further provides a computer-readable storage medium storing the computer-executable instructions required by the above-described processor, including the program to be executed by the processor.
In some possible embodiments, aspects of a method for audio recommendation provided by the present application may also be implemented in the form of a program product, which includes program code for causing a terminal device to perform the steps of a method for audio recommendation according to various exemplary embodiments of the present application described above in this specification, when the program product is run on the terminal device. For example, the terminal device may perform the embodiment as shown in fig. 6.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for audio recommendation of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device over any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, over the internet through an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more units described above may be embodied in one unit; conversely, the features and functions of one unit described above may be further divided so as to be embodied by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (12)
1. An audio recommendation method, characterized in that the method comprises:
acquiring to-be-scored image information, namely image information for which audio is to be recommended; the to-be-scored image information comprises at least one frame of image;
extracting features of the to-be-scored image information to obtain image feature information that describes the semantic content of the to-be-scored image information;
determining a feature similarity between the image feature information and each piece of text feature information among a plurality of pieces of text feature information; each piece of text feature information corresponds to an audio;
and selecting, according to the feature similarities, a specified number of audios from the plurality of audios as recommended audios for the to-be-scored image information.
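By way of illustration only, and not as part of the claims, the following minimal sketch shows the recommendation flow of claim 1, assuming the image feature vector and the per-audio text feature vectors have already been computed (as in claims 2 and 3 below); all function and variable names here are hypothetical.

```python
import numpy as np

def recommend_audio(image_feature, text_features, audio_ids, k=5):
    """Rank stored audios by the similarity between the image feature
    vector and each audio's text feature vector, and return the top-k
    audio ids as the recommended audios (claim 1)."""
    img = image_feature / np.linalg.norm(image_feature)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    similarities = txt @ img                    # cosine similarity per audio
    top_k = np.argsort(similarities)[::-1][:k]  # indices of the k best matches
    return [audio_ids[i] for i in top_k]
```

Here text_features is an (n, d) matrix with one row per stored audio, and cosine similarity is used as the feature similarity, anticipating claim 4.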
2. The method according to claim 1, wherein, if the to-be-scored image information comprises a plurality of frames of images, the extracting features of the to-be-scored image information to obtain image feature information that describes the semantic content of the to-be-scored image information comprises:
extracting features of each frame of image in the to-be-scored image information based on a residual network model to obtain a feature vector of each frame of image;
determining an average feature vector of the to-be-scored image information according to the feature vectors of the frames of images;
performing a convolution calculation on the average feature vector with each of a preset number of first convolution kernels, and adding the corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not all identical.
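A non-authoritative sketch of the image branch of claim 2, assuming a PyTorch environment and a pretrained ResNet-50 as the residual network model. Applying one convolution kernel per output dimension to a single averaged vector and adding a bias is implemented here as the equivalent fully-connected layer; the output dimension of 128 is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageBranch(nn.Module):
    """Per-frame residual-network features, averaged over frames, then
    projected to a specified dimension with learned weights and biases."""
    def __init__(self, out_dim=128):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # One learned kernel per output dimension plus a bias term; for a
        # single vector this is equivalent to the claimed convolutions.
        self.project = nn.Linear(2048, out_dim)

    def forward(self, frames):                    # frames: (num_frames, 3, H, W)
        feats = self.backbone(frames).flatten(1)  # (num_frames, 2048)
        avg = feats.mean(dim=0)                   # average feature vector
        return self.project(avg)                  # image feature information
```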
3. The method of claim 2, wherein a text corresponding to each audio is pre-stored, and the method further comprises:
acquiring the text feature information of each audio as follows:
performing word segmentation on the text corresponding to an audio and then removing stop words, to obtain the words of the text corresponding to the audio;
concatenating the words in their order of appearance in the text corresponding to the audio to obtain a word sequence;
performing word embedding on each word in the word sequence to obtain, for each word, a word vector that captures its contextual relations, the word vectors together serving as the text vector of the audio;
performing a convolution calculation on the text vector with each of a preset number of second convolution kernels, and adding the corresponding bias parameters, to obtain a feature vector of the specified dimension as the text feature information of the audio; wherein the convolution parameters of different second convolution kernels are not all identical, and the number of first convolution kernels is the same as the number of second convolution kernels.
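A hedged sketch of the text branch of claim 3, assuming the jieba library for Chinese word segmentation, a placeholder stop-word list, and a trainable embedding table standing in for the word-embedding step; the max-pooling at the end is an assumption about how the convolution outputs are reduced to a vector of the specified dimension.

```python
import jieba  # common Chinese word-segmentation library (assumed)
import torch
import torch.nn as nn

STOP_WORDS = {"的", "了", "和"}  # placeholder stop-word list

def tokenize(text):
    """Word segmentation followed by stop-word removal, preserving order."""
    return [w for w in jieba.cut(text) if w not in STOP_WORDS]

class TextBranch(nn.Module):
    """Word vectors convolved with a preset number of second kernels plus
    biases, reduced to a feature vector of the specified dimension."""
    def __init__(self, vocab_size, embed_dim=300, out_dim=128, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # out_dim second kernels, matching the number of first kernels.
        self.conv = nn.Conv1d(embed_dim, out_dim, kernel_size, bias=True)

    def forward(self, token_ids):                    # token_ids: (seq_len,) long
        vecs = self.embed(token_ids).T.unsqueeze(0)  # (1, embed_dim, seq_len)
        feats = torch.relu(self.conv(vecs))          # (1, out_dim, seq_len')
        return feats.max(dim=2).values.squeeze(0)    # text feature information
```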
4. The method of claim 1, wherein the determining a feature similarity between the image feature information and each piece of text feature information comprises:
determining the cosine similarity between the image feature information and each piece of text feature information.
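For reference, the cosine similarity referred to in this claim, between an image feature vector u and a text feature vector v, is the standard measure:

```latex
\operatorname{sim}(\mathbf{u}, \mathbf{v}) = \cos\theta =
  \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
```

Values near 1 indicate that the image semantics and the audio's text semantics point in nearly the same direction in the shared feature space.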
5. The method of claim 3, wherein the convolution parameters of the first convolution kernels and the second convolution kernels and the bias parameters are obtained by training as follows:
obtaining training samples; each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in that training sample;
determining the feature similarity between the image feature information and the text feature information in each training sample;
determining the error between the feature similarity of each training sample and the corresponding subjective similarity;
and propagating the error to the first convolution kernels and the second convolution kernels through a back-propagation algorithm, so as to update the convolution parameters of the first and second convolution kernels and the corresponding bias parameters until convolution parameters and bias parameters satisfying specified conditions are obtained.
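An illustrative training step for claim 5, assuming PyTorch, mean-squared error as the error between the predicted feature similarity and the subjective similarity label (the claim does not fix a particular error function), and hypothetical modules image_proj and text_proj holding the first and second convolution kernels and their biases.

```python
import torch
import torch.nn.functional as F

def train_step(image_proj, text_proj, batch, optimizer):
    """One step: compute feature similarities, measure the error against
    the subjective similarity labels, and back-propagate it to update the
    convolution parameters and bias parameters of both branches."""
    image_feats, text_feats, subjective_sims = batch  # (B, d_in), (B, d_in), (B,)
    img = image_proj(image_feats)                  # (B, out_dim)
    txt = text_proj(text_feats)                    # (B, out_dim)
    predicted = F.cosine_similarity(img, txt)      # feature similarity, (B,)
    loss = F.mse_loss(predicted, subjective_sims)  # error vs. subjective label
    optimizer.zero_grad()
    loss.backward()                                # back-propagation algorithm
    optimizer.step()                               # update kernels and biases
    return loss.item()
```

Training would stop once the loss, and hence the parameters, satisfies the specified conditions, e.g. falls below a chosen threshold.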
6. An audio recommendation apparatus, characterized in that the apparatus comprises:
a first acquisition module, configured to acquire to-be-scored image information, wherein the to-be-scored image information comprises at least one frame of image;
a feature extraction module, configured to extract features of the to-be-scored image information to obtain image feature information that describes the semantic content of the to-be-scored image information;
a first similarity determining module, configured to determine a feature similarity between the image feature information and each piece of text feature information among a plurality of pieces of text feature information, wherein each piece of text feature information corresponds to an audio;
and a recommending module, configured to select, according to the feature similarities, a specified number of audios from the plurality of audios as recommended audios for the to-be-scored image information.
7. The apparatus of claim 6, wherein, if the to-be-scored image information comprises a plurality of frames of images, the feature extraction module comprises:
a feature extraction unit, configured to extract features of each frame of image in the to-be-scored image information based on a residual network model to obtain a feature vector of each frame of image;
a feature averaging unit, configured to determine an average feature vector of the to-be-scored image information according to the feature vectors of the frames of images;
and a calculation unit, configured to perform a convolution calculation on the average feature vector with each of a preset number of first convolution kernels and add the corresponding bias parameters, to obtain a feature vector of a specified dimension as the image feature information; wherein the convolution parameters of different first convolution kernels are not all identical.
8. The apparatus of claim 7, wherein a text corresponding to each audio is pre-stored, and the apparatus further comprises the following modules for acquiring the text feature information of each audio:
a processing module, configured to perform word segmentation on the text corresponding to an audio and then remove stop words, to obtain the words of the text corresponding to the audio;
a concatenation module, configured to concatenate the words in their order of appearance in the text corresponding to the audio to obtain a word sequence;
a word embedding module, configured to perform word embedding on each word in the word sequence to obtain, for each word, a word vector that captures its contextual relations, the word vectors together serving as the text vector of the audio;
and a calculation module, configured to perform a convolution calculation on the text vector with each of a preset number of second convolution kernels and add the corresponding bias parameters, to obtain a feature vector of the specified dimension as the text feature information of the audio; wherein the convolution parameters of different second convolution kernels are not all identical, and the number of first convolution kernels is the same as the number of second convolution kernels.
9. The apparatus of claim 6, wherein the first similarity determining module is specifically configured to determine the cosine similarity between the image feature information and each piece of text feature information.
10. The apparatus of claim 8, wherein the convolution parameters of the first convolution kernels and the second convolution kernels and the bias parameters are obtained by training with the following modules:
a second acquisition module, configured to obtain training samples, wherein each training sample comprises image feature information of sample image information and text feature information of a sample text, and the label of the training sample is the subjective similarity between the image feature information and the text feature information in that training sample;
a second similarity determining module, configured to determine the feature similarity between the image feature information and the text feature information in each training sample;
an error determining module, configured to determine the error between the feature similarity of each training sample and the corresponding subjective similarity;
and a parameter updating module, configured to propagate the error to the first convolution kernels and the second convolution kernels through a back-propagation algorithm, so as to update the convolution parameters of the first and second convolution kernels and the corresponding bias parameters until convolution parameters and bias parameters satisfying specified conditions are obtained.
11. A computing device comprising at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of the method of any one of claims 1 to 5.
12. A computer-readable medium, in which a computer program executable by a terminal device is stored, which program, when run on the terminal device, causes the terminal device to carry out the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910604904.8A CN112182281B (en) | 2019-07-05 | 2019-07-05 | Audio recommendation method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910604904.8A CN112182281B (en) | 2019-07-05 | 2019-07-05 | Audio recommendation method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112182281A (en) | 2021-01-05
CN112182281B CN112182281B (en) | 2023-09-19 |
Family
ID=73915959
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910604904.8A Active CN112182281B (en) | 2019-07-05 | 2019-07-05 | Audio recommendation method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112182281B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100050064A1 (en) * | 2008-08-22 | 2010-02-25 | AT&T Labs, Inc. | System and method for selecting a multimedia presentation to accompany text |
US8737817B1 (en) * | 2011-02-08 | 2014-05-27 | Google Inc. | Music soundtrack recommendation engine for videos |
CN103927366A (en) * | 2014-04-21 | 2014-07-16 | 苏州大学 | Method and system for automatically playing songs according to pictures |
KR20140137030A (en) * | 2013-05-21 | 2014-12-02 | 삼성전자주식회사 | Method and apparatus for providing information using messenger |
CN108319686A (en) * | 2018-02-01 | 2018-07-24 | 北京大学深圳研究生院 | Antagonism cross-media retrieval method based on limited text space |
CN108647350A (en) * | 2018-05-16 | 2018-10-12 | 中国人民解放军陆军工程大学 | Image-text associated retrieval method based on two-channel network |
CN109063163A (en) * | 2018-08-14 | 2018-12-21 | 腾讯科技(深圳)有限公司 | A kind of method, apparatus, terminal device and medium that music is recommended |
CN109783655A (en) * | 2018-12-07 | 2019-05-21 | 西安电子科技大学 | A kind of cross-module state search method, device, computer equipment and storage medium |
US20190205393A1 (en) * | 2016-07-11 | 2019-07-04 | Peking University Shenzhen Graduate School | A cross-media search method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113377971A (en) * | 2021-05-31 | 2021-09-10 | 北京达佳互联信息技术有限公司 | Multimedia resource generation method and device, electronic equipment and storage medium |
CN113377971B (en) * | 2021-05-31 | 2024-02-27 | 北京达佳互联信息技术有限公司 | Multimedia resource generation method and device, electronic equipment and storage medium |
CN114302167A (en) * | 2021-09-01 | 2022-04-08 | 腾讯科技(深圳)有限公司 | Video dubbing method, device, equipment and computer readable storage medium |
CN114302167B (en) * | 2021-09-01 | 2025-03-04 | 腾讯科技(深圳)有限公司 | Video music matching method, device, equipment and computer readable storage medium |
CN113836343A (en) * | 2021-09-14 | 2021-12-24 | 深圳Tcl新技术有限公司 | Audio recommendation method and device, electronic equipment and storage medium |
CN114022822A (en) * | 2021-11-15 | 2022-02-08 | 北京达佳互联信息技术有限公司 | Video score obtaining method and device, electronic equipment and computer readable medium |
CN114022822B (en) * | 2021-11-15 | 2025-04-11 | 北京达佳互联信息技术有限公司 | Method, device, electronic device and computer-readable medium for obtaining video soundtrack |
CN115662467A (en) * | 2022-09-30 | 2023-01-31 | 恩平市炫音电子科技有限公司 | Music intelligent playing control system and method based on big data |
Also Published As
Publication number | Publication date |
---|---|
CN112182281B (en) | 2023-09-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102155261B1 (en) | Wide and deep machine learning models | |
AU2014201827B2 (en) | Scoring concept terms using a deep network | |
CN107577807B (en) | Method and device for pushing information | |
US20200401896A1 (en) | Generating larger neural networks | |
CN112182281B (en) | Audio recommendation method, device and storage medium | |
US20200322570A1 (en) | Method and apparatus for aligning paragraph and video | |
JP7652916B2 | Method and apparatus for pushing information | |
CN111897950A (en) | Method and apparatus for generating information | |
CN110766513B (en) | Information ordering method, device, electronic equipment and readable storage medium | |
CN112836128A (en) | Information recommendation method, apparatus, device and storage medium | |
CN112650942A (en) | Product recommendation method, device, computer system and computer-readable storage medium | |
CN112269942B (en) | Method, device and system for recommending object and electronic equipment | |
CN113395538A (en) | Sound effect rendering method and device, computer readable medium and electronic equipment | |
CN113836404B (en) | Object recommendation method, device, electronic equipment and computer readable storage medium | |
CN113051381B (en) | Information quality inspection method, information quality inspection device, computer system and computer readable storage medium | |
US10579752B2 (en) | Generating a model based on input | |
US20220245460A1 (en) | Adaptive self-adversarial negative sampling for graph neural network training | |
CN116911954B (en) | Method and device for recommending items based on interests and popularity | |
CN111507471A (en) | Model training method, device, equipment and storage medium | |
WO2024152686A1 (en) | Method and apparatus for determining recommendation index of resource information, device, storage medium and computer program product | |
CN113297418A (en) | Project prediction and recommendation method, device and system | |
CN113742593B (en) | Method and device for pushing information | |
CN113849735B (en) | A game recommendation method and system based on hybrid model | |
CN113220922A (en) | Image searching method and device and electronic equipment | |
CN110942306A (en) | Data processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||