CN118227910B - Media resource aggregation method, device, equipment and storage medium
- Publication number: CN118227910B (application number CN202410635095.8A)
- Authority: CN (China)
- Prior art keywords: resource, information, media, feature, target
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/44—Browsing; Visualisation therefor (retrieval of multimedia data, e.g. slideshows comprising image and additional audio data)
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/23—Clustering techniques
- G06F18/24—Classification techniques
- G06F18/253—Fusion techniques of extracted features
- G06N3/0442—Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a media resource aggregation method, device, equipment and storage medium, relating to the technical field of machine learning. The method comprises the following steps: performing feature extraction based on the resource name information, resource serial number information and release attribute information included in the multi-modal information of each of a plurality of media resources, to obtain multi-modal feature information for each media resource; inputting the multi-modal feature information and the resource serial number feature information into a resource feature fusion model, which guides their semantic fusion based on the feature related information within the multi-modal feature information, to obtain semantic fusion feature information for each media resource; aggregating the plurality of media resources based on the semantic fusion feature information to determine a plurality of media resource collections; and generating, for each media resource collection, a corresponding media resource sequence carrying collection name information. The scheme of the application can improve the accuracy of resource ordering while also improving the accuracy of resource aggregation.
Description
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for aggregating media resources.
Background
With the development of computer and network technology, various media resource platforms have appeared, on which users can browse media resources such as videos, audiobooks, or comics according to their actual needs. To satisfy users' need for continuous consumption while browsing, a media resource platform usually needs to display a plurality of plot-related media resources in a plot-coherent order.
However, in the related art, resources are generally compared pairwise for similarity and all similar resources are then aggregated. Not only are the aggregation efficiency and accuracy of this approach low, but the plot coherence inside the resulting resource collection is also difficult to guarantee.
Disclosure of Invention
The application provides a media resource aggregation method, device, equipment and storage medium that can improve the accuracy of media resource ordering while improving the accuracy and efficiency of media resource aggregation, thereby improving the plot coherence of the generated media resource sequence and safeguarding the browsing experience of the resource-requesting object. The technical scheme of the application is as follows:
in one aspect, a method for aggregating media resources is provided, the method comprising:
acquiring multi-mode information corresponding to each media resource in a plurality of media resources of a target resource type, wherein the multi-mode information comprises: resource name information, resource serial number information and release attribute information;
performing feature extraction on each media resource based on the resource name information, the resource serial number information and the release attribute information to obtain multi-mode feature information corresponding to each media resource;
inputting the multi-modal feature information and the resource serial number feature information corresponding to the resource serial number information into a resource feature fusion model, and guiding the multi-modal feature information and the resource serial number feature information to carry out semantic fusion based on feature related information among feature elements in the multi-modal feature information to obtain semantic fusion feature information corresponding to each media resource;
based on the semantic fusion characteristic information corresponding to each media resource, carrying out resource aggregation on the plurality of media resources to determine a plurality of media resource sets;
based on the multi-mode information corresponding to the media resources in each media resource collection, the media resources in each media resource collection are ordered, and a media resource sequence corresponding to each media resource collection is generated, wherein the media resource sequence carries collection name information of the corresponding media resource collection.
In another aspect, there is provided a media resource aggregation apparatus, the apparatus comprising:
The multi-mode information acquisition module is used for acquiring multi-mode information corresponding to each media resource in a plurality of media resources of a target resource type, and the multi-mode information comprises: resource name information, resource serial number information and release attribute information;
The feature extraction module is used for extracting features of each media resource based on the resource name information, the resource serial number information and the release attribute information to obtain multi-mode feature information corresponding to each media resource;
The semantic fusion module is used for inputting the multi-modal feature information and the resource serial number feature information corresponding to the resource serial number information into a resource feature fusion model, and guiding the multi-modal feature information and the resource serial number feature information to carry out semantic fusion based on feature related information among feature elements in the multi-modal feature information to obtain semantic fusion feature information corresponding to each media resource;
The resource aggregation module is used for carrying out resource aggregation on the plurality of media resources based on the semantic fusion characteristic information corresponding to each media resource and determining a plurality of media resource sets;
The sequence generation module is used for sequencing the media resources in each media resource collection based on the multi-mode information corresponding to the media resources in each media resource collection, and generating a media resource sequence corresponding to each media resource collection, wherein the media resource sequence carries collection name information of the corresponding media resource collection.
In another aspect, a media resource aggregation apparatus is provided, the apparatus comprising a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement the media resource aggregation method described above.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction or at least one program is stored, the at least one instruction or the at least one program being loaded and executed by a processor to implement the media resource aggregation method described above.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the media resource aggregation method described above.
The media resource aggregation method, the device, the equipment and the storage medium provided by the application have the following technical effects:
According to the method, multi-modal information corresponding to each media resource of the target resource type is obtained by artificial-intelligence means, and feature extraction is performed on each media resource based on the resource name information, resource serial number information and release attribute information in the multi-modal information to obtain the multi-modal feature information corresponding to each media resource. Semantic fusion of the multi-modal feature information and the resource serial number feature information corresponding to the resource serial number information is then guided by the feature related information among the feature elements in the multi-modal feature information, yielding the semantic fusion feature information corresponding to each media resource; this improves how accurately the semantic fusion feature information represents the deep semantic information of the media resource. Resource aggregation is then performed on the plurality of media resources based on the semantic fusion feature information corresponding to each media resource to determine a plurality of media resource collections, and the media resources in each collection are ordered based on the multi-modal information corresponding to them, generating a media resource sequence carrying the collection name information of each media resource collection. In this way, the accuracy and efficiency of media resource aggregation and the accuracy of media resource ordering can be improved, the plot coherence of the media resource sequence can be improved, and the browsing experience of the resource-requesting object can be guaranteed.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of a media resource aggregation method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a multi-modal feature information extraction scheme according to an embodiment of the present application;
FIG. 4 is a flowchart of another embodiment of a multi-modal feature information extraction scheme according to the present application;
FIG. 5 is a schematic flow diagram of a process for inputting target level input information and resource serial number feature information into a target level fusion module, guiding the target level input information and the resource serial number feature information to perform cross fusion processing based on target feature related information among feature elements in the target level input information, and obtaining target level output information;
FIGS. 6a-6e are schematic diagrams of a model structure provided by embodiments of the present application;
FIG. 7 is a flowchart of another media resource aggregation method according to an embodiment of the present application;
fig. 8 is a schematic flow chart of a scheme for determining the name information of a collection according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a media asset aggregation system according to an embodiment of the present application;
FIG. 10 is a flowchart of a media resource sequence display scheme according to an embodiment of the present application;
FIG. 11 is a block diagram of a media asset aggregation device according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of a media resource aggregation device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It is noted that the terms "comprises" and "comprising", and any variations thereof, in the description and claims of the present application and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, product, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, product, or server.
It will be appreciated that in the specific embodiments of the present application, related data such as user information is involved, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
To facilitate an understanding of embodiments of the present application, several concepts will be briefly described as follows:
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that involves a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, pre-training model technology, operation/interaction systems, mechatronics, and the like. The pre-training model, also called the large model or foundation model, can be fine-tuned and then widely applied to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technology mainly covers computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers, instead of human eyes, to identify and measure targets and perform further graphic processing, so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important changes to the development of computer vision; pre-trained models in the vision field such as Swin Transformer, ViT, V-MoE, and MAE can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech has become one of its most promising modes. Large-model technology has also reformed the development of speech technology; pre-trained models using the Transformer architecture, such as WavLM and UniSpeech, have strong generalization and universality and can excellently complete speech processing tasks in all directions.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing involves natural language, i.e. the language people use in daily life, so it is closely related to linguistics; it also involves computer science and mathematics. The pre-training model, an important technique for model training in the artificial intelligence domain, developed from the large language models of the NLP field. Through fine-tuning, large language models can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements the learning behavior of humans in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning. The pre-training model is the latest development of deep learning and integrates these techniques.
With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, drones, digital twins, virtual humans, robots, AI-generated content (AIGC), conversational interaction, smart medical care, smart customer service, and game AI. It is believed that, with the development of technology, artificial intelligence will be applied in ever more fields and show increasing value.
In addition, technical terms related to the present application include:
A regular expression, also known as a regex or rule expression, is a text pattern that includes ordinary characters (e.g., the letters a to z) and special characters (called "metacharacters"); it is a concept in computer science. A regular expression uses a single string to describe and match a series of strings conforming to a certain syntactic rule, and is typically used to retrieve or replace text that matches a certain pattern (rule).
The scheme provided by the embodiment of the application relates to artificial intelligence natural language processing, machine learning and other technologies, and is specifically described by the following embodiments:
The media resource aggregation method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1, which may include a client 10 and a server side 20 connected indirectly by wireless communication. A related object (such as a user) may send, through the client 10, a resource aggregation request for a plurality of media resources of a target resource type to the server side 20. In response to the request, the server side 20 may obtain the multi-modal information corresponding to each of the plurality of media resources, which may include: resource name information, resource serial number information and release attribute information. The server side 20 then performs feature extraction on each media resource based on that information to obtain the corresponding multi-modal feature information, inputs the multi-modal feature information and the resource serial number feature information corresponding to the resource serial number information into a resource feature fusion model, and guides their semantic fusion based on the feature related information among the feature elements in the multi-modal feature information to obtain the semantic fusion feature information corresponding to each media resource. Based on this, it performs resource aggregation on the plurality of media resources to determine a plurality of media resource collections, and finally orders the media resources within each collection based on their multi-modal information, generates a media resource sequence carrying the collection name information of each collection, and feeds the media resource sequences back to the client 10. It should be noted that FIG. 1 is only an example.
The client may be a smartphone, a computer (such as a desktop, tablet, or notebook computer), a digital assistant, an intelligent voice interaction device (such as a smart speaker), a smart wearable device, a vehicle-mounted terminal, or another type of physical device; it may also be software running on a physical device, such as a computer program. The operating system of the client may be an Android system, an iOS system (a mobile operating system developed by Apple Inc.), a Linux system, a Microsoft Windows system, or the like.
The server side may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms. The server may comprise a network communication unit, a processor, a memory, and so on. The server side can provide background services for the corresponding client.
The client 10 and the server 20 may be used to construct a system related to the aggregation of media resources, which may be a distributed system.
It should be noted that, the media resource aggregation method provided by the present application may be applied to a client or a server, and is not limited to the above embodiment of the application environment.
A specific embodiment of the media resource aggregation method provided by the application is described below. FIG. 2 is a schematic flow chart of a media resource aggregation method according to an embodiment of the present application. The method operation steps described in the examples or flow charts are provided, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only one. In an actual system or product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel-processor or multi-threaded processing environment). Specifically, as shown in FIG. 2, the method may include:
s201, obtaining multi-mode information corresponding to each media resource in a plurality of media resources of a target resource type, wherein the multi-mode information comprises: resource name information, resource sequence number information, and release attribute information.
In the present description embodiment, the target resource types may include, but are not limited to: audio types, video types, graphic-text types, and the like. Specifically, media resources of the audio type may include audiobooks; media resources of the video type may include movies, TV series, variety shows, and the like; and media resources of the graphic-text type may include comics and the like.
In the embodiment of the present disclosure, the multi-modal information corresponding to each media resource may be resource description information corresponding to multiple modalities contained in the media resource, and specifically, the multi-modal information corresponding to each media resource may include, but is not limited to: resource name information, resource number information, distribution attribute information, key image information, and the like.
In the present description embodiment, the resource name information may be used to identify the media resource. In a specific embodiment, the resource name information corresponding to the media resource may be extracted from the resource text information of the media resource. Specifically, the resource text information herein may include, but is not limited to: resource title text, resource cover text, topic label text, resource content text, resource comment text, and the like.
In an alternative embodiment, the resource name information for each media resource may be determined by: 1) Extracting at least one entity word from the resource text information of each media resource; 2) Calculating the weight corresponding to each entity word in the at least one entity word; 3) Determining at least one core word in the at least one entity word based on the weight corresponding to each of the at least one entity word; 4) And taking the core word successfully matched with the preset resource name word stock in the at least one core word as the resource name information of each media resource.
Specifically, the weight corresponding to each of the at least one entity word may be calculated by a natural language processing model, which may be any machine learning model in the prior art capable of vocabulary weight analysis; the application is not limited in this respect. Illustratively, the natural language processing model may include: the BERT (Bidirectional Encoder Representations from Transformers) model, the ELMo model, the BiLSTM-CRF model, or the Lattice-LSTM model. Specifically, the preset resource name lexicon may be preset by an industry expert in the field of media resource processing. Alternatively, the entity words between Chinese title marks (《 》) can be extracted from the resource text information by a regular expression and used as the resource name information.
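Illustratively, the regular-expression alternative above can be sketched as follows in Python; the pattern, sample text, and function name are illustrative assumptions, since the patent does not disclose its exact expression:

```python
import re

# Candidate resource name: the text enclosed in Chinese title marks 《 》.
# The pattern is an assumed example, not the patent's actual expression.
TITLE_MARK_PATTERN = re.compile(r"《([^》]+)》")

def extract_resource_name(resource_text: str):
    """Return the entity word between title marks, or None if absent."""
    match = TITLE_MARK_PATTERN.search(resource_text)
    return match.group(1) if match else None

print(extract_resource_name("《三体》第03集 高清"))  # -> 三体
```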
In the embodiment of the present specification, the resource serial number information may represent the plot development order of the corresponding media resource. In a specific embodiment, the resource serial number information corresponding to a media resource may be extracted from its resource text information. Optionally, resource serial number information consisting of an order-indicating text and an episode word may be extracted from the resource text information by a regular expression. The order-indicating text may include, but is not limited to: Arabic numerals such as 1, 2, 3; Chinese numerals such as 一, 二, 三 or their formal variants 壹, 贰, 叁; Heavenly Stems characters such as 甲, 乙, 丙, 丁; or Earthly Branches characters such as 子, 丑, 寅, 卯. The episode words may include, but are not limited to: 集 (episode), 季 (season), 回 (chapter), 章 (chapter), 话 (installment), and similar words. Illustratively, if the resource text information of a media resource includes "xxx 第03集" (episode 03 of xxx), "第03集" can be extracted as the resource serial number information of that media resource. Alternatively, the resource serial number information can be extracted from the resource text information by a natural language processing model, which may be any machine learning model in the prior art capable of extracting plot serial numbers; the application is not particularly limited in this respect.
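Illustratively, the regular-expression extraction of the resource serial number can be sketched as follows; the pattern and the episode-word list are assumptions for illustration only:

```python
import re

# An assumed pattern combining an order-indicating text (Arabic or Chinese
# numerals) with an episode word (集/季/回/章/话); the patent's real
# expression is not disclosed.
SERIAL_PATTERN = re.compile(r"第([0-9零一二三四五六七八九十百]+)([集季回章话])")

def extract_serial_number(resource_text: str):
    match = SERIAL_PATTERN.search(resource_text)
    return match.group(0) if match else None

print(extract_serial_number("xxx 第03集"))  # -> 第03集
```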
In the embodiment of the present specification, the publishing attribute information may be basic information related to a publishing attribute of a media asset. In particular, the published attribute information may include, but is not limited to: release time information, resource release object identification information, and the like.
In the embodiment of the present specification, the key image information may be key image data contained in the media resource, and specifically, the key image information may include, but is not limited to: cover image information, key frame image information.
S202, extracting features of each media resource based on the resource name information, the resource serial number information and the release attribute information to obtain multi-mode feature information corresponding to each media resource.
In this embodiment of the present disclosure, the multi-modal feature information corresponding to each media resource may be feature information obtained by performing feature extraction based on resource name information of the corresponding media resource, resource serial number information of the corresponding media resource, and release attribute information of the corresponding media resource, where the multi-modal feature information corresponding to each media resource may be used to characterize a fusion feature of multi-modal information of the corresponding media resource. In a specific embodiment, the representation of the multi-modal feature information may be a multi-modal feature vector.
In a specific embodiment, as shown in fig. 3, the feature extraction of each media resource based on the resource name information, the resource serial number information and the release attribute information to obtain multi-mode feature information corresponding to each media resource may include:
S301, text semantic extraction is carried out on the resource name information, and resource name feature information is obtained.
Specifically, the resource name feature information may be feature information obtained by performing text semantic extraction on the resource name information, and the resource name feature information may be used to characterize name semantic features of the media resource. In a specific embodiment, the representation of the resource name feature information may be a resource name feature vector.
In an alternative embodiment, the resource name information may be input into a name semantic extraction model to perform context semantic extraction to obtain resource name feature information, where the name semantic extraction model may be any machine learning model capable of implementing context semantic extraction in the prior art, which is not particularly limited by the present application, and the name semantic extraction model may include, but is not limited to: BERT model, LSTM model, etc.
In an alternative embodiment, input information may be formed by prepending a global identifier with no actual semantic meaning to the resource name information of the media resource; this input information is fed into the name semantic extraction model for context semantic extraction to obtain output information, which may include: the semantic feature vector corresponding to the global identifier and the semantic feature vector corresponding to each character in the resource name information. The semantic feature vector corresponding to the global identifier is used as the resource name feature information of the media resource.
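Illustratively, the global-identifier scheme maps naturally onto the [CLS] token of a BERT model. The following sketch uses a pretrained Chinese checkpoint from the Hugging Face transformers library as an assumed stand-in for the name semantic extraction model:

```python
import torch
from transformers import BertModel, BertTokenizer

# The checkpoint is an illustrative assumption; the patent only requires
# "a" context semantic extraction model such as BERT or LSTM.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

def name_feature(resource_name: str) -> torch.Tensor:
    # The tokenizer prepends the [CLS] global identifier automatically.
    inputs = tokenizer(resource_name, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # The semantic feature vector at the [CLS] position stands for the name.
    return outputs.last_hidden_state[:, 0, :]

print(name_feature("三体").shape)  # torch.Size([1, 768])
```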
S302, text semantic extraction is carried out on the release attribute information, and release attribute feature information is obtained.
Specifically, the release attribute feature information may be feature information obtained by performing text semantic extraction on the release attribute information, and the release attribute feature information may be used to characterize release attribute features of the media resource. In a particular embodiment, the presentation of the publication attribute feature information may be a publication attribute feature vector.
In an alternative embodiment, the published attribute information may be input into an attribute semantic extraction model to perform context semantic extraction to obtain published attribute feature information, where the attribute semantic extraction model may be any machine learning model capable of implementing context semantic extraction in the prior art, which is not particularly limited by the present application, and the attribute semantic extraction model may include, but is not limited to: BERT model, LSTM model, etc.
In an alternative embodiment, an input information may be obtained after adding a global identifier without an actual semantic meaning before the attribute information is released from the media resource, and the input information is input into the attribute semantic extraction model to perform context semantic extraction to obtain output information, where the output information may include: the semantic feature vector corresponding to the global identifier and the semantic feature vector corresponding to each character in the release attribute information are used as the release attribute feature information of the media resource.
In an alternative embodiment, the publishing attribute information may include publishing time information, and the attribute semantic extraction model may include: the temporal semantic extraction model, the publishing attribute feature information may include: the issuing time feature information, correspondingly, inputting the issuing attribute information into the attribute semantic extraction model to perform context semantic extraction, and obtaining the issuing attribute feature information may include: converting the data representation format of the release time information into a text representation format to obtain a release time text, inputting the release time text into a time semantic extraction model to extract context semantics, and obtaining release time characteristic information.
Illustratively, taking release time information in the time representation format 2024/01/01 17:17:17 as an example, the release time information may be converted from the time representation format into a text representation format such as "January 1, 2024, 17 hours 17 minutes 17 seconds".
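Illustratively, such a format conversion can be sketched as below; the English textual template is an assumption, since the patent only requires converting a numeric time format into a text representation:

```python
from datetime import datetime

def publish_time_to_text(ts: str) -> str:
    # Parse the numeric time representation format ...
    dt = datetime.strptime(ts, "%Y/%m/%d %H:%M:%S")
    # ... and render it in an assumed textual template.
    return dt.strftime("%B %d, %Y, %H hours %M minutes %S seconds")

print(publish_time_to_text("2024/01/01 17:17:17"))
# -> January 01, 2024, 17 hours 17 minutes 17 seconds
```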
S303, carrying out position coding on the resource serial number information to obtain the characteristic information of the resource serial number.
Specifically, the resource serial number feature information may be the feature information obtained by position-encoding the resource serial number information, and it may be used to characterize the encoding feature of the resource plot serial number of the media resource. In a specific embodiment, the representation of the resource serial number feature information may be a resource serial number encoding vector.
In an alternative embodiment, a digital text in the resource serial number information may be extracted, the digital text is converted into a digital representation format, a resource serial number is obtained, and the resource serial number is position-coded to obtain a resource serial number coding vector.
In an alternative embodiment, a sine function and a cosine function may be used to construct the resource serial number encoding vector. Illustratively, the encoding vector corresponding to resource serial number j may be expressed as: P_j = [e(j,1), e(j,2), …, e(j,2i), e(j,2i+1), …, e(j,d)], where e(j,2i) = sin(j / n^(2i/d)) and e(j,2i+1) = cos(j / n^(2i/d)); d denotes the dimension of the vector space (optionally, d may be 128); n denotes a custom scalar (optionally, n may be 10000); i maps to the column indices of the vector elements, with 0 ≤ i < d/2, and each single value of i is mapped simultaneously to one sine component and one cosine component.
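Illustratively, the encoding above can be implemented directly; the sketch below uses the optional values d = 128 and n = 10000 given in the text:

```python
import numpy as np

def serial_number_encoding(j: int, d: int = 128, n: int = 10000) -> np.ndarray:
    """Sinusoidal position encoding of resource serial number j."""
    p = np.zeros(d)
    for i in range(d // 2):
        angle = j / n ** (2 * i / d)
        p[2 * i] = np.sin(angle)      # sine component for column 2i
        p[2 * i + 1] = np.cos(angle)  # cosine component for column 2i + 1
    return p

print(serial_number_encoding(3)[:4])  # first four components of P_3
```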
S304, characteristic stitching is carried out based on the resource name characteristic information, the release attribute characteristic information and the resource serial number characteristic information, and multi-mode characteristic information is obtained.
In a specific embodiment, the resource name feature information may include: the resource name feature vector, the publishing attribute feature information may include: the issuing attribute feature vector, the resource sequence number feature information may include: the resource sequence number encoding vector, the multi-modal feature information may include: the multi-mode feature vector, correspondingly, performing feature stitching based on the resource name feature information, the release attribute feature information and the resource serial number feature information to obtain multi-mode feature information may include: and performing feature stitching based on the resource name feature vector, the release attribute feature vector and the resource serial number coding vector to obtain the multi-mode feature vector.
According to the embodiment, the feature splicing is performed on the release attribute feature information corresponding to the release attribute information and the resource serial number feature information corresponding to the resource serial number information based on the resource name feature information corresponding to the resource name information, so that the multi-mode feature information is obtained, and the integrity of the multi-mode feature information on the multi-mode information representation of the media resource can be improved.
In an alternative embodiment, the multi-modal information may further include: as shown in fig. 4, before the feature stitching is performed on the key image information based on the resource name feature information, the release attribute feature information and the resource serial number feature information to obtain the multi-mode feature information, the method may further include:
S305, extracting image features of the key image information to obtain image feature information.
Specifically, the image feature information may be feature information obtained by extracting image features from key image information, and the image feature information may be used to represent image semantic features of the key image information of the media resource. In a specific embodiment, the representation of the image characteristic information may be an image characteristic vector.
In an alternative embodiment, the key image information may be input into an image feature extraction model to perform image semantic extraction to obtain image feature information, where the image feature extraction model may be any machine learning model capable of implementing image semantic extraction in the prior art, and the application is not limited thereto, and illustratively, the image feature extraction model may include, but is not limited to: viT (Visual Transformer) model, CNN model, transducer model, etc.
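Illustratively, the ViT branch can be sketched with a pretrained checkpoint from the transformers library; the checkpoint name is an assumption, and any of the image semantic extraction models mentioned above would fit:

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Assumed checkpoint; the patent does not prescribe a specific model.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def image_feature(cover_image: Image.Image) -> torch.Tensor:
    inputs = processor(images=cover_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] token as the image vector

print(image_feature(Image.new("RGB", (224, 224))).shape)  # torch.Size([1, 768])
```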
Correspondingly, performing feature stitching based on the resource name feature information, the release attribute feature information and the resource serial number feature information to obtain multi-mode feature information may include:
s3041, performing feature stitching on the resource name feature information, the release attribute feature information, the resource serial number feature information and the image feature information to obtain multi-mode feature information.
In a specific embodiment, the resource name feature information may include: the resource name feature vector; the publishing attribute feature information may include: the publishing attribute feature vector; the resource serial number feature information may include: the resource serial number encoding vector; the image feature information may include: the image feature vector; and the multi-modal feature information may include: the multi-modal feature vector. Correspondingly, performing feature stitching on the resource name feature information, the release attribute feature information, the resource serial number feature information and the image feature information to obtain the multi-modal feature information may include: performing feature stitching on the resource name feature vector, the release attribute feature vector, the resource serial number encoding vector and the image feature vector to obtain the multi-modal feature vector.
According to the embodiment, the multi-modal feature information is obtained by combining the image feature information corresponding to the key image information on the basis of the resource name feature information, the release attribute feature information and the resource serial number feature information, so that the integrity and the richness of the multi-modal feature information on the multi-modal information representation of the media resource can be further improved.
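Illustratively, the splicing step reduces to vector concatenation; the dimensions below are assumptions matching the sketches above:

```python
import numpy as np

# Stand-in feature vectors; a real system would use the extractors above.
name_vec = np.random.rand(768)    # resource name feature vector
attr_vec = np.random.rand(768)    # release attribute feature vector
serial_vec = np.random.rand(128)  # resource serial number encoding vector
image_vec = np.random.rand(768)   # image feature vector

multimodal_vec = np.concatenate([name_vec, attr_vec, serial_vec, image_vec])
print(multimodal_vec.shape)  # (2432,)
```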
S203, inputting the multi-mode feature information and the resource serial number feature information corresponding to the resource serial number information into a resource feature fusion model, and guiding the multi-mode feature information and the resource serial number feature information to carry out semantic fusion based on the feature related information among feature elements in the multi-mode feature information to obtain semantic fusion feature information corresponding to each media resource.
In this embodiment of the present disclosure, the semantic fusion feature information corresponding to each media resource may be feature information obtained by inputting the multi-mode feature information corresponding to each media resource and the resource serial number feature information corresponding to each media resource into a resource feature fusion model, and guiding the multi-mode feature information and the resource serial number feature information to perform semantic fusion based on feature related information between feature elements in the multi-mode feature information. In particular, the semantic fusion feature information may be used to characterize deep semantic features of the corresponding media asset. In a specific embodiment, the presentation form of the semantic fusion feature information may be a semantic fusion feature vector.
In the embodiment of the present disclosure, the feature related information corresponding to the multi-modal feature information may represent a correlation between a plurality of feature elements in the multi-modal feature information. In a specific embodiment, the multi-modal feature information may include: the multi-modal feature vector, the plurality of feature elements in the multi-modal feature information may include: the vector components corresponding to each of the plurality of vector dimensions in the multi-modal feature vector, and the feature correlation information may characterize correlations between the vector components corresponding to each of the plurality of vector dimensions.
In a specific embodiment, the resource feature fusion model may include: and the at least two stages of fusion modules are connected in sequence, and the module structures of all stages of fusion modules in the at least two stages of fusion modules are the same. In an alternative embodiment, the resource feature fusion model may include: and the three-level fusion modules are connected in sequence.
In a specific embodiment, inputting the resource serial number feature information corresponding to the multi-mode feature information and the resource serial number information into the resource feature fusion model, guiding the multi-mode feature information and the resource serial number feature information to perform semantic fusion based on feature related information among feature elements in the multi-mode feature information, and obtaining the semantic fusion feature information corresponding to each media resource may include:
s2031, inputting the target-level input information and the resource serial number feature information into a target-level fusion module, and guiding the target-level input information and the resource serial number feature information to undergo cross-fusion processing based on the target feature related information among the feature elements in the target-level input information, to obtain the target-level output information. The target-level fusion module is any one of the at least two stages of fusion modules; the first-stage input information is the multi-mode feature information; the input information of each stage from the second to the last is the output information of its preceding stage; and the last-stage output information is the semantic fusion feature information.
Specifically, the target level fusion module may guide the target level input information and the resource serial number feature information to perform cross fusion processing based on the target feature related information between feature elements in the target level input information.
In a specific embodiment, the input information of the first stage fusion module in the at least two stage fusion modules may include: the multi-mode feature information (i.e. the first-level input information) and the resource sequence number feature information, and the output information of the first stage fusion module may be: the feature information obtained by guiding the multi-mode feature information and the resource serial number feature information to perform cross fusion processing based on the target feature related information among the feature elements in the multi-mode feature information (i.e. the first-level output information). The input information of the kth stage fusion module, for any stage from the second stage fusion module to the last stage fusion module, may include: the (k-1)th-level output information (i.e. the kth-level input information) and the resource serial number feature information, and the output information of the kth stage fusion module may be: the feature information obtained by guiding the (k-1)th-level output information and the resource serial number feature information to perform cross fusion processing based on the target feature related information among the feature elements in the (k-1)th-level output information (i.e. the kth-level output information).
In a specific embodiment, the target level input information may be represented as a target level input vector and the target level output information may be represented as a target level output vector.
According to the embodiment, the multi-mode feature information and the resource serial number feature information of the media resource are subjected to semantic fusion through the resource feature fusion model comprising at least two stages of fusion modules which are connected in sequence, so that the semantic fusion feature information corresponding to the media resource is obtained, the deep feature information of the media resource can be effectively extracted, and the accuracy of the semantic fusion feature information on deep semantic representation of the media resource is improved.
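For illustration only, the cascade described above may be sketched as follows (a minimal sketch assuming a PyTorch-style interface; `FusionModule` is an illustrative name for the per-level module detailed further below, not a component named in the disclosure):

```python
import torch
import torch.nn as nn

class ResourceFeatureFusionModel(nn.Module):
    """At least two fusion modules connected in sequence (three by default)."""
    def __init__(self, dim: int, num_levels: int = 3):
        super().__init__()
        self.levels = nn.ModuleList(FusionModule(dim) for _ in range(num_levels))

    def forward(self, multimodal: torch.Tensor, serial: torch.Tensor) -> torch.Tensor:
        x = multimodal             # first-level input information
        for module in self.levels:
            x = module(x, serial)  # each level re-reads the serial number features
        return x                   # last-level output = semantic fusion feature info
```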
In a specific embodiment, the target level fusion module may include: an attention analysis layer, a cross fusion layer and a weighting layer. As shown in fig. 5, inputting the target-level input information and the resource serial number feature information into the target-level fusion module, and guiding the target-level input information and the resource serial number feature information to perform cross fusion processing based on the target feature related information among the feature elements in the target-level input information to obtain the target-level output information may include:
S501, inputting the target-level input information into an attention analysis layer, and performing correlation analysis on characteristic elements in the target-level input information to obtain target characteristic related information.
In particular, the target feature related information may characterize a correlation between a plurality of feature elements in the target level input information. In a specific embodiment, the expression form of the target level input information may be a target level input vector, the plurality of feature elements in the target level input information may include: the vector components corresponding to each of the plurality of vector dimensions in the target-level input vector, and the target feature related information may characterize correlations between the vector components corresponding to each of the plurality of vector dimensions.
In an alternative embodiment, the module algorithm of the attention analysis layer of the kth level fusion module may be expressed as:

$$A_k = \mathrm{softmax}\!\left(\frac{(X_k W_Q)(X_k W_K)^{\top}}{\sqrt{d}}\right)$$

wherein $X_k$ represents the kth level input vector, $d$ represents the size of the dimension of the vector space (optionally, $d$ may be 128), $W_Q$ and $W_K$ are trainable feature mapping parameters (optionally, $W_Q$ and $W_K$ may be feature mapping matrices), and $\mathrm{softmax}$ represents the normalization process.
S502, inputting the target-level input information and the resource serial number characteristic information into a cross fusion layer to perform cross fusion processing, and obtaining target cross characteristic information.
Specifically, the target cross feature information may be feature information obtained by performing cross fusion processing on the target level input information and the resource serial number feature information, and the target cross feature information may represent cross fusion features of the target level input information and the resource serial number feature information. In a specific embodiment, in a case where the expression form of the target level input information is a target level input vector and the expression form of the resource serial number feature information is a resource serial number code vector, the expression form of the target cross feature information may be a target cross feature vector.
In an alternative embodiment, the module algorithm of the cross fusion layer of the kth level fusion module may be expressed as:

$$C_k = (X_k + P)\, W_V$$

wherein $X_k$ represents the kth level input vector, $P$ represents the resource sequence number encoding vector, and $W_V$ is a trainable feature mapping parameter (optionally, $W_V$ may be a feature mapping matrix).
S503, inputting the target feature related information and the target cross feature information into a weighting layer, and carrying out weighted fusion processing on the target cross feature information based on the target feature related information to obtain target-level output information.
Specifically, the target level output information may be feature information obtained by performing weighted fusion processing on the target cross feature information based on the target feature related information. In a specific embodiment, where the representation of the target cross feature information is a target cross feature vector, the representation of the target level output information may be a target level output vector.
In an alternative embodiment, the module algorithm of the weighting layer of the kth level fusion module may be expressed as: $H_k = A_k C_k$.
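Putting S501 to S503 together, one attention fusion unit may be sketched as follows (a non-authoritative reading of the reconstructed formulas above; the class and parameter names are illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusionUnit(nn.Module):
    """Attention analysis layer, cross fusion layer and weighting layer in sequence."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)  # W_V

    def forward(self, x: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # S501: correlations between the feature elements of the level input
        scores = self.w_q(x) @ self.w_k(x).transpose(-2, -1) / math.sqrt(x.size(-1))
        attn = F.softmax(scores, dim=-1)   # A_k
        # S502: cross fusion of the level input with the serial number encoding
        cross = self.w_v(x + p)            # C_k
        # S503: weighted fusion of the cross features by the correlations
        return attn @ cross                # H_k
```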
In an alternative embodiment, the attention analysis layer, the cross fusion layer and the weighting layer may be sequentially connected to obtain an attention fusion unit as shown in fig. 6a. Correspondingly, the target level fusion module may further include a multi-head attention module as shown in fig. 6b, where the multi-head attention module may include multiple parallel attention fusion units and a feature merging unit that merges the output information of the multiple attention fusion units. Each attention fusion unit may cross-fuse the target level input information mapped to a different feature subspace with the resource serial number feature information, and the feature merging unit may merge the output information of the multiple attention fusion units in a weighted splicing manner and may also obtain a more complex feature representation through a linear transformation. In an alternative embodiment, the multi-head attention module may include 8 parallel attention fusion units.
In an alternative embodiment, the module algorithm of the multi-head attention module $M_k$ corresponding to the kth level fusion module may be expressed as:

$$\mathrm{head}_g = \mathrm{AF}_g(X_k, P), \quad g = 1, \dots, r$$

$$M_k = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_r)\, W_O$$

wherein $r$ represents the number of attention fusion units in the multi-head attention module, $\mathrm{AF}_g$ represents the g-th attention fusion unit in the k-th level multi-head attention module, $\mathrm{Concat}$ represents feature stitching, and $W_O$ is a trainable feature mapping parameter (optionally, $W_O$ may be a feature mapping matrix).
In an alternative embodiment, as shown in fig. 6c, the target level fusion module may further include: a layer normalization module and a feature activation module. The layer normalization module may be used to perform layer normalization processing on the output information of the multi-head attention module, and the feature activation module may be used to perform nonlinear mapping on the output information of the layer normalization module. In an alternative embodiment, the activation function adopted by the feature activation module may be ReLU (Rectified Linear Unit); specifically, ReLU sets all inputs from the upper layer that are smaller than 0 to 0, and leaves outputs greater than 0 unchanged.
In an alternative embodiment, the modular algorithm corresponding to the kth stage fusion module may be expressed as:

$$Z_k = \mathrm{Norm}(M_k + X_k)$$

$$F_k = \mathrm{Relu}(Z_k W_1)$$

$$X_{k+1} = F_k W_2$$

wherein $W_1$ and $W_2$ are trainable feature mapping parameters, $\mathrm{Relu}()$ represents the activation function, and $\mathrm{Norm}()$ represents the layer normalization operation.
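Continuing the sketch, one level of the fusion module could then combine the multi-head module with the layer normalization and feature activation modules (the residual wiring around the multi-head output follows the reconstruction above and is an assumption, not spelled out in the text):

```python
class FusionModule(nn.Module):
    """One fusion level: multi-head attention fusion, layer norm, ReLU mapping."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.units = nn.ModuleList(AttentionFusionUnit(dim) for _ in range(heads))
        self.w_o = nn.Linear(dim * heads, dim, bias=False)  # feature merging unit
        self.norm = nn.LayerNorm(dim)                       # layer normalization module
        self.w1 = nn.Linear(dim, dim, bias=False)           # W_1
        self.w2 = nn.Linear(dim, dim, bias=False)           # W_2

    def forward(self, x: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        heads = torch.cat([unit(x, p) for unit in self.units], dim=-1)  # Concat
        z = self.norm(x + self.w_o(heads))       # Z_k (residual is an assumption)
        return self.w2(torch.relu(self.w1(z)))   # X_{k+1} = Relu(Z_k W_1) W_2
```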
As an example, taking a resource feature fusion model including three fusion modules connected in sequence, refer to fig. 6d, which is a schematic structural diagram of a resource feature fusion model provided by an embodiment of the present application.
According to the embodiment, based on the target feature related information among the feature elements in the target-level input information, the cross fusion feature information of the target-level input information and the resource sequence number feature information is weighted to obtain the target-level output information, so that the fusion effect of the target-level input information and the resource sequence number feature information can be effectively improved, and the accuracy of the target-level output information on the representation of the resource scenario sequence feature is improved.
S204, based on the semantic fusion characteristic information corresponding to each media resource, the plurality of media resources are subjected to resource aggregation, and a plurality of media resource sets are determined.
In this embodiment of the present disclosure, the multiple media resource aggregate sets may be multiple aggregate sets obtained by aggregating resources of multiple media resources based on semantic fusion feature information corresponding to each media resource, where each media resource aggregate set includes multiple media resources belonging to the same resource category, and resource categories corresponding to different media resource aggregate sets are different.
In an optional embodiment, the aggregating the plurality of media resources based on the semantic fusion feature information corresponding to each media resource, and determining the plurality of media resource aggregate sets may include:
S2041, respectively inputting semantic fusion characteristic information corresponding to each media resource into a resource classification model to classify the resources, and determining the resource category to which each media resource belongs.
In a specific implementation, the resource classification model may perform resource classification on semantic fusion feature information corresponding to an input media resource, and specifically, the resource classification model may be any machine learning model capable of implementing a resource classification function in the prior art, which illustratively may include, but is not limited to: support Vector Machines (SVMs), deep learning models, and the like.
In a specific embodiment, the semantic fusion feature information corresponding to each media resource may be respectively input into a resource classification model to perform resource classification, so as to obtain class indication information corresponding to each media resource, where the class indication information may be used to indicate probabilities that each media resource belongs to a plurality of preset resource classes, and a preset resource class with the highest corresponding probability in the plurality of preset resource classes is used as a resource class to which each media resource belongs based on the class indication information.
In an alternative embodiment, the expression form of the class indication information may be a class indication vector, taking a plurality of preset resource classes as Q preset resource classes as an example, the class indication vector may be a Q-dimensional vector, the class indication vector includes vector components corresponding to Q vector dimensions, each vector component corresponding to a vector dimension corresponds to one preset resource class, and a numerical value of the vector component corresponding to each vector dimension represents a probability that the media resource belongs to the corresponding preset resource class.
In an alternative embodiment, the number of preset resource categories may be set in combination with the number of preset resource names in the preset resource name lexicon. In an alternative embodiment, the number of preset resource categories may be equal to the number of preset resource names.
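As a small illustrative sketch (not the patent's implementation), picking the resource category from a Q-dimensional class indication vector reduces to an argmax over the per-category probabilities:

```python
import torch

def pick_resource_category(logits: torch.Tensor) -> int:
    """logits: Q-dimensional classifier output, one slot per preset resource category."""
    probs = torch.softmax(logits, dim=-1)  # class indication vector (probabilities)
    return int(probs.argmax())             # preset category with the highest probability
```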
In an alternative embodiment, the resource feature fusion model and the resource classification model may be combined to obtain a media resource cluster model as shown in fig. 6e.
In an alternative embodiment, the resource feature fusion model and the resource classification model may be jointly trained, or the resource feature fusion model and the resource classification model may be independently trained.
In an alternative embodiment, the resource feature fusion model and the resource classification model may be trained using a joint training method as follows:
S1, acquiring sample multi-mode characteristic information corresponding to each sample media resource, sample resource serial number characteristic information corresponding to each sample media resource and labeling resource category information corresponding to each sample media resource in a plurality of sample media resources.
In practical application, before model training is performed, training data may be determined first, specifically, in the embodiment of the present application, sample media resources and labeled resource type information corresponding to the sample media resources may be obtained as training data, and a representation form of the labeled resource type information may be a resource type label, where it may be understood that multiple sample media resources belonging to the same resource name in practical application may be labeled with the same resource type label.
S2, inputting sample multi-modal feature information and sample resource serial number feature information corresponding to each sample media resource into a feature fusion model to be trained, and guiding the sample multi-modal feature information and the sample resource serial number feature information to carry out semantic fusion based on feature related information among feature elements in the sample multi-modal feature information to obtain sample semantic fusion feature information corresponding to each sample media resource.
S3, inputting the sample semantic fusion characteristic information corresponding to each sample media resource into a classification model to be trained for resource classification, and obtaining predicted resource category information corresponding to each sample media resource.
S4, determining category prediction loss information based on the labeling resource category information and the prediction resource category information.
S5, training the feature fusion model to be trained and the classification model to be trained based on the category prediction loss information to obtain the resource feature fusion model and the resource classification model.
In a specific embodiment, determining the category prediction loss information based on the labeling resource category information and the prediction resource category information may include determining category prediction loss information between the labeling resource category information and the prediction resource category information based on a preset loss function.
In a particular embodiment, the category prediction loss information may characterize the difference between the labeling resource category information and the prediction resource category information.
In a particular embodiment, the preset penalty function may include, but is not limited to, a cross entropy penalty function, a logic penalty function, an exponential penalty function, and the like.
Illustratively, taking the preset loss function as a cross entropy loss function as an example, the class prediction loss information may be expressed as the following formula:

$$L = -\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N} y_{mn}\,\log\left(\hat{y}_{mn}\right)$$

wherein $M$ represents the number of sample media resources; $N$ represents the number of preset resource categories; $y_{mn}$ represents the marking information indicating whether the mth sample media resource belongs to the nth preset resource category; and $\hat{y}_{mn}$ represents the prediction information indicating whether the mth sample media resource belongs to the nth preset resource category.
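For illustration, the cross entropy formula above may be computed as follows (a sketch assuming one-hot labels and already-normalized predictions; the tensor names are illustrative):

```python
import torch

def category_prediction_loss(y: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    """y, y_hat: (M, N) tensors of one-hot labels and predicted probabilities."""
    eps = 1e-12                                             # guards log(0)
    return -(y * torch.log(y_hat + eps)).sum(dim=1).mean()  # average over M samples
```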
In an alternative embodiment, training the feature fusion model to be trained and the classification model to be trained based on the class prediction loss information, to obtain the resource feature fusion model and the resource classification model may include:
S6, updating model parameters of the feature fusion model to be trained and model parameters of the classification model to be trained based on the category prediction loss information;
S7, repeating the resource clustering training iteration operation of the steps S2 to S4 and S6 based on the updated feature fusion model to be trained and the updated classification model to be trained until the resource clustering convergence condition is reached; and taking the feature fusion model to be trained and the classification model to be trained which are obtained under the condition that the resource clustering convergence condition is reached as a resource feature fusion model and a resource classification model.
In an optional embodiment, reaching the resource cluster convergence condition may be that the number of training iteration operations reaches a preset training number. Alternatively, reaching the resource cluster convergence condition may be that the current category prediction loss information is less than a specified threshold. In this embodiment of the present disclosure, the preset training number and the specified threshold may be preset in combination with the training speed and accuracy of the network in practical application.
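The joint training iteration S2 to S7 might then look like the following sketch (assuming the models and tensors defined in the previous sketches; `max_steps` and `loss_threshold` stand in for the preset training number and the specified threshold):

```python
optimizer = torch.optim.Adam(
    list(fusion_model.parameters()) + list(classifier.parameters()))

for step in range(max_steps):
    fused = fusion_model(sample_multimodal, sample_serial)  # S2: semantic fusion
    y_hat = torch.softmax(classifier(fused), dim=-1)        # S3: resource classification
    loss = category_prediction_loss(y_onehot, y_hat)        # S4: prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                        # S6: update both models
    if loss.item() < loss_threshold:                        # S7: convergence condition
        break
```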
S2042, the media resources belonging to the same resource category in the plurality of media resources are aggregated to obtain a plurality of media resource sets.
In an alternative embodiment, a feature clustering algorithm may instead be used to perform feature clustering on the semantic fusion feature information corresponding to the multiple media resources to obtain multiple clustering centers, and the media resources whose semantic fusion feature information belongs to the same clustering center are used as one media resource collection. Illustratively, the feature clustering algorithm here may include, but is not limited to: the K-means clustering algorithm, Agglomerative Clustering (hierarchical clustering) and the like, which are not particularly limited in the present application.
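A minimal sketch of this clustering alternative, using scikit-learn's K-means (the variable names are illustrative):

```python
from collections import defaultdict
from sklearn.cluster import KMeans

# semantic_vectors: one semantic fusion feature vector per media resource
kmeans = KMeans(n_clusters=num_collections, n_init=10).fit(semantic_vectors)

collections = defaultdict(list)
for resource_id, center in zip(resource_ids, kmeans.labels_):
    collections[center].append(resource_id)  # same cluster center -> same aggregate
```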
According to the embodiment, the semantic fusion characteristic information corresponding to each media resource is respectively input into the resource classification model to classify the resources, and the resource category to which each media resource belongs is determined, so that the media resources belonging to the same resource category in the plurality of media resources are aggregated to obtain a plurality of media resource aggregate sets, and the accuracy and the efficiency of media resource aggregation can be improved.
S205, based on the multi-mode information corresponding to the media resources in each media resource collection, the media resources in each media resource collection are ordered, and a media resource sequence corresponding to each media resource collection is generated, wherein the media resource sequence carries collection name information of the corresponding media resource collection.
In this embodiment of the present disclosure, the media resource sequence corresponding to each media resource collection may include a plurality of media resources with the same resource name but consecutive resource scenario sequences.
According to the embodiment, the multi-modal information corresponding to each media resource of the target resource type is obtained in an artificial intelligence manner, and feature extraction is performed on each media resource based on the resource name information, the resource serial number information and the release attribute information in the multi-modal information to obtain the multi-modal feature information corresponding to each media resource. The multi-modal feature information and the resource serial number feature information are then guided to perform semantic fusion based on the feature related information among the feature elements in the multi-modal feature information to obtain the semantic fusion feature information corresponding to each media resource, which improves the accuracy of the semantic fusion feature information in representing the deep semantic information of the media resource. Resource aggregation is then performed on the plurality of media resources based on the semantic fusion feature information corresponding to each media resource to determine a plurality of media resource collections, and the media resources in each media resource collection are sorted based on the corresponding multi-modal information to generate a media resource sequence carrying the collection name information, so that the accuracy of media resource aggregation can be improved and the scenario coherence of the generated media resource sequence can be guaranteed.
In an alternative embodiment, the publishing attribute information includes: publishing time. Before the media resources in each media resource collection are sorted based on the multimodal information corresponding to the media resources in each media resource collection to generate the media resource sequence corresponding to each media resource collection, the method may further include:
1) Performing text repetition analysis on any two media resources in each media resource collection to obtain text repetition results corresponding to the two media resources.
In particular, the text repetition result may characterize a text repetition condition between corresponding two media assets.
2) Determining the time interval between the release times corresponding to the two media resources.
3) Determining the length difference information between the resource lengths corresponding to the two media resources.
In particular, the length difference information may characterize a resource length difference between corresponding two media resources. Illustratively, the length difference information may be duration difference information in the case that the resource type of the two media resources is an audio type or a video type, and the length difference information may be space length difference information in the case that the resource type of the two media resources is an image-text type.
4) Performing resource de-duplication processing on each media resource collection based on the text repetition result, the time interval and the length difference information to obtain the de-duplication set corresponding to each media resource collection.
According to the embodiment, based on the text repetition result, the time interval and the length difference information, the resource de-duplication processing is performed on each media resource collection to obtain the de-duplication collection corresponding to each media resource collection, so that the resource aggregation quality in the media resource collection can be further improved.
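For illustration, the three signals above could be combined into a duplicate test such as the following sketch (the thresholds and the `text_overlap` helper are hypothetical, chosen only to show the shape of the check):

```python
def likely_duplicates(a, b,
                      min_text_overlap: float = 0.8,
                      max_interval_hours: float = 48.0,
                      max_length_ratio: float = 0.1) -> bool:
    """a, b: media resources with .text, .publish_time and .length attributes."""
    repeated = text_overlap(a.text, b.text) >= min_text_overlap       # text repetition
    close = abs((a.publish_time - b.publish_time).total_seconds()) \
            <= max_interval_hours * 3600                              # time interval
    similar = abs(a.length - b.length) \
              <= max_length_ratio * max(a.length, b.length)           # length difference
    return repeated and close and similar
```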
In an alternative embodiment, as shown in fig. 7, the resource sequence number information may include: primary sequence number information and secondary sequence number information. The step of sorting the media resources in each media resource collection based on the multi-mode information corresponding to the media resources in each media resource collection, and generating the media resource sequence corresponding to each media resource collection may include:
s2051, splitting each media resource collection based on the first-level serial number information corresponding to the media resources in each media resource collection to obtain a plurality of sub-collections corresponding to each media resource collection, wherein the first-level serial number information corresponding to the media resources in each sub-collection is the same.
S2052, based on the secondary sequence number information corresponding to the media resources in each sub-set, the media resources in each sub-set are ordered, and a media resource sequence corresponding to each sub-set is generated.
Specifically, the resource serial number information corresponding to each media resource may include: primary sequence number information and secondary sequence number information. In practical application, the level of the sequence number information can be determined through the measure-word scope corresponding to the segment words in the sequence number information, or through the order in which the sequence number information appears. Illustratively, where the resource serial number information corresponding to video-type media resource a is expressed as "season 2, episode 3", "season 2" may be used as the primary sequence number information and "episode 3" as the secondary sequence number information; where the resource serial number information corresponding to image-text-type media resource b is expressed as "part 2, section 3", "part 2" may be used as the primary sequence number information and "section 3" as the secondary sequence number information.
As can be seen from the above embodiments, based on the primary sequence number information corresponding to the media resources in each media resource collection, splitting processing is performed on each media resource collection to obtain multiple sub-collections with different primary sequence number information, and based on the secondary sequence number information corresponding to the media resources in each sub-collection, the media resources in each sub-collection are ordered to generate a media resource sequence corresponding to each sub-collection, so that the resource magnitude corresponding to the media resource sequence can be reduced, and the accuracy of resource searching and resource displaying is improved.
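A minimal sketch of S2051 and S2052 follows (the regular expression is illustrative; real serial texts would need the richer parsing described above):

```python
import re
from collections import defaultdict

SEQ = re.compile(r"season\s*(\d+).*?episode\s*(\d+)", re.IGNORECASE)

def split_and_sort(collection):
    """Group by primary sequence number, then order by secondary sequence number."""
    subsets = defaultdict(list)
    for res in collection:
        m = SEQ.search(res.serial_text)  # e.g. "season 2 episode 3"
        subsets[int(m.group(1))].append((int(m.group(2)), res))          # S2051
    return {primary: [r for _, r in sorted(items, key=lambda t: t[0])]   # S2052
            for primary, items in subsets.items()}
```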
In an alternative embodiment, as shown in fig. 8, the multi-modal information may further include: resource comment text information, and the collection name information corresponding to each media resource collection may be obtained in the following manner:
s801, determining common resource segments of media resources in each media resource collection.
In particular, the common resource segments of each media resource collection may be resource segments common to the media resources in the corresponding media resource collection. Illustratively, the common resource segments may include, but are not limited to: a resource opening segment (title sequence), a resource ending segment (credits), and the like.
S802, extracting text from the public resource fragments to obtain public fragment text information.
Specifically, text extraction can be performed by adopting a corresponding text extraction model according to the resource type of the public resource segment to obtain public segment text information, and the text extraction model can be an ASR speech recognition model when the resource type is an audio type, and the text extraction model can be an OCR optical character recognition model when the resource type is a video type or a picture type.
S803, extracting resource names of the resource name information, the resource comment text information and the public segment text information of the media resources in each media resource collection to obtain a plurality of name information.
S804, the name information with highest co-occurrence frequency in the plurality of name information is used as the collection name information corresponding to each media resource collection.
In an alternative embodiment, the resource name information, the resource comment text information and the public segment text information of the media resources in each media resource collection may be input into a text generation class model to perform text generation, so as to obtain collection name information corresponding to each media resource collection. The text generation class model may be any machine learning model that can implement a text generation function in the prior art, which is not particularly limited by the present application, and illustratively, the text generation class model may include: GPT, chatGPT, etc.
As can be seen from the above embodiments, the name information with the highest co-occurrence frequency among the resource name information, the resource comment text information and the public segment text information of the media resources in each media resource collection is used as the collection name information corresponding to each media resource collection, so as to ensure that the collection name information can represent the resource content in the collection.
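For illustration, S803 and S804 reduce to pooling candidate names and keeping the most frequent one (`extract_names` stands in for whatever resource name extraction model is used and is an assumption of this sketch):

```python
from collections import Counter

def collection_name(titles, comments, segment_texts, extract_names) -> str:
    """Pick the name with the highest co-occurrence frequency across all texts."""
    counts = Counter()
    for text in [*titles, *comments, *segment_texts]:
        counts.update(extract_names(text))  # candidate names from one text source
    name, _ = counts.most_common(1)[0]
    return name
```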
Referring to fig. 9, fig. 9 is a schematic diagram of a media resource aggregation system according to an embodiment of the present application. Specifically, the media resource aggregation system may include: machine learning aggregation subsystem, aggregate deduplication subsystem, aggregate inspection subsystem, wherein:
The machine learning aggregation subsystem may be configured to aggregate a plurality of media resources into a plurality of initial resource pools;
The aggregate deduplication subsystem can be used for detecting whether missing and mixed media resources exist in the initial resource collection, so that the mixed resources are removed, missing resources are completed, and a target resource collection is obtained;
The aggregation checking subsystem may be configured to secondarily modify global attribute information (resource sequence information, aggregate name information) of the target resource aggregate, arrange media resources in the target resource aggregate in a scenario-coherent order, and output aggregate names that may represent all resource contents in the target resource aggregate.
In a particular embodiment, the machine learning aggregation subsystem may include: the multi-modal information extraction unit may be configured to extract multi-modal information of each media resource, and perform a secondary check with a specific regular expression, where the multi-modal information may include: resource name information, resource serial number information, release attribute information, key image information, and the like; the resource preliminary aggregation unit may determine a resource category of each media resource through a media resource cluster model as shown in fig. 6e, so that media resources belonging to the same resource category form a resource aggregate.
In a specific embodiment, the aggregate deduplication subsystem may include: a missing-order retrieval unit, a resource deduplication unit, an aggregate splitting unit and a resource filtering unit. The missing-order retrieval unit may be used to judge, according to the sequence number continuity of the media resources in each aggregate, whether a media resource with a certain sequence number is missing, and then search, according to the resource publication order of the resource publication object, whether the published resources of that object contain a follow-up resource whose scenario sequence number links up, where the resource texts of the preceding and following resources satisfy the same regular expression. The resource deduplication unit may be used to remove repeated resources in each aggregate; when judging whether resources are repeated, the repetition degree of texts such as titles, covers and resource contents, the closeness of the resource publishing time interval and the resource durations may be considered. In addition, a resource publishing object may repeatedly publish resources with the same sequence number, where one of the two resources is long and the other short; in this case, one resource needs to be removed according to the resource duration, the publishing time and the content semantic relevance. The aggregate splitting unit may split media resources belonging to different levels of sequence numbers in a resource aggregate into a plurality of sub-aggregates. The resource filtering unit may be used to filter the preview resources and the clip resources out of the resource aggregate according to the sequence numbers, the resource durations and the release time intervals.
In a particular embodiment, the aggregate inspection subsystem may include: a resource sequence correction unit and an aggregate name checking unit. The resource sequence correction unit may be used to detect, according to the sequence number continuity within each resource aggregate, whether resources with an incorrect order exist in the aggregate; when inconsistent sequence numbers are detected, the resources in the aggregate need to be re-sorted according to their release times. The aggregate name checking unit may be used to extract, from the resource names, the common resource segments and the resource comment information of the resources in the aggregate, the name with the highest co-occurrence frequency as the aggregate name of the resource aggregate.
In an alternative embodiment, as shown in fig. 10, the method may further include:
S206, determining target collection name information matched with the target search word in response to the resource search request carrying the target search word.
Specifically, under the condition that the target search word is a complete search word, the collection name information containing the target search word is used as the target collection name information; in the case where the target search word is an abbreviation search word, the collection name information whose abbreviation agrees with the target search word is taken as the target collection name information.
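A small sketch of this matching rule follows (the `abbreviations` field is an assumed alias list for each collection, not a structure defined in the disclosure):

```python
def match_collection_names(query: str, collections) -> list:
    """Full search words match by containment; abbreviations match an alias list."""
    hits = [c.name for c in collections if query in c.name]  # complete search word
    if not hits:
        hits = [c.name for c in collections
                if query in getattr(c, "abbreviations", ())]  # abbreviation word
    return hits
```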
S207, displaying the media resource sequence corresponding to the target collection name information.
According to the embodiment, a plurality of media resources related in scenario content are aggregated into a media resource sequence carrying the aggregate name information according to the scenario development order, so that when a resource browsing object searches for a target search word, it can directly reach the media resource sequence corresponding to the target search word, thereby enjoying a resource browsing experience with coherent scenario development.
According to the technical scheme provided by the embodiment of the application, the multi-modal information corresponding to each media resource in the plurality of media resources of the target resource type is obtained in an artificial intelligence manner, and feature extraction is performed on each media resource based on the resource name information, the resource serial number information and the release attribute information in the multi-modal information to obtain the multi-modal feature information corresponding to each media resource. The multi-modal feature information and the resource serial number feature information are then guided to perform semantic fusion based on the feature related information among the feature elements in the multi-modal feature information, obtaining the semantic fusion feature information corresponding to each media resource, which improves the accuracy of the semantic fusion feature information in representing the deep semantic information of the media resource. Resource aggregation is then performed on the plurality of media resources based on the semantic fusion feature information to determine a plurality of media resource collections, and the media resources in each collection are sorted based on the corresponding multi-modal information to generate a media resource sequence carrying the collection name information. In this way, the accuracy of media resource aggregation is improved, the generated media resource sequence is coherent in scenario development, and a resource browsing object can directly reach the media resource sequence matched with its search word.
The embodiment of the application also provides a media resource aggregation device, as shown in fig. 11, which may include:
The multi-mode information obtaining module 1110 is configured to obtain multi-mode information corresponding to each media resource in the plurality of media resources of the target resource type, where the multi-mode information includes: resource name information, resource serial number information and release attribute information;
The feature extraction module 1120 is configured to perform feature extraction on each media resource based on the resource name information, the resource serial number information, and the release attribute information, so as to obtain multi-mode feature information corresponding to each media resource;
The semantic fusion module 1130 is configured to input resource serial number feature information corresponding to the multi-mode feature information and the resource serial number information into a resource feature fusion model, and guide the multi-mode feature information and the resource serial number feature information to perform semantic fusion based on feature related information between feature elements in the multi-mode feature information, so as to obtain semantic fusion feature information corresponding to each media resource;
A resource aggregation module 1140, configured to aggregate resources of the plurality of media resources based on the semantic fusion feature information corresponding to each media resource, and determine a plurality of media resource aggregate sets;
The sequence generating module 1150 is configured to sort the media resources in each media resource collection based on the multimodal information corresponding to the media resources in each media resource collection, and generate a media resource sequence corresponding to each media resource collection, where the media resource sequence carries collection name information of the corresponding media resource collection.
In a specific embodiment, the feature extraction module 1120 may include:
the name feature extraction unit is used for carrying out text semantic extraction on the resource name information to obtain the resource name feature information;
The release characteristic extraction unit is used for extracting text semantics of the release attribute information to obtain release attribute characteristic information;
the serial number feature extraction unit is used for carrying out position coding on the resource serial number information to obtain the resource serial number feature information;
The first feature stitching unit is used for performing feature stitching based on the resource name feature information, the release attribute feature information and the resource serial number feature information to obtain multi-mode feature information.
In an alternative embodiment, the multi-modal information may further include: key image information, and the feature extraction module 1120 may further include:
The image feature extraction unit is used for extracting image features of the key image information to obtain image feature information;
The first feature stitching unit may include:
a second feature stitching unit, configured to perform feature stitching on the resource name feature information, the release attribute feature information, the resource serial number feature information and the image feature information to obtain the multi-mode feature information.
In a specific embodiment, the resource feature fusion model may include: at least two levels of fusion modules connected in sequence, the semantic fusion module 1130 may include:
The target feature fusion unit is used for inputting the target-level input information and the resource serial number feature information into the target-level fusion module, and guiding the target-level input information and the resource serial number feature information to perform cross fusion processing based on the target feature related information among the feature elements in the target-level input information to obtain the target-level output information, wherein the target-level fusion module is any one of the at least two levels of fusion modules, the first-level input information is the multi-mode feature information, the input information of any level from the second level to the last level is the output information of its preceding level, and the last-level output information is the semantic fusion feature information.
In a specific embodiment, the target level fusion module may include: an attention analysis layer, a cross fusion layer and a weighting layer, and the target feature fusion unit may include:
the attention analysis unit is used for inputting the target-level input information into the attention analysis layer, and carrying out correlation analysis on characteristic elements in the target-level input information to obtain target characteristic related information;
The cross fusion unit is used for inputting the target-level input information and the resource serial number characteristic information into the cross fusion layer to perform cross fusion processing to obtain target cross characteristic information;
the weighting unit is used for inputting the target feature related information and the target cross feature information into the weighting layer, and carrying out weighted fusion processing on the target cross feature information based on the target feature related information to obtain target-level output information.
In an alternative embodiment, the resource aggregation module 1140 may include:
the resource classification unit is used for respectively inputting semantic fusion characteristic information corresponding to each media resource into the resource classification model to carry out resource classification and determining the resource category to which each media resource belongs;
And the aggregation processing unit is used for carrying out aggregation processing on the media resources belonging to the same resource category in the plurality of media resources to obtain a plurality of media resource aggregate.
In an alternative embodiment, the publishing attribute information includes: publishing time, and the device may further include:
The text repetition analysis module is used for performing text repetition analysis on any two media resources in each media resource collection to obtain text repetition results corresponding to any two media resources;
the time interval determining module is used for determining the time interval between the release times corresponding to any two media resources;
the length difference determining module is used for determining length difference information between the resource lengths corresponding to any two media resources;
And the resource de-duplication module is used for performing resource de-duplication processing on each media resource collection based on the text repetition result, the time interval and the length difference information to obtain a de-duplication set corresponding to each media resource collection.
In an alternative embodiment, the resource sequence number information may include: primary sequence number information and secondary sequence number information, the sequence generating module 1150 may include:
the aggregation splitting unit is used for splitting each media resource aggregation based on the first-level serial number information corresponding to the media resources in each media resource aggregation to obtain a plurality of sub-aggregation sets corresponding to each media resource aggregation, and the first-level serial number information corresponding to the media resources in each sub-aggregation set is the same;
And the sub-collection resource ordering unit is used for ordering the media resources in each sub-collection based on the secondary sequence number information corresponding to the media resources in each sub-collection, and generating a media resource sequence corresponding to each sub-collection.
In an alternative embodiment, the multi-modal information may further include: resource comment text information, and the collection name information corresponding to each media resource collection may be obtained by the following devices:
A common resource segment determining module, configured to determine a common resource segment of the media resources in each media resource aggregate;
The segment text extraction module is used for extracting the text of the public resource segment to obtain public segment text information;
The resource name extraction module is used for extracting resource names of the resource name information, the resource comment text information and the public segment text information of the media resources in each media resource collection to obtain a plurality of name information;
And the aggregate name determining module is used for taking the name information with highest co-occurrence frequency in the plurality of name information as aggregate name information corresponding to each media resource aggregate.
In an alternative embodiment, the apparatus may further include:
the target collection name information determining module is used for responding to a resource searching request carrying target searching words and determining target collection name information matched with the target searching words;
and the resource sequence display module is used for displaying the media resource sequence corresponding to the target collection name information.
It should be noted that the apparatus embodiments and the method embodiments are based on the same inventive concept.
The embodiment of the application provides media resource aggregation equipment, which comprises a processor and a memory, wherein at least one instruction or at least one section of program is stored in the memory, and the at least one instruction or the at least one section of program is loaded and executed by the processor to realize the media resource aggregation method provided by the embodiment of the method.
Further, fig. 12 is a schematic hardware structure diagram of a media resource aggregation device for implementing the media resource aggregation method provided by the embodiment of the present application, where the media resource aggregation device may participate in forming or include the media resource aggregation apparatus provided by the embodiment of the present application. As shown in fig. 12, the media resource aggregation device 120 may include one or more processors 1202 (shown in fig. 12 as 1202a, 1202b, ..., 1202n; the processor 1202 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 1204 for storing data, and a transmission device 1206 for communication functions. In addition, the device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 12 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the media resource aggregation device 120 may also include more or fewer components than shown in fig. 12, or have a different configuration than shown in fig. 12.
It should be noted that the one or more processors 1202 and/or other data processing circuits described above may be referred to herein generally as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the media resource aggregation device 120 (or mobile device). As referred to in embodiments of the application, the data processing circuit acts as a kind of processor control (e.g., selection of the path of the variable resistor terminal connected to the interface).
The memory 1204 may be used for storing software programs and modules of application software, and the processor 1202 executes the software programs and modules stored in the memory 1204 to perform various functional applications and data processing, i.e., implement a media resource aggregation method according to the embodiments of the present application. Memory 1204 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1204 may further include memory remotely located relative to the processor 1202, which may be connected to the media asset aggregation device 120 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 1206 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the media resource aggregation device 120. In one example, the transmission device 1206 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices via a base station to communicate with the internet. In one embodiment, the transmission device 1206 may be a radio frequency (Radio Frequency, RF) module for communicating wirelessly with the internet.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the media asset aggregation device 120 (or mobile device).
Embodiments of the present application also provide a computer readable storage medium that may be provided in a media asset aggregation apparatus to hold at least one instruction or at least one program for implementing the media asset aggregation method provided by the above-described method embodiments, the at least one instruction or the at least one program being loaded and executed by the processor.
Alternatively, in this embodiment, the storage medium may be located in at least one network server among a plurality of network servers of the computer network. Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform a media asset aggregation method as provided by the method embodiments.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
It should be noted that: the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for the apparatus and device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, with reference to the description of the method embodiments in part.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.
Claims (19)
1. A method of aggregating media assets, the method comprising:
acquiring multi-modal information corresponding to each media resource in a plurality of media resources of a target resource type, wherein the multi-modal information comprises resource name information, resource serial number information, and release attribute information, and the resource serial number information represents the order of the corresponding media resource in the scenario development of the overall resource;
performing feature extraction on each media resource based on the resource name information, the resource serial number information, and the release attribute information to obtain multi-modal feature information corresponding to each media resource;
inputting target-level input information into an attention analysis layer in a target-level fusion module, and performing correlation analysis on feature elements in the target-level input information to obtain target feature related information;
inputting the target-level input information and resource serial number feature information corresponding to the resource serial number information into a cross fusion layer in the target-level fusion module for cross fusion processing to obtain target cross feature information;
inputting the target feature related information and the target cross feature information into a weighting layer in the target-level fusion module, and performing weighted fusion processing on the target cross feature information based on the target feature related information to obtain target-level output information, wherein the target-level fusion module is any one of at least two levels of fusion modules, the first-level input information is the multi-modal feature information corresponding to each media resource, the input information of any level from the second level to the last level is the output information of the immediately preceding level, and the last-level output information is the semantic fusion feature information corresponding to each media resource;
performing resource aggregation on the plurality of media resources based on the semantic fusion feature information corresponding to each media resource, and determining a plurality of media resource collections, wherein scenario relevance exists among the media resources in each media resource collection;
and sorting the media resources in each media resource collection based on the multi-modal information corresponding to the media resources in each media resource collection, and generating a media resource sequence corresponding to each media resource collection, wherein the media resource sequence carries collection name information of the corresponding media resource collection.
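For illustration, the hierarchical fusion recited in claim 1 can be sketched as follows; the layer names follow the claim, but the dimensions, the sigmoid gating choice, and the random matrices standing in for trained weights are assumptions of this sketch, not details fixed by the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class FusionLevel:
    """One target-level fusion module: attention analysis over the level's
    input features, cross fusion with the resource serial number features,
    then a weighted (gated) fusion of the two results."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random projections stand in for trained weights.
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wc = rng.standard_normal((2 * dim, dim)) / np.sqrt(2 * dim)

    def __call__(self, x, seq_feat):
        # x, seq_feat: (n_elements, dim) arrays.
        # Attention analysis layer: correlations among feature elements.
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        related = softmax(q @ k.T / np.sqrt(x.shape[1])) @ v
        # Cross fusion layer: merge the level input with serial number features.
        cross = np.concatenate([x, seq_feat], axis=-1) @ self.wc
        # Weighting layer: gate the cross features by the related features.
        return 1.0 / (1.0 + np.exp(-related)) * cross

# At least two stacked levels; the last output is the semantic fusion feature.
levels = [FusionLevel(dim=8, seed=s) for s in (0, 1)]
x = np.random.default_rng(2).standard_normal((4, 8))    # multi-modal features
seq = np.random.default_rng(3).standard_normal((4, 8))  # serial number features
for level in levels:
    x = level(x, seq)
```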
2. The method of claim 1, wherein the performing feature extraction on each media resource based on the resource name information, the resource serial number information, and the release attribute information to obtain multi-modal feature information corresponding to each media resource comprises:
performing text semantic extraction on the resource name information to obtain resource name feature information;
performing text semantic extraction on the release attribute information to obtain release attribute feature information;
performing position coding on the resource serial number information to obtain resource serial number feature information;
and performing feature stitching based on the resource name feature information, the release attribute feature information, and the resource serial number feature information to obtain the multi-modal feature information.
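A minimal sketch of claim 2's position coding and feature stitching, assuming sinusoidal position codes (the claim does not fix the encoding scheme or the dimensions used here):

```python
import numpy as np

def position_encode(serial_no, dim=16):
    """Sinusoidal position code for a resource serial number, so that
    scenario order is reflected in the feature geometry."""
    freq = 1.0 / (10000 ** (2 * np.arange(dim // 2) / dim))
    angles = serial_no * freq
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Feature stitching: concatenate the per-source vectors into one feature.
name_feat = np.random.default_rng(0).random(32)  # stand-in name semantics
attr_feat = np.random.default_rng(1).random(32)  # stand-in release-attribute semantics
multi_modal_feat = np.concatenate([name_feat, attr_feat, position_encode(5)])
```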
3. The method of claim 2, wherein the multi-modal information further comprises key image information, and before the feature stitching is performed based on the resource name feature information, the release attribute feature information, and the resource serial number feature information to obtain the multi-modal feature information, the method further comprises:
extracting image features from the key image information to obtain image feature information;
wherein the performing feature stitching based on the resource name feature information, the release attribute feature information, and the resource serial number feature information to obtain the multi-modal feature information comprises:
performing feature stitching on the resource name feature information, the release attribute feature information, the resource serial number feature information, and the image feature information to obtain the multi-modal feature information.
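For claim 3's image branch, a trained vision encoder would ordinarily supply the image feature information; the toy stand-in below (per-channel histograms of a hypothetical key frame) only shows where that vector joins the stitching:

```python
import numpy as np

def image_feature(key_frame):
    """Toy image feature: normalized per-channel histograms of an
    (H, W, 3) uint8 key frame. A learned encoder would replace this."""
    hists = [np.histogram(key_frame[..., c], bins=16, range=(0, 256))[0]
             for c in range(3)]
    feat = np.concatenate(hists).astype(float)
    return feat / max(1.0, feat.sum())

frame = np.zeros((64, 64, 3), dtype=np.uint8)  # placeholder key frame
img_feat = image_feature(frame)  # stitched in alongside the other features
```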
4. The method of claim 1, wherein the performing resource aggregation on the plurality of media resources based on the semantic fusion feature information corresponding to each media resource and determining a plurality of media resource collections comprises:
inputting the semantic fusion feature information corresponding to each media resource into a resource classification model for resource classification, and determining the resource category to which each media resource belongs;
and aggregating the media resources belonging to the same resource category among the plurality of media resources to obtain the plurality of media resource collections.
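Claim 4 reduces aggregation to classify-then-group. A minimal sketch, in which the classifier and the field names are hypothetical stand-ins for the resource classification model:

```python
from collections import defaultdict

def aggregate_by_category(resources, classify):
    """Group media resources by the category the resource classification
    model assigns to their semantic fusion features."""
    collections = defaultdict(list)
    for res in resources:
        collections[classify(res["fusion_feature"])].append(res["id"])
    return dict(collections)

resources = [
    {"id": "ep1", "fusion_feature": [0.9, 0.1]},
    {"id": "ep2", "fusion_feature": [0.8, 0.2]},
    {"id": "clip", "fusion_feature": [0.1, 0.9]},
]
classify = lambda f: "series_A" if f[0] > 0.5 else "series_B"
print(aggregate_by_category(resources, classify))
# {'series_A': ['ep1', 'ep2'], 'series_B': ['clip']}
```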
5. The method of claim 1, wherein the release attribute information comprises a release time, and before the sorting of the media resources in each media resource collection based on the multi-modal information corresponding to the media resources in each media resource collection and the generating of the media resource sequence corresponding to each media resource collection, the method further comprises:
performing text repetition analysis on any two media resources in each media resource collection to obtain a text repetition result corresponding to the two media resources;
determining a time interval between release times corresponding to the two media resources;
determining length difference information between resource lengths corresponding to the two media resources;
and performing resource de-duplication processing on each media resource collection based on the text repetition result, the time interval, and the length difference information to obtain a de-duplicated set corresponding to each media resource collection.
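One plausible way to combine claim 5's three de-duplication signals; the token-set similarity and all thresholds below are illustrative assumptions:

```python
def jaccard(text_a, text_b):
    """Token-set overlap as a stand-in for text repetition analysis."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / max(1, len(a | b))

def is_duplicate(a, b, max_gap_hours=24.0, max_len_ratio=0.05, min_sim=0.9):
    """Flag two resources as duplicates when text repetition is high,
    release times are close, and resource lengths nearly match."""
    gap_h = abs(a["release_ts"] - b["release_ts"]) / 3600.0
    len_ratio = abs(a["length"] - b["length"]) / max(a["length"], b["length"])
    return (jaccard(a["text"], b["text"]) >= min_sim
            and gap_h <= max_gap_hours
            and len_ratio <= max_len_ratio)
```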
6. The method of claim 1, wherein the resource serial number information comprises first-level serial number information and second-level serial number information, and the sorting of the media resources in each media resource collection based on the multi-modal information corresponding to the media resources in each media resource collection and the generating of the media resource sequence corresponding to each media resource collection comprise:
splitting each media resource collection based on the first-level serial number information corresponding to the media resources in the collection to obtain a plurality of sub-collections, wherein the first-level serial number information corresponding to the media resources in each sub-collection is the same;
and sorting the media resources in each sub-collection based on the second-level serial number information corresponding to those media resources, and generating a media resource sequence corresponding to each sub-collection.
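Claim 6's two-level ordering, read as season and episode numbers (the field names are assumptions of this sketch):

```python
from itertools import groupby

def order_collection(resources):
    """Split a collection by first-level serial number (e.g. season) and
    sort each sub-collection by second-level serial number (e.g. episode)."""
    ordered = sorted(resources, key=lambda r: (r["season"], r["episode"]))
    return {season: [r["id"] for r in group]
            for season, group in groupby(ordered, key=lambda r: r["season"])}

eps = [{"id": "s2e1", "season": 2, "episode": 1},
       {"id": "s1e2", "season": 1, "episode": 2},
       {"id": "s1e1", "season": 1, "episode": 1}]
print(order_collection(eps))  # {1: ['s1e1', 's1e2'], 2: ['s2e1']}
```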
7. The method of claim 1, wherein the multi-modal information further comprises resource comment text information, and the method further comprises:
determining a common resource segment of the media resources in each media resource collection;
performing text extraction on the common resource segment to obtain common segment text information;
extracting resource names from the resource name information, the resource comment text information, and the common segment text information of the media resources in each media resource collection to obtain a plurality of pieces of name information;
and taking the name information with the highest co-occurrence frequency among the plurality of pieces of name information as the collection name information corresponding to each media resource collection.
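Once candidate names have been extracted from titles, comments, and the common-segment text, claim 7's naming rule is a frequency vote; a minimal sketch:

```python
from collections import Counter

def pick_collection_name(candidate_names):
    """Return the candidate name with the highest co-occurrence frequency;
    ties resolve to the first-seen candidate here."""
    return Counter(candidate_names).most_common(1)[0][0] if candidate_names else None

names = ["Show X", "Show X", "Show X S1", "Show X"]
print(pick_collection_name(names))  # 'Show X'
```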
8. The method according to any one of claims 1 to 7, further comprising:
responding to a resource search request carrying a target search word, and determining target collection name information matched with the target search word;
and displaying the media resource sequence corresponding to the target collection name information.
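Claim 8's search step amounts to matching the target search word against collection names and returning the stored sequences; a sketch with a hypothetical in-memory index:

```python
def find_sequences(search_word, name_index):
    """name_index maps collection name -> ordered media resource sequence;
    return the sequences whose name matches the target search word."""
    q = search_word.lower()
    return [seq for name, seq in name_index.items() if q in name.lower()]

index = {"Show X": ["s1e1", "s1e2"], "Show Y": ["e1"]}
print(find_sequences("show x", index))  # [['s1e1', 's1e2']]
```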
9. A media resource aggregation apparatus, the apparatus comprising:
a multi-modal information acquisition module, configured to acquire multi-modal information corresponding to each media resource in a plurality of media resources of a target resource type, wherein the multi-modal information comprises resource name information, resource serial number information, and release attribute information, and the resource serial number information represents the order of the corresponding media resource in the scenario development of the overall resource;
a feature extraction module, configured to perform feature extraction on each media resource based on the resource name information, the resource serial number information, and the release attribute information to obtain multi-modal feature information corresponding to each media resource;
a semantic fusion module, configured to input target-level input information into an attention analysis layer in a target-level fusion module and perform correlation analysis on feature elements in the target-level input information to obtain target feature related information; input the target-level input information and resource serial number feature information corresponding to the resource serial number information into a cross fusion layer in the target-level fusion module for cross fusion processing to obtain target cross feature information; and input the target feature related information and the target cross feature information into a weighting layer in the target-level fusion module and perform weighted fusion processing on the target cross feature information based on the target feature related information to obtain target-level output information, wherein the target-level fusion module is any one of at least two levels of fusion modules, the first-level input information is the multi-modal feature information corresponding to each media resource, the input information of any level from the second level to the last level is the output information of the immediately preceding level, and the last-level output information is the semantic fusion feature information corresponding to each media resource;
a resource aggregation module, configured to perform resource aggregation on the plurality of media resources based on the semantic fusion feature information corresponding to each media resource and determine a plurality of media resource collections, wherein scenario relevance exists among the media resources in each media resource collection;
and a sequence generation module, configured to sort the media resources in each media resource collection based on the multi-modal information corresponding to the media resources in each media resource collection and generate a media resource sequence corresponding to each media resource collection, wherein the media resource sequence carries collection name information of the corresponding media resource collection.
10. The apparatus of claim 9, wherein the feature extraction module comprises:
a name feature extraction unit, configured to perform text semantic extraction on the resource name information to obtain resource name feature information;
a release feature extraction unit, configured to perform text semantic extraction on the release attribute information to obtain release attribute feature information;
a serial number feature extraction unit, configured to perform position coding on the resource serial number information to obtain the resource serial number feature information;
and a first feature stitching unit, configured to perform feature stitching based on the resource name feature information, the release attribute feature information, and the resource serial number feature information to obtain the multi-modal feature information.
11. The apparatus of claim 10, wherein the multi-modal information further comprises key image information, and the feature extraction module further comprises:
an image feature extraction unit, configured to extract image features from the key image information to obtain image feature information;
wherein the first feature stitching unit comprises:
a second feature stitching unit, configured to perform feature stitching on the resource name feature information, the release attribute feature information, the resource serial number feature information, and the image feature information to obtain the multi-modal feature information.
12. The apparatus of claim 9, wherein the resource aggregation module comprises:
a resource classification unit, configured to input the semantic fusion feature information corresponding to each media resource into a resource classification model for resource classification and determine the resource category to which each media resource belongs;
and an aggregation processing unit, configured to aggregate the media resources belonging to the same resource category among the plurality of media resources to obtain the plurality of media resource collections.
13. The apparatus of claim 9, wherein the release attribute information comprises a release time, and the apparatus further comprises:
a text repetition analysis module, configured to perform text repetition analysis on any two media resources in each media resource collection to obtain a text repetition result corresponding to the two media resources;
a time interval determining module, configured to determine a time interval between release times corresponding to the two media resources;
a length difference determining module, configured to determine length difference information between resource lengths corresponding to the two media resources;
and a resource de-duplication module, configured to perform resource de-duplication processing on each media resource collection based on the text repetition result, the time interval, and the length difference information to obtain a de-duplicated set corresponding to each media resource collection.
14. The apparatus of claim 9, wherein the resource serial number information comprises first-level serial number information and second-level serial number information, and the sequence generation module comprises:
a collection splitting unit, configured to split each media resource collection based on the first-level serial number information corresponding to the media resources in the collection to obtain a plurality of sub-collections, wherein the first-level serial number information corresponding to the media resources in each sub-collection is the same;
and a sub-collection resource sorting unit, configured to sort the media resources in each sub-collection based on the second-level serial number information corresponding to those media resources and generate a media resource sequence corresponding to each sub-collection.
15. The apparatus of claim 9, wherein the multi-modal information further comprises resource comment text information, and the apparatus further comprises:
a common resource segment determining module, configured to determine a common resource segment of the media resources in each media resource collection;
a segment text extraction module, configured to perform text extraction on the common resource segment to obtain common segment text information;
a resource name extraction module, configured to extract resource names from the resource name information, the resource comment text information, and the common segment text information of the media resources in each media resource collection to obtain a plurality of pieces of name information;
and a collection name determining module, configured to take the name information with the highest co-occurrence frequency among the plurality of pieces of name information as the collection name information corresponding to each media resource collection.
16. The apparatus according to any one of claims 9 to 15, further comprising:
a target collection name information determining module, configured to determine, in response to a resource search request carrying a target search word, target collection name information matched with the target search word;
and a resource sequence display module, configured to display the media resource sequence corresponding to the target collection name information.
17. A media resource aggregation device, comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the media resource aggregation method according to any one of claims 1 to 8.
18. A computer-readable storage medium having at least one instruction or at least one program stored therein, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the media resource aggregation method of any one of claims 1 to 8.
19. A computer program product comprising at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the media resource aggregation method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410635095.8A CN118227910B (en) | 2024-05-22 | 2024-05-22 | Media resource aggregation method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date
---|---
CN118227910A (en) | 2024-06-21
CN118227910B (en) | 2024-08-20
Family
ID=91507671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202410635095.8A (CN118227910B, Active) | Media resource aggregation method, device, equipment and storage medium | 2024-05-22 | 2024-05-22
Country Status (1)
Country | Link
---|---
CN (1) | CN118227910B (en)
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112231275A (en) * | 2019-07-14 | 2021-01-15 | 阿里巴巴集团控股有限公司 | Multimedia file classification, information processing and model training method, system and equipment |
CN117648504A (en) * | 2022-08-15 | 2024-03-05 | 腾讯科技(深圳)有限公司 | Method, device, computer equipment and storage medium for generating media resource sequence |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111159435B (en) * | 2019-12-27 | 2023-09-05 | 新方正控股发展有限责任公司 | Multimedia resource processing method, system, terminal and computer readable storage medium |
CN112818906B (en) * | 2021-02-22 | 2023-07-11 | 浙江传媒学院 | An intelligent cataloging method for all-media news based on multi-modal information fusion understanding |
CN116450860A (en) * | 2023-03-08 | 2023-07-18 | 清华大学 | Media resource recommendation method, recommendation model training method and related equipment |
CN117789680B (en) * | 2024-02-23 | 2024-05-24 | 青岛海尔科技有限公司 | Method, device and storage medium for generating multimedia resources based on large model |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant