CN120730095B

CN120730095B - OTT content operation optimization method based on multi-mode perception and agent decision

Info

Publication number: CN120730095B
Application number: CN202511156433.0A
Authority: CN
Inventors: 崔峥; 郑卫飞; 钟军
Original assignee: Hangzhou Hua Zhi Screen Information Technology Co ltd
Current assignee: Hangzhou Hua Zhi Screen Information Technology Co ltd
Priority date: 2025-08-19
Filing date: 2025-08-19
Publication date: 2025-12-05
Anticipated expiration: 2045-08-19
Also published as: CN120730095A

Abstract

This invention relates to the field of image communication, specifically to an OTT content operation optimization method based on multimodal perception and intelligent agent decision-making. The method includes synchronously sampling the baseband video signal output by the OTT client to present an interactive guide; parsing the component content of the screen image through multimodal analysis; constructing a business model representing service access points and transition relationships through navigation engine detection; and generating a strategy evaluation report on the presentation method and content distribution efficiency of television content through in-depth analysis and historical comparison of the business model. This invention proposes the participation of multimodal and intelligent agents in the operation process, improving the efficiency of automated acquisition and large-scale analysis of video client content through in-depth acquisition of video images and comprehensive summarization of text.

Description

OTT content operation optimization method based on multi-mode perception and agent decision

Technical Field

The invention relates to the field of image communication, in particular to an OTT content operation optimization method based on multi-mode perception and agent decision.

Background

In an interactive video system, the architecture typically includes a headend system responsible for managing and distributing content and a large number of client devices that receive and present it. On the client device, the system presents a graphical user interface (GUI) through which the user interacts with the video service. At the heart of this GUI is typically an interactive program guide (IPG), which presents the available video resources as hierarchical menus, sorted lists, content recommendations, and the like. The interface layout, content organization, and dynamic changes of the IPG are therefore not only central to the user experience of the service but also directly reflect the content supply at a given point in time. When the content presentation of a video service needs to be systematically cataloged or analyzed, the prior art relies mainly on manual operation: an operator simulates an end user, manually navigates the client GUI with a device such as a remote control, and records the content through screenshots or video recordings for later analysis.

The prior art faces several technical bottlenecks when attempting to analyze the content of a video service client automatically. The first is the information black box problem of the client interface: in the technical flow, structured content information at the service headend loses its original data structure once it has been encoded, transmitted, and finally rendered by the client device into a visual image on the screen. From the outside, only the final pixel stream can be perceived, and the data structure behind it cannot be directly accessed or queried, which makes the client interface a "black box" on which data exploration cannot be performed directly. Second, the existing analysis approach lacks the capability for large-scale, systematic comparison and cannot carry out cross-platform comparison or differential analysis of the operation content, column hot spots, and featured resources of different OTT platforms or television manufacturers. Third, the prior art has difficulty understanding accurately and comprehensively the complex layouts and mixed graphic-and-text content on the television screen, as well as the operational meaning behind them.

Therefore, an OTT content operation optimization method based on multi-mode perception and agent decision is provided.

Disclosure of Invention

The invention aims to provide an OTT content operation optimization method based on multi-mode perception and agent decision so as to solve the problems in the background technology.

In order to achieve the above purpose, the invention provides the technical scheme that the OTT content operation optimization method based on multi-mode perception and agent decision comprises the following steps:

Sampling and digitizing a baseband video signal output by an OTT network television client, wherein the baseband video signal is used for presenting visual contents of an interactive program guide;

analyzing two-dimensional pixel information in the baseband video signal into a screen content set containing service information and content metadata through a multi-mode model, generating and sending a navigation instruction signal to an OTT network television client through an automatic exploration engine, and synchronously sampling the baseband video signal returned subsequently to construct a complete interaction sequence;

Correlating and aggregating a plurality of interactive sequences obtained through cyclic exploration to generate a television service model of a service access point and a transition relation thereof;

Performing strategy generation by a decision analysis agent, finally assembling all analysis, comparison, and generation results, and outputting model data and a strategy evaluation report.

Preferably, the specific implementation process for sampling and digitizing the baseband video signal of the OTT network television client includes:

The OTT network television client generates a baseband video signal, and a video acquisition device converts it to digital form in real time, producing static video frames that represent the interactive program guide picture displayed by the terminal.

Preferably, the specific implementation process of parsing the baseband video signal into the screen content set through the multi-mode model includes:

A static video frame is input into the multi-mode model, which recovers and extracts, from the pixel domain of the television picture, the image components that were encoded as visual elements during client rendering. The graphic elements carrying service information in the picture are identified and classified, and position information in the screen coordinate system is calculated and output for each identified graphic element. Optical character recognition is performed on text elements to extract their character string content. The classification, character string content, and position information of all elements are then compiled into a screen content set describing the display content of the static video frame.

Preferably, the implementation process of constructing a complete interaction sequence through the automatic exploration engine includes:

The automatic exploration engine carries out an exploration process in which it analyzes the screen content set, identifies the graphic elements that switch pages, and adds them to a queue to be explored. Based on a predefined television service navigation strategy, the exploration engine generates an interaction instruction and sends it to a control executor, which simulates a physical remote control device to issue a navigation instruction that changes the state of the television client. This operation is synchronized with the subsequent video frame sampling, so that systematic data acquisition is completed and a complete interaction sequence is constructed.

Preferably, the specific implementation process of generating, through cyclic exploration, the television service model of service access points and their transition relations includes:

After the exploration process, an analysis process is carried out in which a state signature is generated based on the interaction sequence; the state signature identifies a presentation of the interactive program guide as a service access point. When a navigation instruction causes the presented content to change from an initial service access point to a target service access point, a directed transition relation between the two is established and recorded. Through the cyclic exploration and analysis process, all discovered service access points and their transition relations are constructed into a visual television service model.

Preferably, the process of performing deep parsing on the television service model includes:

A shortest path algorithm is applied to calculate the minimum number of interaction steps from an entry service access point to key television content assets and service access points, so as to evaluate the discoverability and navigation efficiency of paid content and value-added services. A centrality measure is used to score and rank the importance of all service access points and to identify core content aggregation pages and navigation distribution hubs. A community discovery algorithm is run to divide the television service model into different service logic domains, which reflect the service packaging and bundling strategies of the television service. The text content attached to all service access points is aggregated.

Preferably, the process of performing time-domain differential comparison between the television service model and a plurality of historical models includes:

The television service model of the current period and a plurality of television service models of historical periods are selected as inputs to the differential analysis. A service access point association flow is executed, which establishes a one-to-one correspondence of service access points between the current and historical models based on their state signatures. Matched service access point pairs are compared one by one, and changes in their internal attributes are identified and recorded; service access points without a counterpart are identified as added or deleted. Based on the established correspondence, added, deleted, and redirected navigation paths caused by changes in the television service interaction logic are identified and recorded. Differences in service menu structure, content entry positions, and channel arrangement order between the two models are calculated and quantified to evaluate periodic adjustments of the content arrangement strategy, and all changes are compiled into a set of time-ordered content and navigation change records.

Preferably, the specific implementation process of outputting the model data and the strategy evaluation report includes:

The television program performance indexes obtained in the deep analysis are fused with the content and navigation change records obtained in the time-domain differential comparison to form aggregated records representing the operating dynamics of the platform. A content strategy analysis agent, given preset market targets and content arrangement criteria, is started to assess the aggregated records comprehensively and to identify where the current platform content strategy fits and where it diverges from market operation hot spots. Based on these points of fit and divergence, content arrangement suggestions for the next operation period are generated automatically, including recommendations of key content types, optimization of promotional resource slots, and responses to emerging trends. The deep analysis, the time-domain comparison, and the content arrangement suggestions are logically assembled into an explanatory text description. A graphical engine is called to generate trend charts from the changes in the performance indexes of content distribution efficiency, and the typeset content is output as a static report file that integrates the analysis of the content operation state with guidance on future strategy.

Compared with the prior art, the invention has the beneficial effects that:

1. In terms of automation and scale, the multi-mode model controls an HID device to browse automatically, which turns a process that relied on manual browsing and monitoring into one that acquires operation content automatically and in near real time. The method can scan and catalog the content operation state of an OTT platform comprehensively and synchronously within a short time, greatly improving efficiency and coverage over the prior art and laying a foundation for systematic competitive analysis across the industry.

2. In terms of semantic understanding, a multi-mode model performs deep semantic analysis on video frames and generates service state signatures that are robust to dynamic visual content. Screen pixels are accurately restored into structured representations containing business logic, and unique interface states can be identified stably despite interference from dynamic content such as advertisements and corner badges. The accuracy, consistency, and depth of the analysis results are significantly improved, so that the analysis can look past the visual surface and reveal the content arrangement and navigation logic behind the interface.

3. In the decision stage, deep topology analysis is performed on the constructed business navigation network model, and a strategy-making agent given specific business targets carries out comprehensive assessment and forward-looking planning, raising the analysis from describing the objective state to guiding future strategy. The method can automatically evaluate the effectiveness of content arrangement, generate optimization guidance that includes content type recommendations and promotion schemes, and give content operation a data-driven, quantifiable decision support capability.

Drawings

FIG. 1 is a flowchart of an OTT content operation optimization method based on multi-modal awareness and agent decision according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a multi-modal analysis structure according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an interaction process for generating an evaluation report according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1-3, the method for optimizing OTT content operation based on multi-modal sensing and agent decision according to the present invention comprises the following specific implementation steps:

sampling and digitizing a baseband video signal output by an OTT network television client and presenting visual content of an interactive program guide;

Correlating and aggregating the interaction sequences obtained through cyclic exploration to generate a television service model of the service access points and their transition relations;

Performing deep analysis on the internal structure of the television service model, performing time-domain differential comparison with the topological models stored from historical periods, performing strategy generation by a decision analysis agent, finally assembling all analysis, comparison, and generation results, and outputting an evaluation report on content navigation efficiency and presentation strategy.

The technical scheme of the invention is further described in detail below with reference to specific embodiments.

Example 1

The embodiment of the application discloses an OTT content operation optimization method based on multi-mode perception and agent decision. Referring to FIG. 1, a central analysis server carries out the process of identifying and optimizing a manufacturer's operation content. The specific implementation steps of the method comprise:

S1, sampling and digitizing a baseband video signal which is output by an OTT network television client and presents visual content of an interactive program guide;

S2, analyzing two-dimensional pixel information in the baseband video signal into a screen content set containing service information and content metadata through a multi-mode model;

S3, generating and sending a navigation instruction signal to the OTT network television client through an automatic exploration engine, and synchronously sampling the baseband video signal returned subsequently to construct a complete interaction sequence;

S4, correlating and aggregating a plurality of interaction sequences obtained through cyclic exploration to generate a television service model of the service access points and their transition relations;

S5, performing deep analysis on the internal structure of the television service model, and performing time-domain differential comparison with historical television service models;

S6, performing strategy generation by a decision analysis agent, finally assembling all analysis, comparison, and generation results, and outputting a strategy evaluation report.

Further, the method comprises the steps of sampling and digitizing a baseband video signal which is output by an OTT network television client and presents visual content of an interactive program guide, wherein the specific implementation process comprises the following steps of:

Specifically, firstly, a physical communication link is established between an HDMI output port of an OTT client and an HDMI input port of a video acquisition card or equipment with the same specification in a video acquisition subsystem through a high-speed data line conforming to HDMI2.0 standard. When the OTT client is running, it generates a standard, unencrypted baseband video signal, accurately presenting any interactive program guide or other user interface displayed on its screen. The video capture card intercepts this signal. The chip set inside the acquisition card performs analog-to-digital conversion and hardware compression to digitize the continuous video signal in real time into a series of discrete still video frames. For example, for a 4K signal source, the acquisition card may capture at 3840x2160 resolution, 60fps, or downsampled to 1920x1080 resolution, 60fps in order to reduce the subsequent processing load. These digitized video frame data are transmitted at high speed to a central analysis server via a high bandwidth interface.

The video acquisition subsystem offloads the computationally intensive video codec tasks from the central analysis server, so that valuable computing resources can be devoted to the core image analysis tasks. This design not only achieves platform independence but also optimizes the resource allocation and processing efficiency of the whole system.

Further, the two-dimensional pixel information in the baseband video signal is analyzed into a screen content set containing service information and content metadata through a multi-mode model, and the specific implementation process comprises the following steps of:

Each video frame is input into the central analysis server. First, the zero-shot object detection model Grounding DINO identifies bounding boxes for all potential UI components in the video frame. These components include video thumbnails, text blocks, buttons, icons, and the like. The Segment Anything Model (SAM) then generates a pixel-accurate segmentation mask for each detected component, and each component image is segmented out and fed into a multi-modal large language model (MLLM). Using its understanding of UI design patterns, the MLLM classifies each component, for example as a "button", "icon", or "text label", and infers its potential function. For a component containing text, an OCR engine is invoked to extract the character string it contains. Finally, the classification, the text content, and the exact position in the screen coordinate system (that is, the bounding box coordinates) of every component are stored in the screen content set.
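As an illustration only, the screen content set described above could be represented along the following lines. The names `ScreenElement` and `build_screen_content_set` are assumptions, the sample values are illustrative, and the detection, segmentation, classification, and OCR stages are replaced by precomputed tuples rather than real calls to Grounding DINO, SAM, an MLLM, or an OCR engine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen coordinates


@dataclass(frozen=True)
class ScreenElement:
    """One entry of the screen content set (hypothetical field names)."""
    element_id: int
    element_type: str            # e.g. "button", "icon", "text label"
    bbox: BBox
    text: Optional[str] = None   # set only for elements that carry text


def build_screen_content_set(detections) -> List[ScreenElement]:
    """Compile (type, bbox, text) tuples into a screen content set.

    `detections` stands in for the combined output of the detector,
    classifier, and OCR stages; none of those models is called here.
    """
    elements = []
    for idx, (etype, bbox, text) in enumerate(detections, start=1):
        elements.append(ScreenElement(idx, etype, tuple(bbox), text or None))
    # Sort top-to-bottom, then left-to-right, so the order is deterministic.
    return sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))


frame_detections = [
    ("button", (450, 820, 630, 880), "subscribe immediately"),
    ("picture thumbnail", (100, 200, 400, 600), ""),
    ("text label", (100, 120, 360, 160), "action and adventure"),
]
screen_content_set = build_screen_content_set(frame_detections)
for element in screen_content_set:
    print(element)
```

Keeping the bounding box alongside the type and text means later stages, such as signature generation and navigation, can work from this record alone without returning to the pixels.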

The multi-modal model thus converts the unstructured, purely visual data in video frames into a structured, semantic, machine-readable format. This effectively reverse-engineers the UI structure, so that the subsequent automation system can understand and operate the interface in a meaningful way, with greater robustness and generality. As long as the visual presentation remains consistent, components can be recognized and interacted with stably regardless of changes in the application's underlying code.

Further, generating and sending a navigation instruction signal to the OTT network television client through an automatic exploration engine, and synchronously sampling a baseband video signal returned subsequently to construct a complete interaction sequence, wherein the specific implementation process comprises the following steps of:

The exploration engine parses the screen content set, identifies all elements classified as "activatable" or "interactable", and adds them to a queue to be explored. A target element is selected from the queue according to a predefined breadth-first search, and the engine generates a navigation instruction for it. For example, for a button located at coordinates, the instruction may be "perform click operation at coordinates". The instruction is sent to a control executor, which delivers it to the OTT client either by simulating the infrared and Bluetooth signals of a physical remote control or by directly driving the interface elements of the app at the software level through a UI automation framework. The latter approach interacts with UI controls directly, is unaffected by changes in resolution, layout, or coordinates, and can read the attributes of any UI element to verify results more accurately. The central analysis server synchronizes this operation with the video acquisition subsystem and records the screen state before the operation, the operation itself, and the new state that appears on the screen afterwards. Each such group of records constitutes one complete interaction, and linking these interactions in time order yields a complete interaction sequence.
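The exploration loop above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the OTT client is replaced by a hypothetical lookup table `SIMULATED_CLIENT`, the queue holds screen states rather than individual elements, and no remote-control signal or video capture is involved.

```python
from collections import deque

# Hypothetical stand-in for the OTT client:
# state -> {activatable element: state reached after activating it}.
SIMULATED_CLIENT = {
    "home": {"movies": "movies", "account": "account"},
    "movies": {"back": "home", "detail": "movie_detail"},
    "account": {"back": "home"},
    "movie_detail": {"back": "movies"},
}


def explore(start_state, client):
    """Breadth-first exploration returning (before, action, after) triples."""
    interactions = []
    visited = {start_state}
    to_explore = deque([start_state])  # the queue to be explored
    while to_explore:
        state = to_explore.popleft()
        for element, next_state in client[state].items():
            # A real control executor would send a navigation instruction here,
            # and the next state would be derived from the newly sampled frame.
            interactions.append((state, element, next_state))
            if next_state not in visited:
                visited.add(next_state)
                to_explore.append(next_state)
    return interactions


interaction_sequence = explore("home", SIMULATED_CLIENT)
for before, action, after in interaction_sequence:
    print(f"{before} --{action}--> {after}")
```

Recording the state before, the action, and the state after as one triple mirrors the "complete interaction" described in the text, and the list order preserves the time order in which the interactions were performed.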

The exploration engine fully automates the extremely time-consuming and labor-intensive process of manually mapping an application's UI. It can explore all possible user journeys exhaustively and reproducibly, at a speed and breadth far exceeding those of human testers. The resulting large-scale UI interaction data set is more comprehensive and systematic than anything manual testing can produce.

Further, a television service model of the service access point and the transition relation thereof is generated through circulation, and the specific implementation process comprises the following steps of:

For each unique screen interface encountered during exploration, the central analysis server generates a unique "state signature". Specifically, all elements in the screen content set are sorted deterministically by screen position, and the sorted element types, text contents, and coordinates are concatenated into a unique, normalized character string. This string itself is the signature of the service state. The central analysis server identifies each unique service state signature as a service access point and, for each recorded interaction, creates a transition relation from the service access point representing state A to the service access point representing state B. Each transition relation is given a time weight equal to the time spent switching from state A to state B as measured in synchronous sampling. This converts a user's vague subjective impression of "feeling fast" or "feeling stuck" into an objective performance index that can be accurately measured, compared, and tracked, which makes subsequent operation and maintenance more convenient and the optimization process more intuitive. The process is executed cyclically until the queue to be explored is empty, and all discovered service access points and their transition relations are constructed into a complete visual television service model, specifically a directed graph G = (V, E), where V is the set of service access points and E is the set of transition relations.
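A minimal sketch of the signature and graph construction might look like this. The function and class names are hypothetical, the sample elements are illustrative, and the SHA-256 hash is an added convenience for shortening the key; the text above uses the normalized string itself as the signature.

```python
import hashlib
from collections import defaultdict


def state_signature(elements):
    """Build a signature from (type, text, (x1, y1, x2, y2)) tuples."""
    # Deterministic ordering by screen position: top-to-bottom, left-to-right.
    ordered = sorted(elements, key=lambda e: (e[2][1], e[2][0]))
    parts = [f"{etype}|{text}|{','.join(map(str, bbox))}"
             for etype, text, bbox in ordered]
    normalized = ";".join(parts)
    # Hashing only shortens the key; it is not part of the described method.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ServiceModel:
    """Directed graph G = (V, E) with a time weight on each transition."""

    def __init__(self):
        self.nodes = set()               # V: service access points
        self.edges = defaultdict(dict)   # E: source -> {target: seconds}

    def add_interaction(self, before, after, seconds):
        self.nodes.update((before, after))
        if before != after:              # record only actual state changes
            self.edges[before][after] = seconds


home = [("button", "movies", (100, 900, 300, 960)),
        ("text label", "home", (100, 50, 300, 90))]
movies = [("text label", "movies", (100, 50, 300, 90))]
sig_home, sig_movies = state_signature(home), state_signature(movies)

model = ServiceModel()
model.add_interaction(sig_home, sig_movies, 0.42)
print(len(model.nodes), len(model.edges[sig_home]))
```

Because the elements are sorted before concatenation, the same screen yields the same signature regardless of the order in which the detector reported its components.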

This step converts thousands of linear, independent interaction sequences into a single, unified, structured model. It makes complex, non-linear user operations analyzable and can reveal deep relations among all screen interfaces in the service, so that the whole team can work from the same model.

Further, the deep analysis of the internal structure of the television service model in step S5 is specifically implemented as follows:

The central analysis server applies Dijkstra's algorithm on the model to calculate the minimum number of interaction steps required from a specified portal service access point, e.g., an application home page, to each key service access point, e.g., a particular movie classification, a pay-per-view content, or a user subscription page. The Betweenness Centrality algorithm is performed on all service access points in the graph. Those service access points that are located on the shortest path between the largest number of other service access point pairs are identified. Service access points with high centrality scores are key navigation hubs or potential flow bottlenecks in the application. For example, a search results page or main menu page will typically have a very high centrality score.
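The two measures above can be illustrated on a small example. In this sketch the graph `GRAPH` and its node names are hypothetical, every transition is assumed to cost one interaction step, and betweenness centrality is computed with Brandes' method for unweighted graphs, which is one common way to compute it and is not necessarily the implementation the method uses.

```python
import heapq
from collections import deque

# Hypothetical service model: access point -> reachable access points.
GRAPH = {
    "home": ["movies", "account"],
    "movies": ["home", "movie_detail"],
    "account": ["home", "subscribe"],
    "movie_detail": ["movies", "subscribe"],
    "subscribe": ["account"],
}


def min_steps(graph, source):
    """Dijkstra's algorithm with every transition costing one step."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nxt in graph[node]:
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                heapq.heappush(heap, (d + 1, nxt))
    return dist


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' method, unweighted)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order, preds = [], {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):       # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


steps = min_steps(GRAPH, "home")
centrality = betweenness(GRAPH)
print(steps)
print(sorted(centrality.items(), key=lambda kv: -kv[1]))
```

If the time weights on the transition relations were used instead of unit costs, the same Dijkstra routine would report the fastest path rather than the path with the fewest steps.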

The Louvain algorithm is run on the model to divide it into clusters of service access points, such that connections between service access points within a cluster are much denser than connections between clusters. These automatically discovered communities often correspond directly to logically independent business domains in the OTT service, such as "live TV", "video on demand", "children's zone", or "account management". The central analysis server also aggregates all text content extracted by OCR from the service access points in the graph. Frequency analysis of these texts identifies the content "hot spots" on the platform, that is, the programs, genres, actors, or services promoted most frequently across the whole service.
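The text frequency step can be sketched as below; community detection is not covered here. The function name `content_hot_spots` and the sample OCR strings are assumptions made for illustration, and whole labels are counted rather than individual words.

```python
from collections import Counter


def content_hot_spots(texts_by_access_point, top_n=3):
    """Count how often each OCR label appears across all access points."""
    counts = Counter()
    for texts in texts_by_access_point.values():
        # Normalize case and whitespace so repeated labels count together.
        counts.update(" ".join(t.lower().split()) for t in texts if t.strip())
    return counts.most_common(top_n)


# Hypothetical OCR output per service access point.
ocr_texts = {
    "home": ["Sci-Fi Premiere", "Live TV", "Subscribe"],
    "movies": ["Sci-Fi Premiere", "Action and Adventure"],
    "kids": ["Cartoons", "sci-fi  premiere"],
}
hot_spots = content_hot_spots(ocr_texts)
print(hot_spots)
```

A production system would likely need tokenization and entity matching suited to the platform's language, since OCR output for the same title can vary between pages.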

Analysis of the internal structure converts the abstract model into a set of objective, quantifiable performance indexes, replacing subjective evaluation with precise data. This enables operators to make data-driven decisions, accurately identify pain points in the user experience, and understand in depth the actual functional architecture and layout of their applications.

Further, the implementation process of performing time-domain differential comparison with the historical period models comprises the following steps:

Service access points are matched one-to-one between the current model and the historical model using the unique state signature of each service access point. Service access points that exist in the current model but not in the historical model are labeled "newly added screen".

Service access points that exist in the historical model but not in the current model are labeled "deleted screen".

For each pair of service access points that is successfully matched, the central analysis server compares the internal attributes, such as the text content extracted by OCR, to detect "content changes", for example a movie thumbnail replaced on the same page layout. Based on the established correspondence, the central analysis server then compares the transition relations of the two models. If a transition relation exists between two matched service access points in the new model but not in the old model, it is identified as a new path. If a transition relation exists in the old model but not in the new model, it is identified as a removed path. If the target service access point of a transition relation from a matched service access point has changed, it is identified as a redirected path. The central analysis server quantifies these differences and compiles them into a time-ordered change log.
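The comparison rules above can be sketched as a small diff routine. The data layout is an assumption made for illustration: each model is a dictionary with `nodes` mapping an opaque signature to its OCR text and `edges` mapping a (source signature, instruction) pair to a target signature, so that a redirect shows up as the same key pointing to a different target.

```python
def diff_models(old, new):
    """Diff two models shaped as {"nodes": {...}, "edges": {...}}."""
    old_nodes, new_nodes = old["nodes"], new["nodes"]
    matched = old_nodes.keys() & new_nodes.keys()
    report = {
        "added_screens": sorted(new_nodes.keys() - old_nodes.keys()),
        "deleted_screens": sorted(old_nodes.keys() - new_nodes.keys()),
        "content_changes": sorted(s for s in matched
                                  if old_nodes[s] != new_nodes[s]),
        "new_paths": [],
        "removed_paths": [],
        "redirected_paths": [],
    }
    for key, target in sorted(new["edges"].items()):
        if key[0] not in matched:
            continue  # paths leaving an added screen are covered above
        if key not in old["edges"]:
            if target in matched:
                report["new_paths"].append((key, target))
        elif old["edges"][key] != target:
            report["redirected_paths"].append((key, old["edges"][key], target))
    for key, target in sorted(old["edges"].items()):
        if key[0] in matched and target in matched and key not in new["edges"]:
            report["removed_paths"].append((key, target))
    return report


# Hypothetical models of a historical period and the current period.
old_model = {
    "nodes": {"home": "Top picks: Movie A", "movies": "Movies",
              "account": "Account", "promo": "Promo"},
    "edges": {("home", "ok:movies"): "movies",
              ("home", "ok:account"): "account",
              ("home", "ok:banner"): "promo"},
}
new_model = {
    "nodes": {"home": "Top picks: Movie B", "movies": "Movies",
              "account": "Account", "kids": "Kids"},
    "edges": {("home", "ok:movies"): "movies",
              ("home", "ok:banner"): "movies",
              ("movies", "ok:account"): "account"},
}
changes = diff_models(old_model, new_model)
for kind, items in changes.items():
    print(kind, items)
```

Keeping the node key separate from the compared attributes is what allows a content change to be detected on a matched screen; how the method reconciles this with signatures that already contain text is not specified in the text.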

This comparison provides an automated, longitudinal audit trail of the platform's evolution. It replaces the tedious work of tracking UI changes manually and provides a powerful tool for decision analysis. The operator can use this record to correlate specific UI modifications with changes in key performance indicators.

Further, policy generation is performed by the decision analysis agent, all analysis, comparison and generation results are finally assembled, and an evaluation report of content navigation efficiency and presentation policy is output, wherein the specific implementation process comprises the following steps of:

A decision analysis agent is started and is responsible for integrating the outputs of the deep analysis and the time-domain comparison; for example, it may associate a decrease in navigation efficiency with a particular path redirection event. The agent is given specific business objectives and content arrangement criteria, such as "maximize user engagement with original content" or "shorten the purchase path for video on demand". Against these objectives, it assesses the fused data to identify where the current platform strategy fits and where it deviates from the set targets. Based on this assessment, the agent automatically generates specific, actionable content arrangement suggestions for the next operation cycle. These suggestions might include "promote the 'new science fiction episode' to the homepage carousel in response to the current hot spot trend" or "add a direct link to 'my subscription' in the user avatar drop-down menu, shortening the navigation path from 4 clicks to 2". A natural language generation module receives the structured analysis data, such as metrics, logs, and suggestions, and converts it into coherent, human-readable narrative text. The process follows several phases: content planning, sentence aggregation, and grammar structuring. At the same time, a visualization engine generates charts of the quantified data, such as trend graphs of navigation efficiency and pie charts of community size. Finally, the text and charts are assembled and laid out into a static report file or an interactive dashboard. The report and dashboard integrate a deep analysis of the current operating state of the OTT platform with clear guidance on future strategy.

On the dashboard, the user can filter data dynamically, compare models from different time periods, and even simulate the path changes that a given UI change would cause. Compared with a static file, the dashboard presents the current operating situation more intuitively and lets the decision suggestions be understood more quickly.

This step fully automates the entire process from analysis to strategy. It not only presents complex analysis results to decision makers in an easy-to-understand form but also actively generates actionable suggestions, bridging the gap between raw data and business decisions. This reduces the cognitive burden on operators and shortens the data-driven platform optimization cycle.

The real value of the whole process is that it builds a complete, automated decision loop for OTT operation: "observe" through video acquisition, "orient" through modeling and analysis, "decide" through the suggestions generated by the agent, and finally "act" when an operator follows the report. The invention automates the first three stages and thereby improves the deployment agility of OTT service providers.

Example II

In the process of converting visual signals into structured data, referring to fig. 2, the following is a specific embodiment:

The video acquisition device receives the signal, and its internal analog-to-digital converter and field-programmable gate array (FPGA) sample the video signal 60 times per second, converting the analog signal of each frame into digital pixel data. These data are encoded in real time to produce a series of high-definition still video frames, each representing the instantaneous picture on the terminal screen.

The central analysis server may prompt Grounding DINO with a set of text phrases: "a button", "a picture", "a text", "an icon", and "an input box". Grounding DINO interprets these phrases together with the content of a complete video frame, then draws bounding boxes around the "Subscribe Now" button, the poster of a currently popular TV show, and the words "Action and Adventure". Each bounding box generated by Grounding DINO is then fed in turn to SAM as a prompt. For the rough bounding box of the "Subscribe Now" button, SAM accurately identifies the button's rounded corners and edges and generates a segmentation mask containing only the button's pixels, so that only the button itself is cut out of the original frame. Likewise, an exact rectangular mask is generated for the TV show poster, and a mask that tightly wraps the characters is generated for the text. Each segmented UI element is sent to the MLLM for analysis. When the MLLM sees the "Subscribe Now" image block, it analyzes the block's visual features (rectangle, color fill, shadow) and internal text pattern and, drawing on its extensive knowledge of UIs, assigns a semantic label to each UI element: the block is a button that can trigger a subscription action, the TV show poster is classified as a picture thumbnail, and the "Action and Adventure" text is classified as a text label. The elements that the MLLM identifies as containing text, "Subscribe Now" and "Action and Adventure", are passed to the OCR engine to recognize their characters. Finally, the central analysis server assembles the outputs. For the "Subscribe Now" button, it generates a data record { id: 3, type: button, text: Subscribe Now, { x1: 450, y1: 820, x2: 630, y2: 880 }, mask } as one entry in the screen content set.
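The final assembly step can be illustrated with a minimal sketch. It assumes the detection, segmentation, classification, and OCR stages have already run and that their per-element results arrive as plain dictionaries. The names `ScreenElement`, `assemble_screen_content`, and the field name `box` are hypothetical, and the coordinates of the first two elements are invented for illustration; no model is actually called here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScreenElement:
    id: int
    type: str             # semantic label assigned by the classification stage
    text: Optional[str]   # OCR result; None for elements without text
    box: dict             # {"x1", "y1", "x2", "y2"} in screen pixels
    mask: object          # segmentation mask, treated as opaque here


def assemble_screen_content(stage_outputs):
    """Merge per-element stage outputs into one screen content set."""
    records = []
    for index, item in enumerate(stage_outputs, start=1):
        x1, y1, x2, y2 = item["box"]
        records.append(ScreenElement(
            id=index,
            type=item["label"],
            text=item.get("text"),
            box={"x1": x1, "y1": y1, "x2": x2, "y2": y2},
            mask=item.get("mask"),
        ))
    return records


# Illustrative inputs; only the third box is taken from the description.
stage_outputs = [
    {"box": (100, 100, 1820, 600), "label": "picture thumbnail"},
    {"box": (120, 640, 420, 690), "label": "text label",
     "text": "Action and Adventure"},
    {"box": (450, 820, 630, 880), "label": "button",
     "text": "Subscribe Now", "mask": "<mask>"},
]
screen_content = assemble_screen_content(stage_outputs)
```

Keeping the record flat in this way means later stages, such as signature generation, only need the type, text, and coordinates of each element.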

In preparation for exploring the whole application, the automated exploration engine first analyzes the screen content set of the application home page and identifies all clickable elements: the "TV show channel" button, the "movie channel" button, and the "search" icon. It places these clickable elements into a "to-be-explored queue", [ "TV show channel" button, "movie channel" button, "search" icon ], and follows a breadth-first strategy. The home page forms layer 0; the TV show channel page, the movie channel page, and the search page form layer 1; and all pages reached from layer-1 pages form layer 2. Pages in layer 2 are explored only after every layer-1 page reachable from layer 0 has been explored. The exploration engine first takes the "TV show channel" button from the head of the to-be-explored queue and sends a navigation instruction to the OTT client. For an Android-based set-top box, the instruction is a shell command sent through the Android Debug Bridge (ADB); for other systems, it may be a hexadecimal infrared code, which a USB infrared emitter connected to the analysis server transmits as a precise infrared light signal, simulating a remote control. From the moment the navigation instruction is sent, the video acquisition device continuously captures the video frames output by the OTT client at 60 fps, and the central analysis server continuously compares each pair of adjacent frames. When 30 consecutive frames are found to be identical, the screen is judged to have finished rendering and loading and to have entered a stable state, and the content of the last stable picture is taken as the final result of executing the navigation instruction. The central analysis server gathers the key information of this interaction into an interaction sequence { initial state signature, navigation instruction, target state signature, time }.
After the central analysis server has completed and stored the interaction sequence, it returns to the first step, takes the next target, the "movie channel" button, from the to-be-explored queue, and repeats the whole process, continuously generating interaction sequences.
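The two mechanisms described above, stable-state detection and breadth-first exploration, can be sketched as follows. This is an illustrative simulation under stated assumptions rather than the patented implementation: frames are stand-in strings compared for equality (real frames would be pixel arrays), the client is replaced by a hypothetical `links` dictionary mapping each page to the (element, target page) pairs found on it, and the function names are invented for this example.

```python
from collections import deque


def first_stable_frame(frames, required=30):
    """Index of the frame completing `required` identical consecutive frames."""
    run = 1
    for i in range(1, len(frames)):
        run = run + 1 if frames[i] == frames[i - 1] else 1
        if run >= required:
            return i
    return None  # the stream never settled


def explore_breadth_first(links, home):
    """Visit clickable elements layer by layer, starting at the home page."""
    layer = {home: 0}
    queue = deque((home, el, tgt) for el, tgt in links.get(home, []))
    sequences = []
    while queue:
        source, element, target = queue.popleft()
        sequences.append((source, element, target))
        if target not in layer:  # first time this page is reached
            layer[target] = layer[source] + 1
            queue.extend((target, el, tgt) for el, tgt in links.get(target, []))
    return sequences, layer


# Simulated capture: 5 loading frames, 2 transition frames, then a still screen.
frames = ["loading"] * 5 + ["a", "b"] + ["final"] * 40
stable_index = first_stable_frame(frames)

# Simulated application structure (page names are illustrative).
links = {
    "home": [("TV show channel button", "tv"),
             ("movie channel button", "movie"),
             ("search icon", "search")],
    "tv": [("poster", "detail")],
    "movie": [("poster", "detail")],
}
sequences, layer = explore_breadth_first(links, "home")
```

At 60 fps, the index returned by `first_stable_frame` can be converted into the elapsed time stored in the interaction sequence by dividing it by 60.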

Example III

In the process of constructing the service model, the specific implementation modes are as follows:

The automated exploration engine enters the application home page and, after all animations and main content have loaded, obtains a structured screen content set through the multimodal vision model. In this set, element A is { type: carousel image, content: a .jpg image, coordinates: [100,100,1820,600] }, element B is { type: icon button, content: "Search", coordinates: [50,950,150,1050] }, and element C is { type: icon button, content: "My", coordinates: [1770,950,1870,1050] }. The central analysis server sorts all elements in the screen content set according to the rule "vertical first, then horizontal", then concatenates the key information of each element (type, text content, and coordinates) into one long character string in sorted order, with elements separated by semicolons. This string is the state signature of the home page in a particular state. The carousel image is marked as a dynamic region to keep the home page's state signature relatively stable.
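A minimal sketch of signature generation follows. Several details are assumptions made for illustration: "vertical first, then horizontal" is read as sorting by the top coordinate and then the left coordinate, fields inside one element are joined with `|`, and the content of a dynamic region is replaced by the placeholder `<dynamic>`. The function name `state_signature` is hypothetical.

```python
def state_signature(elements, dynamic_types=("carousel image",)):
    """Build a state signature string from a screen content set."""
    # Sort top-to-bottom, then left-to-right; coordinates are [x1, y1, x2, y2].
    ordered = sorted(elements, key=lambda e: (e["coords"][1], e["coords"][0]))
    parts = []
    for e in ordered:
        # Content of dynamic regions is masked so it cannot change the signature.
        content = "<dynamic>" if e["type"] in dynamic_types else e["content"]
        coords = ",".join(str(c) for c in e["coords"])
        parts.append(f'{e["type"]}|{content}|{coords}')
    return ";".join(parts)


home_elements = [
    {"type": "icon button", "content": "My", "coords": [1770, 950, 1870, 1050]},
    {"type": "carousel image", "content": "slide-1", "coords": [100, 100, 1820, 600]},
    {"type": "icon button", "content": "Search", "coords": [50, 950, 150, 1050]},
]
signature = state_signature(home_elements)

# The same page with a different carousel slide yields the same signature.
rotated = [dict(e) for e in home_elements]
rotated[1]["content"] = "slide-2"
same_signature = state_signature(rotated)
```

Masking only the content of the dynamic region, while keeping its type and position, is one possible choice; it keeps layout changes visible in the signature.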

This unique signature is defined as a service access point (SAP) and is labeled "application home page" in the business model. The exploration engine then interacts with the interactable elements on the home page one by one and records each complete interaction sequence. The central analysis server simulates clicking element B on the home page. After 420 milliseconds the interface stabilizes on the "search page", and a transition relationship from the home page to the search page is created with the attributes { home page signature, click element B, search page signature, 420 ms }. The central analysis server then returns to the home page and clicks element C, reaches the "user center page" after 680 ms, and creates a transition relationship from the home page to the user center page with the attributes { home page signature, click element C, user center page signature, 680 ms }. In this way, several transition relationships with different time weights radiate from the single service access point of the application home page to different functional pages, initially forming a star-shaped network centered on the home page SAP. The automated exploration engine goes on to visit the "search page" and the "user center page" in turn, generates state signatures for these pages, defines them as new service access points, and then explores all links on them, creating new transition relationships. As exploration continues, the original star-shaped structure centered on the home page grows into a large, complex, interwoven network. This network is the final television service model.
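The construction of the model from interaction sequences can be sketched as a small directed graph. The representation below, a set of access points plus a nested dictionary of transitions, and the name `build_service_model` are assumptions chosen for illustration; signatures are abbreviated to readable labels.

```python
def build_service_model(interaction_sequences):
    """Turn interaction sequences into access points and timed transitions."""
    access_points = set()
    transitions = {}
    for source, instruction, target, elapsed_ms in interaction_sequences:
        access_points.update((source, target))
        # One outgoing edge per (source, instruction), weighted by load time.
        transitions.setdefault(source, {})[instruction] = (target, elapsed_ms)
    return access_points, transitions


sequences = [
    ("home page signature", "click element B", "search page signature", 420),
    ("home page signature", "click element C", "user center page signature", 680),
]
access_points, transitions = build_service_model(sequences)
```

With only these two sequences the result is the star-shaped structure described above; feeding in sequences recorded on the search and user center pages extends it into the full network.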

Example IV

In the process of analyzing and comparing television service models, referring to fig. 3, the specific implementation manner is as follows:

The Dijkstra algorithm creates a distance table that records the distance from the home page to every other service access point. The distance from the home page to itself is set to 0, and the distances to all other service access points are set to infinity. Starting from the home page, all directly connected service access points are examined, and the distance to each of them is updated to 1. Next, from all service access points with a known distance, the one with the shortest distance that has not yet been processed, for example the "movie channel", is selected, and all service access points connected to it are examined. If the movie channel is connected to the VIP purchase page, the distance to the purchase page is updated to 2. The algorithm repeats this process, expanding outward layer by layer and updating the distance table, until the shortest distance to the target VIP purchase page has been calculated, and that number is output.
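The procedure above can be sketched with Python's standard `heapq` module. The graph layout (a dictionary of neighbor-to-weight dictionaries), the page names, and the function name `shortest_steps` are assumptions for illustration; every edge is given weight 1 so that the distance equals the number of interaction steps.

```python
import heapq


def shortest_steps(graph, start, target):
    """Dijkstra's algorithm; returns the shortest distance or None."""
    distance = {start: 0}
    heap = [(0, start)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry
        done.add(node)
        if node == target:
            return d
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < distance.get(neighbor, float("inf")):
                distance[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return None  # target is unreachable from start


graph = {
    "home": {"TV show channel": 1, "movie channel": 1, "search": 1},
    "movie channel": {"VIP purchase": 1},
    "TV show channel": {"detail": 1},
    "detail": {"VIP purchase": 1},
}
steps = shortest_steps(graph, "home", "VIP purchase")
```

Replacing the unit weights with the recorded transition times would yield the fastest path rather than the path with the fewest clicks.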

The betweenness centrality algorithm in principle traverses all pairs of service access points in the graph and, for each pair, finds all shortest paths between them. For the service access point under analysis, for example the search results page, the algorithm counts how often it appears on these shortest paths: for each pair, it takes the fraction of that pair's shortest paths that pass through the page, and the sum of these fractions over all pairs is recorded as the page's centrality score.
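A direct, unoptimized sketch of this score for a directed, unweighted graph is shown below. It uses breadth-first search to count shortest paths and is written for clarity, not speed; the graph and the function names are illustrative assumptions.

```python
from collections import deque


def bfs_counts(graph, source):
    """Distances from source and the number of shortest paths to each node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(graph, node):
    """Sum over pairs (s, t) of the fraction of shortest paths through node."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    info = {n: bfs_counts(graph, n) for n in nodes}
    dist_v, sigma_v = info[node]
    score = 0.0
    for s in nodes:
        dist_s, sigma_s = info[s]
        if s == node or node not in dist_s:
            continue
        for t in nodes:
            if t in (s, node) or t not in dist_s or t not in dist_v:
                continue
            # node lies on a shortest s-t path only if the distances add up.
            if dist_s[node] + dist_v[t] == dist_s[t]:
                score += sigma_s[node] * sigma_v[t] / sigma_s[t]
    return score


graph = {
    "home": ["movie", "search"],
    "movie": ["detail"],
    "search": ["detail"],
    "detail": ["play"],
}
detail_score = betweenness(graph, "detail")
movie_score = betweenness(graph, "movie")
```

In this example the detail page lies on every path to the play page, so it scores higher than the movie channel, which shares its traffic with the search page.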

Initially, each service access point belongs to its own community. The Louvain algorithm traverses each service access point, tentatively moves it from its current community to the community of each of its neighbors, and calculates the resulting gain in modularity. Following a greedy strategy, the service access point is then moved to the neighboring community that yields the greatest modularity gain. This process is repeated for several rounds until moving any single service access point no longer improves the overall modularity. Each stable community formed in this phase is then aggregated into a new "super service access point", and a new, coarser-grained network is constructed from these super service access points. The moving phase is then repeated on this new network. The algorithm eventually outputs a stable community partition that assigns every service access point to a cluster. These automatically discovered communities tend to be highly consistent with the actual logical partitioning of the OTT service. For example, the algorithm may automatically identify community A (video-on-demand domain) as containing the home page, movie and TV show channels, detail pages, playback pages, and so on; community B (account management domain) as containing the "My" page, login page, order history, settings page, and so on; and community C (children's zone) as containing all pages related to children's content.
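The quantity that Louvain tries to increase can be made concrete with a short sketch. The code below computes only the modularity of a given partition; it is not a full Louvain implementation. It assumes the service model is treated as an undirected, unweighted graph, and the page names and the function name `modularity` are illustrative.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}  # number of edges inside each community
    degree = {}    # sum of node degrees in each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Two tightly linked groups of pages joined by a single link.
edges = [
    ("home", "movie"), ("movie", "detail"), ("home", "detail"),
    ("my", "login"), ("login", "orders"), ("my", "orders"),
    ("detail", "my"),
]
two_domains = {"home": "A", "movie": "A", "detail": "A",
               "my": "B", "login": "B", "orders": "B"}
one_domain = {page: "A" for page in two_domains}
singletons = {page: page for page in two_domains}

q_two = modularity(edges, two_domains)
q_one = modularity(edges, one_domain)
q_single = modularity(edges, singletons)
```

The two-domain partition scores highest of the three, which is why a greedy search over node moves tends to settle on it.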

The central analysis server traverses each service access point in the business model, collects all text extracted by OCR in its screen content set, such as program names, actors, synopses, button labels, and promotional slogans, and gathers it into a large text corpus. The corpus is preprocessed by removing meaningless stop words and splitting sentences into individual words or phrases. Finally, the total number of occurrences of each word or phrase in the whole corpus is counted, and a list of "hot words" sorted from highest to lowest frequency is output.
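A minimal sketch of the hot word count, using only the Python standard library, is given below. The sample texts and the stop word list are invented for illustration, and the regular-expression tokenizer is an assumption that suits English text; Chinese text would need a word segmenter instead.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real one would be much longer.
STOP_WORDS = {"the", "now", "watch"}


def hot_words(texts):
    """Count word frequencies across all OCR texts, most frequent first."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9\-]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common()


ocr_texts = [
    "New sci-fi series now streaming",
    "Sci-fi series: watch the new season",
    "Subscribe now",
]
ranking = hot_words(ocr_texts)
```

The resulting list of (word, count) pairs can be passed directly to a word cloud or table in the report.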

During the comparison, the central analysis server traverses each service access point in the current model and tries to find a service access point with an identical state signature in the historical model. If a state signature exists in the current model but not in the historical model, the service access point is marked as a "newly added screen". Conversely, if a state signature exists in the historical model but not in the current model, it is marked as a "deleted screen". Because the state signature is highly sensitive to content, it changes whenever the promotional content on a page changes, even if the overall page layout is unchanged. For example, the state signature of the movie channel page in the historical model contains the OCR-extracted text "Popular recommendations: hit movie 1". This week the operator replaced the title in the recommendation slot, so in the current model the signature contains the text "Popular recommendations: hit movie 2". The central analysis server first determines that the movie channel page in the historical model is a "deleted screen" and that the movie channel page in the current model is a "newly added screen". On further analysis, however, the two signatures are found to be highly similar. In the final change log, the central analysis server therefore records the pair as a "content change", in which the old movie channel page is replaced by a version with updated content.
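The matching logic can be sketched with set operations and the standard library's `difflib.SequenceMatcher`. The description does not specify how signature similarity is measured, so the similarity measure and the 0.8 threshold are assumptions, as are the abbreviated signatures and the function name `diff_access_points`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()


def diff_access_points(history, current, threshold=0.8):
    """Classify signatures as added, deleted, or content changes."""
    added = set(current) - set(history)
    deleted = set(history) - set(current)
    changes = []
    for old in sorted(deleted):
        # Pair a vanished signature with its most similar new signature.
        best = max(added, key=lambda new: similarity(old, new), default=None)
        if best is not None and similarity(old, best) >= threshold:
            changes.append((old, best))
            added.discard(best)
    deleted -= {old for old, _ in changes}
    return {"added": sorted(added), "deleted": sorted(deleted),
            "content_changes": changes}


history = {"movie channel|Popular recommendations: hit movie 1",
           "order history|Orders"}
current = {"movie channel|Popular recommendations: hit movie 2",
           "member center|Membership plans"}
result = diff_access_points(history, current)
```

Here the two movie channel signatures differ by a single character and are paired as a content change, while the unrelated pages remain a deleted screen and a newly added screen.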

If a transition relationship exists between two matched service access points in the current model but not in the historical model, it is identified as a "new path", meaning that a new navigation entry or shortcut has been added. Conversely, if a transition relationship exists in the historical model but the corresponding relationship is missing from the current model, it is identified as a "removed path". Suppose that in the historical model a transition relationship from the "My Account" page points to the old order history page, while in the current model the transition relationship from the same page now points to a new membership subscription management page. The central analysis server classifies this as a "redirected path", which accurately captures that the user's action is unchanged while the result of that action has changed fundamentally.
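The path comparison can be sketched by keying each transition on its source page and the action performed there. This keying, the page and action names, and the function name `diff_paths` are assumptions for illustration; the sketch also produces simple counts of each change type.

```python
from collections import Counter


def diff_paths(history, current):
    """Compare transitions keyed by (source page, action)."""
    log = []
    for key in sorted(set(history) | set(current)):
        old, new = history.get(key), current.get(key)
        if old is None:
            log.append(("new path", key, new))
        elif new is None:
            log.append(("removed path", key, old))
        elif old != new:
            # Same action on the same page now leads somewhere else.
            log.append(("redirected path", key, (old, new)))
    return log


history = {
    ("my account", "click orders"): "order history",
    ("home", "click banner"): "promo page",
}
current = {
    ("my account", "click orders"): "membership subscription management",
    ("home", "click shortcut"): "VIP purchase",
}
change_log = diff_paths(history, current)
summary = Counter(kind for kind, _, _ in change_log)
```

Because the key stays the same for a redirected path, the log records that the user's action is unchanged while its destination differs.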

The central analysis server first compiles quantitative statistics on all changes to form a high-level summary. In addition to these aggregate statistics, it generates a detailed, itemized, time-ordered change log. Each record contains the change type, the signature of the service access point or transition relationship involved, and, where available, a description of the context.

Example V

In the process of agent decision and report generation, the specific implementation modes are as follows:

All data obtained from analyzing and comparing the television service models are input into the agent. The agent discovers that the navigation depth of the VIP purchase page has deteriorated from 3 steps to 5 steps this week. In the change log, it finds a record describing a button on the "movie channel page" that originally pointed to the VIP purchase page and now redirects to a general "member activity center" page. The agent correlates these two independent findings and forms a strong causal hypothesis: the decrease in navigation efficiency is most likely caused by this specific path redirection event.

When instantiated, the agent is given the specific business objective "shorten the purchase path for video on demand" and the content orchestration criterion "maximize user engagement with original content". The agent checks the causal chain it has derived against these preset criteria, comparing the inferred fact that the path to the VIP purchase page has lengthened with the business objective. It identifies a serious deviation: the current platform strategy runs counter to the established business objective. For this deviation the agent automatically generates a structured suggestion: "restore the direct navigation link to the VIP purchase page on the 'movie channel page', with the aim of shortening the navigation path from the current 5 clicks to at most 3 clicks". Separately, the text analysis shows that the new science fiction series is a hot topic across the platform, while the navigation analysis shows that its entry point is buried deep. This is an opportunity to amplify, consistent with the content orchestration criterion. For it the agent generates the suggestion: "promote the 'new science fiction series' feature entry to the carousel or the quick-entry icon row of the 'application home page' to respond to the current trend and maximize user engagement".
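The correlation and deviation check can be illustrated with a deliberately simple, rule-based sketch. It stands in for the agent's reasoning and should not be read as how the agent is actually built; the record fields, the default page name, and the function name `review_purchase_path` are hypothetical.

```python
def review_purchase_path(previous_depth, current_depth, target_depth,
                         change_log, page="VIP purchase page"):
    """Return a structured suggestion, or None if the objective is met."""
    if current_depth <= target_depth:
        return None
    # A redirection that used to lead to the page is a probable cause.
    causes = [c for c in change_log
              if c["type"] == "redirected path" and c["old_target"] == page]
    return {
        "objective": "shorten the purchase path of video on demand",
        "deviation": f"{page} depth is {current_depth}, target is {target_depth}",
        "regressed": current_depth > previous_depth,
        "probable_causes": [c["source"] for c in causes],
        "actions": [f"restore the direct link from '{c['source']}' to '{page}'"
                    for c in causes],
    }


change_log = [
    {"type": "redirected path", "source": "movie channel page",
     "old_target": "VIP purchase page",
     "new_target": "member activity center"},
    {"type": "new path", "source": "home page",
     "old_target": None, "new_target": "search page"},
]
suggestion = review_purchase_path(3, 5, 3, change_log)
```

The structured result keeps the objective, the measured deviation, and the proposed action together, so a downstream text generator can turn it into a sentence for the report.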

A natural language generation module receives all metrics, logs, and suggestions from the preceding stages and generates the narrative in order: it first summarizes the key changes, then analyzes their specific impact, and finally presents the optimization suggestions. Meanwhile, a visualization engine calls a chart library to render the historical navigation efficiency data as a trend chart that shows how the VIP purchase path depth has changed, the community partition produced by the Louvain algorithm as a pie chart of community sizes, and the hot word list as a word cloud. All generated narrative text and charts are fed into a typesetting engine, which lays out and formats the content according to the report template and adds brand elements such as the company logo.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. An OTT content operation optimization method based on multimodal perception and agent decision, characterized by comprising the following steps:

parsing the two-dimensional pixel information in the baseband video signal, through a multimodal model, into a screen content set containing service information and content metadata; generating a navigation instruction signal through an automated exploration engine and sending it to an OTT network television client, directly driving interface elements of the App at the software layer through infrared and Bluetooth signals; and synchronously sampling the baseband video signal subsequently returned, to construct a complete interaction sequence;

associating and aggregating a plurality of interaction sequences obtained through cyclic exploration to generate a state signature, and marking the interactive program guide as a service access point according to the state signature;

performing deep analysis on the internal structure of the television service model: applying a shortest path algorithm to calculate the minimum number of interaction steps from the entry service access point to key television content assets and service access points, and evaluating the discoverability and navigation efficiency of paid content and value-added services; scoring and ranking the importance of the service access points with a centrality measure to identify core content aggregation pages and navigation distribution hubs; running a community discovery algorithm to divide the television service model into different service logic domains; performing a time-domain difference comparison with historical television service models; performing strategy generation by a decision analysis agent; and finally compiling all analysis, comparison, and generation results and outputting a model data and strategy evaluation report.

2. The OTT content operation optimization method based on multimodal perception and agent decision according to claim 1, wherein the process of sampling and digitizing the baseband video signal of the OTT network television client comprises: establishing a physical communication link between the OTT network television client and an external video acquisition device through a high-definition multimedia interface; generating the baseband video signal by the OTT network television client; and performing digital conversion by the video acquisition device to generate, in real time, static video frames representing the interactive program guide picture displayed by the terminal.

3. The OTT content operation optimization method based on multimodal perception and agent decision according to claim 1, wherein the specific implementation process of parsing the baseband video signal into a set of screen content through the multimodal model includes inputting a still video frame into the multimodal model, recovering and extracting an image component encoded as a visual element in a client rendering process from a pixel domain of a television picture, identifying and classifying graphic elements carrying service information in the picture, calculating and outputting position information in a screen coordinate system for the identified graphic elements, performing optical character recognition for text elements, extracting character string content, and compiling the classification of all elements, character string content and position information into a set of screen content describing presentation content of the still video frame.

4. The OTT content operation optimization method based on multimodal perception and agent decision according to claim 1, wherein the specific implementation process of constructing a complete interaction sequence by the automated exploration engine comprises: analyzing the screen content set during exploration, identifying the graphic elements in it that switch pages, and adding them to a to-be-explored queue; generating, by the exploration engine, an interaction instruction based on a predefined television service navigation strategy and sending it to a control executor; simulating a physical remote control device to send a navigation instruction that changes the state of the television client; and synchronizing with subsequent video frame sampling to complete systematic data acquisition and construct the complete interaction sequence.

5. The OTT content operation optimization method based on multimodal perception and agent decision according to claim 1, wherein the specific implementation process of the time-domain difference comparison between the television service model and a plurality of historical models comprises: selecting the television service model of the current period and the television service models of a plurality of historical periods as input to the difference analysis; executing an association and comparison procedure for service access points, establishing a one-to-one correspondence between service access points in the current and historical models based on each state signature; comparing the matched service access point pairs one by one, and identifying and recording changes in their internal attributes; identifying service access points without a counterpart as newly added or deleted; identifying and recording, based on the established service access point correspondence, navigation paths that have been added, deleted, or redirected because of changes in the television service interaction logic; calculating and quantifying the differences in service menu structure, content entry positions, and channel arrangement order between the two models; evaluating periodic adjustments of the content arrangement strategy; and compiling all changes into a time-ordered content and navigation change record.

6. The OTT content operation optimization method based on multimodal perception and agent decision according to claim 1, wherein the specific implementation process of outputting a model data and strategy evaluation report comprises: deeply fusing the television program performance indexes obtained in the deep analysis with the content and navigation change records obtained in the time-domain difference comparison to form an aggregate record representing the platform's operating dynamics; starting a content strategy analysis agent that is given preset market objectives and content arrangement criteria, which comprehensively evaluates the aggregate record and identifies the degree of fit and the points of difference between the current platform content strategy and market operation hot spots; automatically generating, based on the degree of fit and the points of difference, content arrangement suggestions for the next content operation period, the content arrangement comprising recommendation of key content types, optimization of promotional resource positions, and response to emerging trends; logically assembling the deep analysis, the time-domain comparison, and the content arrangement suggestions into an explanatory text description; calling a graphics engine to generate charts of the changes in the performance indexes of content distribution efficiency; and outputting the assembled content as a static file, so that the report integrates the content strategy analysis with the future operation strategy.