Detailed Description
The terms "first," "second," and the like herein are used to distinguish between similar objects, not necessarily to describe a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances, so that the embodiments of the application can operate in sequences other than those illustrated or described herein, and that "first" and "second" distinguish between objects without limiting their number: the first object may, for example, be one or more. Furthermore, "or" in the present application means at least one of the connected objects. For example, "A or B" encompasses three schemes: scheme one, including A and excluding B; scheme two, including B and excluding A; and scheme three, including both A and B. The character "/" generally indicates that the objects before and after it are in an "or" relationship.
The term "indication" in the application may be either a direct indication (an explicit indication) or an indirect indication (an implicit indication). A direct indication may be understood as the sender explicitly informing the receiver, in the sent indication, of the specific information, the operation to be executed, the request result, and the like; an indirect indication may be understood as the receiver determining the corresponding information from the indication sent by the sender, or determining the operation to be executed, the request result, and the like from a determination result.
The visualization schemes in the prior art have the following problems: (1) the user threshold is high, since a user needs to master complex data processing and visual design knowledge, and the configuration flow is complicated and difficult to use; (2) the use efficiency is low, since the traditional manual method of producing a billboard or report requires flows such as requirement analysis, data processing, and visual configuration, which is at odds with the rapid decision making that enterprises often require; (3) the flexibility is poor, since the data analysis, data processing, and chart configuration flows must face diversified user requirements while the data and charts are not general, so each requirement must be produced anew; and (4) the data value utilization rate is low, since the complexity of operation means the visual results of data analysis generally lag, which is not conducive to the wide application of data in enterprises or to mining the potential value behind the data, and cannot provide efficient and effective data support for the development of enterprises. The application aims to solve at least one of these problems.
It should be noted that a conventional big data visual analysis platform generally requires users to have certain expertise and skills, and to manually perform data processing, analysis, and the design and production of visual charts. This process is not only cumbersome and time consuming, but also presents a high threshold for non-professional users, making it difficult to efficiently implement data-driven decisions. The application focuses on pain points in the traditional data analysis visualization process and achieves a breakthrough by utilizing cutting-edge technology. It aims to solve the following problems:
(1) The problem of a high use threshold: traditional data analysis visualization tools require the user to have professional skills such as data processing and chart configuration, and are complex to operate. The application aims, through natural language driving, to enable a user to complete the whole flow from demand analysis to visual presentation by describing the demand in natural language only, reducing the use threshold and enabling non-professional personnel to easily perform data analysis.
(2) The problem of insufficient personalization: traditional tools struggle to meet users' diversified and personalized needs. In the application, user preferences are learned through a visual generative adversarial network (V-GAN), and personalized visual styles and analysis schemes are generated for different users and different scenes in combination with multi-round dialogue interaction, precisely matching fine-grained data processing requirements.
(3) The problem of the low efficiency and poor results of traditional data cleaning, which affect the accuracy and efficiency of subsequent analysis. A self-optimizing data cleaning strategy based on reinforcement learning is adopted, dynamically selecting the cleaning algorithm, improving the efficiency and quality of data cleaning, and providing a reliable data basis for data analysis.
(4) The problem of lacking intelligent decision support: traditional data analysis stays at the data display level and cannot actively analyze data or provide decision assistance. In the application, a virtual data assistant monitors data fluctuation in real time, automatically detects anomalies, recommends potential analysis dimensions in combination with a knowledge graph, and generates root cause analysis reports and image-text analysis reports, helping users deeply mine data value and realizing an upgrade from data visualization to intelligent decision support.
Referring to fig. 1, an embodiment of the present application provides a big data analysis visualization method based on a big model, including:
And step 11, receiving a natural language requirement input by a user, wherein the natural language requirement comprises at least one of text, voice and chart interactive labeling instruction.
The application supports various input forms such as text, voice, and chart interactive labeling (such as dragging a highlight area or a gesture label). For example, the user may generate the demand description by voice input ("show sales trend in eastern China in March") or by directly circling a data range through chart interaction; step 11 thus supports multimodal input. The application uses Natural Language Processing (NLP) technology to segment the input content, parse it, and label semantic roles, extracts key elements (such as time range, index type, and data dimension), and converts them into structured query conditions. The application thereby converts unstructured user requirements into machine-understandable semantic expressions, providing an explicit execution direction for subsequent analysis.
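As an illustration of this step, a minimal keyword/regex sketch might map a requirement string to structured query conditions. The vocabularies and patterns below are hypothetical and far simpler than the segmentation and semantic-role-labeling pipeline described above:

```python
import re

# Hypothetical vocabularies; a production system would derive these from a
# knowledge graph rather than hard-coded lists.
REGIONS = {"eastern China", "northern China"}
METRICS = {"sales", "user liveness", "gross rate"}

def extract_query_elements(requirement: str) -> dict:
    """Map a natural-language requirement to structured query conditions."""
    elements = {"time": None, "region": None, "metric": None}
    # Very rough time-range detection: a year, a quarter, or a month name.
    m = re.search(r"\b20\d{2}\b|\bQ[1-4]\b"
                  r"|\b(?:January|February|March|April|May|June)\b",
                  requirement)
    if m:
        elements["time"] = m.group(0)
    lowered = requirement.lower()
    for region in REGIONS:
        if region.lower() in lowered:
            elements["region"] = region
    for metric in METRICS:
        if metric in lowered:
            elements["metric"] = metric
    return elements
```

For the sample requirement "show sales trend in eastern China in March", this sketch yields time "March", region "eastern China", and metric "sales".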
And step 12, decomposing a semantic layer, a logic layer and a task layer for the natural language requirement by adopting a preset hierarchical analysis model to generate an executable data analysis instruction.
The architecture of the hierarchical parsing model of the present application includes a semantic layer, a logic layer, and a task layer. The semantic layer parses user intent based on deep learning models (e.g., BERT, GPT), identifying core analysis targets (e.g., "predict sales" or "contrast regional differences"), and associates business terms with data fields in combination with a knowledge graph. The logic layer decomposes complex demands into executable subtasks (such as data cleaning, trend analysis, and anomaly detection) and determines the analysis flow (such as aggregation before visualization). The task layer generates specific instruction sets including data source calls, calculation logic (such as SQL query statements), visualization types (such as line graph or thermodynamic diagram), etc. The application converts natural language requirements into structured, programmable instructions, ensuring automatic execution of subsequent processes.
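The three-layer output can be sketched as plain data structures. This is a toy illustration: the intent and entities are passed in directly, whereas the real semantic layer would obtain them from BERT/GPT, and the table and metric names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SemanticLayer:
    intent: str      # e.g. "trend analysis"
    entities: dict   # business terms mapped to data fields

@dataclass
class LogicLayer:
    subtasks: list   # ordered subtasks of the analysis flow

@dataclass
class TaskLayer:
    sql: str         # generated calculation logic
    chart_type: str  # recommended visualization type

def parse_requirement(intent: str, entities: dict):
    """Toy three-layer decomposition mirroring the model described above."""
    semantic = SemanticLayer(intent=intent, entities=entities)
    logic = LogicLayer(subtasks=["data cleaning", "aggregation", "visualization"])
    table = entities.get("table", "sales")
    metric = entities.get("metric", "amount")
    task = TaskLayer(
        sql=f"SELECT {metric} FROM {table}",
        chart_type="line" if intent == "trend analysis" else "bar",
    )
    return semantic, logic, task
```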
And step 13, acquiring target data required by the data analysis instruction.
In the application, acquiring target data can extract data from a database, an API interface, or a local file, and cross-system data fusion is supported. Data preprocessing comprises: cleaning, i.e., processing missing values, repeated values, and abnormal values (such as filling missing data by interpolation); conversion, i.e., normalizing numerical data and encoding classification variables (such as converting a gender field into a 0/1 label); and enhancement, i.e., improving data integrity through interpolation and feature derivation (such as generating time window statistics). In this way, data quality and consistency are ensured, providing reliable input for analysis.
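The cleaning and conversion operations can be illustrated with minimal stand-alone helpers. These are sketches only: the interpolation helper assumes every gap has known neighbours on both sides, and the normalization assumes a non-constant series:

```python
def interpolate_missing(values):
    """Fill None gaps by linear interpolation between known neighbours.
    Assumes gaps are interior (known values exist on both sides)."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            lo = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            hi = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            frac = (i - lo) / (hi - lo)
            out[i] = out[lo] + frac * (out[hi] - out[lo])
    return out

def min_max_normalize(values):
    """Scale a numeric series into [0, 1]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def encode_binary(values, positive):
    """Encode a two-valued categorical field (e.g. gender) as 0/1 labels."""
    return [1 if v == positive else 0 for v in values]
```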
And 14, generating a basic visual chart according to the data analysis instruction, the target data, a preset chart recommendation algorithm and an aesthetic evaluation optimization algorithm.
In the application, the chart recommendation algorithm can automatically match the optimal chart type (such as a scatter chart for correlation analysis and a box chart for distribution display) based on data characteristics (such as dimension number and data distribution) and analysis targets (such as trend analysis and contrast analysis). Aesthetic evaluation optimization is used for color optimization, layout adjustment, and label optimization: color optimization can adopt a color-blindness-friendly palette to avoid visual confusion; layout adjustment can automatically arrange chart elements through a force-directed algorithm, reducing visual congestion; and label optimization can dynamically adjust coordinate axis scales and legend positions, ensuring clear information transmission. A basic visual chart is generated using the data analysis instruction, the target data, the preset chart recommendation algorithm, and the aesthetic evaluation optimization algorithm, rapidly producing a visual chart that meets the data characteristics and user requirements and improving information transmission efficiency.
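The matching of chart type to analysis target and data characteristics can be illustrated with a rule-of-thumb sketch; the application's recommendation algorithm is richer than this, and the specific rules below are assumptions for illustration:

```python
def recommend_chart(goal: str, n_dims: int) -> str:
    """Rule-of-thumb chart selection from analysis goal and dimensionality.
    Mirrors the examples in the text: scatter for correlation, box for
    distribution, line for trend; higher-dimensional comparisons fall
    back to a heatmap (illustrative choice)."""
    if goal == "correlation":
        return "scatter"
    if goal == "distribution":
        return "box"
    if goal == "trend":
        return "line"
    if goal == "comparison":
        return "bar" if n_dims <= 2 else "heatmap"
    return "table"  # safe fallback when no rule applies
```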
And step 15, generating a personalized visual result corresponding to the natural language requirement by learning a user preference characteristic through a visual generative adversarial network and dynamically adjusting the basic visual chart according to the user preference characteristic.
In the embodiment of the application, personalized adjustment is performed by the generative adversarial network. For example, user preference modeling may learn from user historical interaction data (e.g., click, hover, and modification records) through a visual generative adversarial network (e.g., VAE-GAN), extracting preference features (e.g., a preference for cool hues or a succinct layout). The dynamic adjustment strategy comprises: chart style migration, i.e., adjusting visual attributes such as color schemes and font sizes according to user preferences; interaction enhancement, i.e., automatically adding the data labels the user focuses on (such as outlier labels and trend line descriptions); and multi-view linkage, i.e., optimizing the chart combination (such as superposing a histogram and a line graph to show trend and contrast) according to user feedback. Thus, personalized adaptation of the visual result is realized, and user experience and decision efficiency are improved.
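The style migration step can be illustrated as overlaying learned preferences onto a base chart specification. The preference keys and spec fields below are hypothetical, and the actual V-GAN adjustment is learned rather than rule-based:

```python
def apply_preferences(chart_spec: dict, preferences: dict) -> dict:
    """Overlay learned user preference features onto a base chart spec.
    Stand-in for the V-GAN style migration described above."""
    adjusted = dict(chart_spec)  # do not mutate the base chart
    if preferences.get("palette"):
        adjusted["palette"] = preferences["palette"]  # e.g. cool hues
    if preferences.get("minimal_layout"):
        # A "succinct layout" preference: drop gridlines, compact legend.
        adjusted["gridlines"] = False
        adjusted["legend"] = "compact"
    return adjusted
```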
The method reduces manual intervention from demand analysis to chart generation and improves efficiency (where the traditional method needs several hours, the method can shorten this to the minute level); it supports multimodal interaction (text, voice, and chart interaction), adapting to different use scenes; it has dynamic optimization capability, i.e., personalized service that becomes more accurate with use is realized through GAN learning of user preferences; and it combines an industry knowledge graph to enhance the field pertinence of analysis results and improve field adaptability. The method realizes efficient conversion from data to insight by fusing a large model, visual algorithms, and interactive design, and has wide application value.
The application deeply fuses frontier technologies such as the large model, the generative adversarial network, and reinforcement learning to construct an intelligent data analysis visualization system. Its core is full-process automation driven by natural language, breaking through the bottlenecks of complex operation and insufficient personalization in traditional tools. Intelligent analysis and task scheduling of requirements are realized by utilizing a large model, full-link optimization from data processing to visual presentation is realized in combination with a visual generative adversarial network (V-GAN) and intelligent auxiliary algorithms, data analysis is upgraded to an intelligent decision support mode, and the requirements of fine-grained data processing in diversified scenes are met. Furthermore, the application provides a big data analysis visualization system based on a big model, which adopts a modularized design and is formed by the cooperation of a user interaction module, a big model interaction module, a data processing module, a visual generative adversarial network module, and a virtual data assistant module. The modules have a specific division of labor and tight cooperation; with natural language as the input drive, full-flow automation from demand analysis, data processing, and visual generation to intelligent decision support is realized through core algorithms.
The user interaction module serves as the man-machine interaction interface, receiving requirements and transmitting them to the large model interaction module; the large model interaction module analyzes the requirements and then schedules data processing and visual generation; the visual generative adversarial network module gives personalized characteristics to visual results; and the virtual data assistant module monitors data in real time and provides intelligent assistance, finally forming a complete intelligent data analysis visualization system.
The user interaction module serves as the entrance and exit of user interaction with the system, playing the core functions of multimodal input reception, visual result display, and user feedback collection, and is used for realizing step 11, the basic visual chart of step 14, and the personalized visual result display of step 15. By supporting diversified input modes such as natural language, voice, and chart interactive labeling, the user operation threshold is reduced; by recording the user operation history and feedback data, a personalized preference model is constructed, providing a data basis for subsequent personalized services and realizing the key links of accurately understanding and meeting user demands. The user interaction module analyzes a text instruction input by a user with natural language processing technology, converts a voice requirement into text information with voice recognition technology, and recognizes the user's labeling intention for chart interactive labeling through an image processing algorithm. At the output end, the visual result and the analysis suggestions provided by the virtual data assistant are displayed in forms such as a webpage or a mobile application. By establishing a user behavior log system, each operation of a user (such as instruction input, visual result viewing, and style adjustment) is recorded; log data are analyzed with a machine learning algorithm, user preference characteristics such as common chart types, color preferences, and analysis dimension trends are extracted, and a personalized preference model is constructed.
The large model interaction module serves as the communication hub between the system and the large model and carries the core tasks of user demand analysis and instruction transmission. For example, the large model interaction module converts the user's natural language into data analysis instructions and a visual design scheme executable by the system by adopting the preset hierarchical analysis model (a built-in hierarchical analysis algorithm) of step 12, and accurately understands the user's correction requirements using a multi-round dialogue optimization mechanism, avoiding repeated calculation, greatly improving the system's response efficiency, and ensuring accurate matching between user requirements and the large model's processing results. The hierarchical analysis algorithm adopts technologies such as Named Entity Recognition (NER) and syntactic analysis to decompose the user's natural language requirement into three layers. At the semantic layer, feature extraction is performed with a BERT model, and key entities such as the data range (e.g., time and objects), analysis actions (comparison, trend analysis), and visual form (billboard, report) are identified in combination with a Conditional Random Field (CRF); at the logic layer, a data processing flow is constructed based on a knowledge graph, and SQL/Python pseudocode instructions are generated through a graph traversal algorithm. For example, for the demand "compare gross rates of different product lines for each quarter of 2023", the instruction generated may be "SELECT quarter, product_line, AVG(gross_rate) FROM sales WHERE year = 2023 GROUP BY quarter, product_line".
At the task scheduling layer, a dependency graph G = (V, E) is established, where a node V represents a system module and an edge E represents a data flow direction; by marking the nodes affected by a correction instruction, only the relevant modules are updated, reducing repeated calculation and improving response efficiency by more than 60%.
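The marking of affected nodes can be sketched as a downstream reachability search over the dependency graph; the module names in the usage example are hypothetical:

```python
from collections import deque

def affected_modules(graph: dict, changed: str) -> set:
    """BFS downstream from a corrected node; only the returned modules
    need to re-run. `graph` maps each module to the list of modules that
    consume its output (the data flow direction of edge E)."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For a pipeline fetch → clean → agg → viz, a correction touching "agg" re-runs only "agg" and "viz"; "fetch" and "clean" keep their cached results.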
The multi-round dialogue optimization builds a dialogue management model based on a Transformer architecture and calculates the semantic similarity between the historical dialogue and the current instruction through a multi-head attention mechanism. Let the current demand vector be $x_t$ and the set of historical demand vectors be $\{x_1, x_2, \ldots, x_{t-1}\}$; the attention weights are calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q = x_t$, $K = V = \{x_1, x_2, \ldots, x_{t-1}\}$, and $d_k$ is the vector dimension. Through this mechanism, accurate understanding and execution of correction instructions are realized, and loss of context information is avoided.
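The similarity computation can be sketched as single-head scaled dot-product attention of the current demand vector over the historical vectors, in plain Python:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, history):
    """Scaled dot-product attention weights of the current demand vector
    (Q = x_t) over the historical demand vectors (K = x_1..x_{t-1})."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, h)) / math.sqrt(d_k)
              for h in history]
    return softmax(scores)
```

A history vector closely aligned with the current demand receives the largest weight, which is what lets a correction instruction be matched to the turn it amends.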
Specifically, step 12 described above includes:
extracting semantic layer characteristics from the natural language requirement by adopting a BERT layer in the preset hierarchical analysis model in combination with a conditional random field, obtaining a key entity set comprising a data time range, an analysis object, an analysis action type, and a visual form;
performing logic layer conversion with a graph traversal algorithm based on the domain knowledge graph in the preset hierarchical analysis model and the key entity set, generating target pseudocode comprising data screening conditions, aggregation dimensions, and an output form; and
performing task layer decomposition based on the module dependency graph in the preset hierarchical analysis model and the target pseudocode, generating an executable target instruction corresponding to task generation or task correction.
In the embodiment of the application, the BERT layer in the preset hierarchical analysis model is adopted for semantic layer feature extraction. BERT word vector generation: the natural language requirement is encoded with a pre-trained BERT model, generating word vectors containing context semantics. For example, for the input "analyze the trend in Q2 sales in 2023 in eastern China," BERT maps "eastern China" to a vector containing geographic information and "Q2" to a time code. Based on the BERT output, key entities are identified by sequence labeling through a Conditional Random Field (CRF). For example, the time range "2023 Q2" is labeled with B-TIME (time beginning) and I-TIME (time interior); the analysis object "sales" is labeled B-METRIC (index); "eastern China" is labeled B-REGION (region); and implicit visualization forms are identified from context, such as "trend" mapping to LINE_CHART (line graph). Thus, the core elements in the user requirement (time, object, action, and visual type) are accurately extracted, providing structured input for subsequent logical reasoning. The semantic understanding capability of BERT is combined with the sequence modeling of CRF, solving the ambiguity problem of traditional rule matching.
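The CRF's BIO-style output can be post-processed into entity spans with a small decoder; the sketch below shows that step only, with the labels taken from the example above:

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans, as a CRF
    decoder's label sequence would be post-processed."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [tok])          # start a new entity span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # continue the open span
        else:                                   # "O" or inconsistent I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]
```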
The logic layer conversion of the application is realized with a knowledge graph and graph traversal. The domain knowledge graph can be a preset business domain knowledge graph comprising entity relations (such as sales being statistically dimensioned by area) and function mappings (such as trend mapping to moving average). For example, association rules between "sales", "time", and "area" are defined in the knowledge graph. The graph traversal algorithm can traverse the knowledge graph through Breadth First Search (BFS) or Depth First Search (DFS) based on the key entity set and generate logic expressions: for the data filtering condition, traversing from the "eastern China" node yields the filter "province = eastern China"; for the aggregation dimension, "GROUP BY time dimension" is generated from the association between "sales" and "time"; and for the output form, "visualization type = line graph" is deduced in combination with the "trend" semantics.
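The BFS traversal can be sketched as a shortest-path search over an adjacency dict; the toy knowledge graph in the usage example is hypothetical:

```python
from collections import deque

def find_path(kg: dict, start: str, goal: str):
    """Breadth-first search over a knowledge graph (adjacency dict) to
    connect a user entity to a target field; the returned path is the
    chain of relations used to build a filter or aggregation condition."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in kg.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # entities not connected in the graph
```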
Pseudocode generation converts the logic expressions into an executable code framework. The application thereby converts natural language requirements into structured logic rules, ensuring the accuracy and executability of the data analysis flow. The domain knowledge embedded in the knowledge graph avoids the hallucination problem of a general model.
Task layer decomposition is realized with a module dependency graph and topological sorting. When constructing the module dependency graph, the dependency relationships of the data analysis task are first defined, e.g., data acquisition module → data cleaning module → aggregation analysis module → visualization module. Each module corresponds to subtasks (e.g., data cleaning includes missing value padding and outlier handling). Topological sorting is performed according to the dependency graph to determine the task execution order. For example, data retrieval (loading raw data from a database) is executed first; missing value padding and anomaly detection for data cleaning are executed in parallel; and aggregation analysis and visualization generation are executed serially.
If a data quality problem is detected (e.g., the missing value ratio exceeds a threshold), a task modification instruction is triggered, such as adding a data interpolation step. Automatic decomposition and parallel optimization of complex tasks are thereby realized, improving execution efficiency; topological sorting ensures that task dependencies are strictly satisfied, avoiding resource contention and deadlock.
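The layering described above can be sketched with Kahn-style topological staging, where all tasks in one stage depend only on earlier stages and may therefore run in parallel:

```python
def parallel_stages(deps: dict):
    """Kahn-style layering of a dependency graph. `deps` maps each task
    to the set of its prerequisite tasks; returns a list of stages, each
    a sorted list of tasks that can execute in parallel."""
    remaining = {t: set(d) for t, d in deps.items()}
    stages = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cyclic dependency")  # no valid schedule
        stages.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)  # prerequisites now satisfied
    return stages
```

With the example pipeline, missing-value padding and anomaly detection land in the same stage (parallel), while aggregation and visualization occupy later serial stages.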
The application realizes end-to-end semantic analysis, and forms a complete semantic understanding chain from fine-grained entity identification of BERT to logic reasoning of knowledge graph. Dynamic task arrangement is realized, flexible task splitting and parallelization are supported through a module dependency graph, and the requirements of data analysis of different scales are met. The field knowledge embedding realizes that the knowledge graph displays the business rules, reduces the dependence on the labeling data, and improves the robustness in a small sample scene.
For example, the input of the preset hierarchical analysis model is "show the distribution of user liveness in region X for the last three months", and its output is: at the semantic layer, time = last three months, object = user liveness in region X, visualization = thermodynamic diagram; at the logic layer, screening time range = 2024-02 to 2024-04, aggregation dimension = date, statistical function = count; at the task layer, the steps of acquiring raw data, cleaning the data, aggregating by date, and generating the thermodynamic diagram are executed in order. The method reduces the threshold of data analysis through structured analysis and automatic arrangement, and is suitable for scenes such as enterprise reports and scientific research analysis.
Optionally, after generating the personalized visual result corresponding to the natural language requirement, the method further includes:
monitoring, in real time through a virtual data assistant module, the original data stream associated with the personalized visual result, calculating an anomaly score from a time-series predicted value and an actual value, triggering an abnormal early warning when the anomaly score satisfies the condition that N continuous periods deviate from a preset value or the period-over-period growth rate exceeds a preset threshold, and generating an abnormal event record;
when a triggered abnormal early warning is detected, acquiring the weight of the abnormal event and an image-text report corresponding to the personalized visual result, and generating a multimodal decision report based on a multimodal large model;
collecting feedback data of the user according to the multimodal decision report, and updating the preset chart recommendation algorithm to a target algorithm; and
regenerating a target visualization result according to the target algorithm.
It should be noted that this step is implemented with the virtual data assistant (Data Copilot) module of the big-model-based big data analysis visualization system, which provides active data intelligence assistance to the user. It monitors the data state in real time, promptly discovering data fluctuations through anomaly detection and early warning algorithms and triggering warnings; it actively recommends potential analysis dimensions at the user demand input stage based on the knowledge graph and an intelligent recommendation algorithm, expanding the depth of the user's data insight; and it automatically generates root cause analysis reports and image-text analysis reports, providing decision support for the user and realizing full-flow assistance from data monitoring to intelligent decision.
In the embodiment of the application, the original data stream associated with the personalized visual result is monitored in real time through the virtual data assistant module, based on a preset LSTM-isolation forest hybrid model. In the LSTM-isolation forest hybrid model, LSTM time-series prediction is performed first: multi-step prediction (such as predicting the next 3 period values) is performed on the original data stream, outputting a predicted value sequence $\hat{y}_{t+1}, \hat{y}_{t+2}, \hat{y}_{t+3}$. Then, an anomaly score $S_t$ is calculated based on real-time data characteristics (such as fluctuation amplitude and historical deviation); the anomaly judgment can take the standard isolation forest form

$$S_t = 2^{-\frac{E[h(x_t)]}{c(n)}}$$

where $h(x_t)$ is the path length of sample $x_t$ in an isolation tree, $E[h(x_t)]$ is its average over the forest, and $c(n)$ is the normalization constant for $n$ samples.
The triggering conditions include the average of N continuous periods deviating beyond ±3σ, the period-over-period growth rate exceeding a threshold, and the like; the measured anomaly detection accuracy reaches over 95%. Here, the anomaly judgment rule triggers an early warning if any one of the following conditions is satisfied: the prediction error exceeds ±3σ (standard deviations) for N consecutive periods; the period-over-period growth rate is greater than a preset growth rate; or the anomaly score is greater than a preset anomaly score threshold. Real-time health evaluation of the data stream is thereby realized, potential risks are identified in advance, and business loss is avoided.
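The trigger rules above can be sketched as a small predicate; the threshold values below are illustrative defaults, not those of the application:

```python
def should_alert(errors, sigma, growth_rate, score,
                 n=3, growth_threshold=0.5, score_threshold=0.6):
    """Early-warning trigger mirroring the three rules: fire if the last
    n prediction errors all exceed 3*sigma, OR the period-over-period
    growth rate exceeds its threshold, OR the anomaly score exceeds its
    threshold. Thresholds are illustrative assumptions."""
    persistent = len(errors) >= n and all(abs(e) > 3 * sigma for e in errors[-n:])
    return persistent or growth_rate > growth_threshold or score > score_threshold
```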
When a triggered abnormal early warning is detected, the weight of the abnormal event and the image-text report corresponding to the personalized visual result are acquired, and a multimodal decision report is generated based on the multimodal large model. This comprises inputting fusion data, namely the abnormal event data (structured indexes), the image-text report (text + chart), and the knowledge graph (association relations), into a multimodal encoder in the multimodal large model. The semantic spaces of different modalities are aligned by contrastive learning; for example, a "sales decline" text description is associated with a declining trend in a histogram. Causal relationships among indexes are mined based on a Graph Neural Network (GNN), for example, "raw material price increase → production cost increase → profit decrease". A report generation policy is produced with the multimodal large model, comprising problem diagnosis, suggestion generation, and visual enhancement: problem diagnosis automatically generates an analysis of the cause of the anomaly; suggestion generation recommends coping strategies in combination with a business rule base (e.g., "increase the online channel budget and optimize the promotion period"); and visual enhancement automatically inserts auxiliary analysis charts such as contrast charts (anomaly period vs. baseline period) and causality charts. Technical anomaly data are thereby converted into business-executable decision suggestions, improving problem response efficiency.
Feedback data of the user are collected according to the multimodal decision report, the preset chart recommendation algorithm is updated to a target algorithm, and a closed loop of algorithm iterative optimization is formed, embodied in feedback data collection and target algorithm updating. Feedback data collection includes explicit feedback, i.e., the user's scoring of decision suggestions and modification records (e.g., adjusting chart types), and implicit feedback, i.e., the user's interaction behavior with the visual results. Target algorithm updating comprises: feature importance re-evaluation, i.e., adjusting the weights of the chart recommendation algorithm based on feedback data (for example, on discovering a user preference for dynamic interactive charts, raising the priority of the related algorithms); knowledge graph enhancement, i.e., expanding the edge relations of the knowledge graph according to newly discovered association patterns; and model parameter optimization, i.e., optimizing the hyperparameters of the LSTM prediction model (such as learning rate and hidden layer dimension) with reinforcement learning. A closed-loop system of data, insight, decision, feedback, and optimization is thus formed, continuously improving analysis precision.
The application provides end-to-end anomaly management with full-link coverage from detection (LSTM-isolation forest) to decision (multimodal report), shortening the mean time to recovery (MTTR); the knowledge graph is updated in real time through user feedback, adapting to business scene changes; and the combination of manual feedback and automatic parameter adjustment balances algorithm performance and business requirements. The method builds a complete closed loop from data monitoring to intelligent decision by integrating time-series analysis, graph reasoning, and multimodal generation technology, improving the operation efficiency and risk resistance of enterprises.
Optionally, based on the knowledge graph, an index association network G = (V, E) is constructed, where a node in V represents an index and an edge in E represents an association relationship. A graph convolution operation of a Graph Neural Network (GNN) is applied, as shown in the formula:
h_i^(l+1) = σ( Σ_{j∈N(i)} (1/c_ij) · W^(l) · h_j^(l) );
where h_i^(l) is the feature vector of node i at layer l, N(i) is the neighbor-node set of node i, W^(l) is the layer-l weight matrix, c_ij is a normalization constant, and σ is an activation function. Potential analysis dimensions, such as an association of sales with average order value and sales volume, or of region with product, are mined by calculating attention weights among nodes; analysis perspectives are actively recommended, and the depth of the user's data insight is expanded.
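As a minimal illustrative sketch (not the application's actual implementation), the mean-aggregation graph convolution and the inter-node attention weights described above can be expressed as follows; all function and variable names are hypothetical, and the attention here is plain dot-product attention with softmax normalization:

```python
import math

# Hypothetical sketch: one mean-aggregation graph-convolution step over the
# index association network G=(V, E), followed by softmax-normalised
# dot-product attention weights between a node and its neighbours.
def gcn_layer(features, neighbors):
    """features: {node: [float, ...]}, neighbors: {node: [node, ...]}"""
    out = {}
    for i, h_i in features.items():
        agg = list(h_i)
        for j in neighbors.get(i, []):
            agg = [a + b for a, b in zip(agg, features[j])]
        deg = 1 + len(neighbors.get(i, []))
        out[i] = [max(0.0, v / deg) for v in agg]  # mean-aggregate + ReLU
    return out

def attention_weights(features, i, neighbors):
    # dot-product score against each neighbour, then softmax
    scores = [sum(a * b for a, b in zip(features[i], features[j]))
              for j in neighbors[i]]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {j: e / z for j, e in zip(neighbors[i], exps)}
```

For instance, with nodes for "sales", "average order value" and "sales volume", the attention weights over the neighbours of "sales" sum to 1 and indicate which associated index to recommend as an analysis perspective.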
Specifically, when a triggered abnormality early warning is detected, acquiring the weight of the abnormal event and the image-text report corresponding to the personalized visual result, and generating a multi-modal decision report based on a multi-modal large model, comprises the following steps:
when a triggered abnormality early warning is detected, inputting the abnormal event record into a preset index association network to determine the weight of the abnormal event, wherein the index association network is constructed from text logs, time-series data and image data through a time-series attention mechanism and a graph neural network;
mapping the weight of the abnormal event into a weight vector;
determining text embedding characteristics and visual characteristics according to the image-text report;
And splicing the weight vector, the text embedded feature and the visual feature into a multi-modal input sequence, inputting the multi-modal input sequence into a pre-training multi-modal large model, and generating a multi-modal decision report.
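The splicing step above can be sketched as follows. This is a hedged illustration only: the scalar anomaly-event weight is mapped to a higher-dimensional vector through an untrained, randomly initialized fully connected layer (standing in for the trained layer the text describes), then concatenated with the text embedding and visual features along the feature dimension. All names are hypothetical:

```python
import random

# Hypothetical sketch of the multi-modal splicing step described above.
def map_weight_to_vector(weight, dim=4, seed=0):
    """Map a scalar anomaly weight to a dim-dimensional vector via a
    1-in/dim-out linear layer (randomly initialised here, untrained)."""
    rng = random.Random(seed)
    w_matrix = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [weight * w for w in w_matrix]

def build_multimodal_sequence(anomaly_weight, text_embedding, visual_features):
    weight_vec = map_weight_to_vector(anomaly_weight)
    # splice along the channel/feature dimension
    return weight_vec + list(text_embedding) + list(visual_features)
```

A real system would feed this concatenated sequence into the pre-trained multi-modal large model, typically with modality-type embeddings added first.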
It should be noted that the index association network is a calculation model based on a graph structure, and is used for modeling the dynamic association relationship of the monitoring index in the complex system by fusing multi-mode data (text log, time sequence data, image data and the like).
In the embodiment of the application, in the case that a triggered abnormality early warning is detected, the weight of the abnormal event needs to be calculated. Here, an index association network first needs to be constructed in advance. Text logs, time-series data and image data recorded for historical abnormal events are acquired and integrated to form integrated data; a spatio-temporal association relationship between indexes is established from the integrated data through a preset Graph Neural Network (GNN); and the index association network is constructed by combining the spatio-temporal association relationship with a time-series attention mechanism. For example, the index association network is represented as G = (V, E), where a node in V represents an index (the integrated data obtained by integrating the text logs, time-series data and image data as described above) and an edge in E represents the weight of the abnormal event corresponding to an abnormal event record. The time-series attention mechanism, such as the attention mechanism of a Transformer, can be used to dynamically evaluate the impact weight of abnormal events at different points in time.
After the abnormal event record is input into the constructed index association network, the network outputs the weight of the abnormal event. Feature extraction and fusion are then carried out using this weight: the abnormality weight is converted into a high-dimensional vector (e.g., through a fully connected layer) to perform weight-vector mapping; key region features of the image are extracted through ResNet-50; and multi-modal features are extracted to determine the text embedding features and visual features.
Finally, the weight vector, text embedding features and visual features are spliced along the channel dimension to form a multi-modal input sequence. The spliced multi-modal input sequence is input into the multi-modal large model, and the model automatically generates a report comprising: problem diagnosis, which associates the abnormal event with historical cases based on the knowledge graph; a proposal strategy, comprising a recommended repair scheme; and visual content for auxiliary decision-making, such as fault-component annotation diagrams and maintenance flow diagrams.
The method integrates text, time sequence and image data, improves the comprehensiveness of exception analysis, highlights recent events through a time sequence attention mechanism, enhances decision timeliness, and provides a fault causal chain by the associated knowledge graph to assist manual review.
Optionally, generating a basic visual chart according to the data analysis instruction, the target data, a preset chart recommendation algorithm and an aesthetic evaluation optimization algorithm includes:
Constructing a histogram or thermodynamic diagram based on a visual mapping knowledge base and a decision tree recommendation algorithm according to the data analysis instruction and the target data, wherein the visual mapping knowledge base comprises a triple mapping relation established by an index type, an analysis target and a chart type;
Optimizing the evaluation index of the histogram or thermodynamic diagram based on a multi-objective genetic algorithm of aesthetic evaluation optimization to determine an optimal solution set, wherein the evaluation index comprises the symmetry, color contrast and information density of the histogram or thermodynamic diagram;
And iteratively adjusting the coordinate axis distance, the color ring angle and the font size parameters of the histogram or thermodynamic diagram by utilizing the optimal solution set to generate a basic visual chart.
The visual mapping knowledge base of the application presets a triple rule base, such as (sales, trend analysis, line graph), (user distribution, spatial analysis, thermodynamic diagram) and (product comparison, category comparison, bar graph). The optimal chart type is dynamically selected by a decision tree algorithm according to the analysis target (such as trend or comparison) and the index type (such as numerical or geographic). The data-adaptation logic includes: for the histogram, comparisons across multiple categories (X-axis) against a single value (Y-axis), such as sales for different provinces; for the thermodynamic diagram, density or intensity display of two-dimensional data (e.g., time by region), such as orders per hour for each region. Through the structured rule base and decision tree reasoning, a high match between the chart type, the data characteristics and the analysis target is ensured, and manual chart-type selection errors are avoided.
When the multi-objective genetic algorithm optimizes the aesthetic indexes, algorithm design and optimization objectives are needed, and the quantified evaluation indexes of the optimization objectives comprise symmetry, color contrast and information density. Symmetry is the symmetry score of the chart's element distribution (e.g., a Gini coefficient of the differences in column heights), color contrast is the calculated contrast between foreground and background colors, and information density is the effective amount of information per unit area.
The multi-objective genetic algorithm flow is as follows. Population initialization: several groups of chart parameters (such as color ring angles and font sizes) are randomly generated. Fitness calculation: the fitness value is computed by combining multiple objective weights (e.g., symmetry weight 0.4, contrast 0.3, information density 0.3). Selection and crossover: high-quality individuals are retained by tournament selection, and new solutions are generated by simulated binary crossover. Mutation: Gaussian perturbation is applied to the parameters to maintain population diversity. Convergence judgment: iteration ends when the fitness change over five consecutive generations is smaller than the threshold.
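The flow above can be sketched as a minimal genetic loop. This is an assumption-laden toy, not the application's algorithm: the three scoring functions are stand-ins for real chart measurements, the weights 0.4/0.3/0.3 follow the text, and simulated binary crossover is omitted for brevity (only tournament selection and Gaussian mutation are shown):

```python
import random

random.seed(42)

# Toy stand-in scores on chart parameters (hue in degrees, font size in pt).
def sym(p):  return 1.0 - abs(p["hue"] - 180) / 180        # toy symmetry
def con(p):  return min(1.0, p["font"] / 20.0)             # toy contrast
def den(p):  return 1.0 - abs(p["font"] - 14) / 14.0       # toy density

def fitness(p):
    # weighted sum of objectives, weights as in the text
    return 0.4 * sym(p) + 0.3 * con(p) + 0.3 * den(p)

def evolve(pop_size=20, generations=50, eps=1e-4):
    pop = [{"hue": random.uniform(0, 360), "font": random.uniform(8, 24)}
           for _ in range(pop_size)]
    best, stall = -1.0, 0
    for _ in range(generations):
        # tournament selection: keep the better of two random individuals
        pop = [dict(max(random.sample(pop, 2), key=fitness))
               for _ in range(pop_size)]
        # Gaussian mutation maintains population diversity
        for p in pop:
            p["hue"] = min(360.0, max(0.0, p["hue"] + random.gauss(0, 5)))
            p["font"] = min(24.0, max(8.0, p["font"] + random.gauss(0, 0.5)))
        top = max(fitness(p) for p in pop)
        stall = stall + 1 if top - best < eps else 0
        best = max(best, top)
        if stall >= 5:   # five stagnant generations -> converged
            break
    return best
```

Running `evolve()` returns the best weighted fitness found, which for these toy scores lies in (0, 1].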
The optimal solution set is used to iteratively adjust the coordinate-axis spacing, color ring angle and font-size parameters of the histogram or thermodynamic diagram. The coordinate-axis spacing is adjusted according to the data range (e.g., when the sales range is 100-1000, the major tick is 100 and the minor tick is 20); the color ring angle is adjusted by rotating the hue according to the data distribution (e.g., a clockwise gradient for time-series data, longitude-latitude mapping for geographic data); and the font size is adjusted according to the chart size and the information hierarchy.
For example, the input of the multi-objective genetic algorithm is an original histogram (small differences in column height, single color); the algorithm generates 100 groups of parameter combinations, screens out the solution set with information density > 0.8 and contrast > 4.5, and selects the parameters with optimal symmetry (column spacing = data standard deviation × 0.5). The output is the adjusted histogram (optimized column width, enhanced gradient color, label avoidance). Thus, attractiveness and readability are balanced through fine parameter adjustment, generating a high-quality chart that conforms to human visual cognition.
Optionally, the present application employs the visualization generation module in the above system to handle the above steps, converting the processed data into intuitive and attractive visual charts. The module matches the optimal chart type according to the data characteristics and analysis target by means of the intelligent chart recommendation algorithm, quantitatively scores and automatically optimizes the chart in both the professional and aesthetic dimensions by means of the aesthetic evaluation optimization algorithm, and, combined with the personalized style generated by the V-GAN module, provides the user with high-quality visual results; it is the core module for realizing visual display of data.
The intelligent chart recommendation algorithm constructs a visual mapping knowledge base in which a triple mapping relationship is established between index types (dimensions and measures), analysis targets (comparison, distribution, association) and chart types. Chart-type classification is carried out by the C4.5 decision tree algorithm, and decision tree nodes are split according to the information gain ratio, as shown in the following formula:
GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A);
where Gain(S, A) is the information gain and SplitInfo(S, A) is the split information. For example, the decision tree prefers to recommend a line graph for the analysis target "time-series trend" and a histogram or thermodynamic diagram for "multi-dimensional comparison".
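The gain-ratio computation used for C4.5 node splitting can be sketched as follows (toy dataset and field names are illustrative, not the application's code):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(samples, attr, label):
    """Gain(S, A) / SplitInfo(S, A), as used by C4.5 to pick the split."""
    n = len(samples)
    base = entropy([s[label] for s in samples])
    parts = {}
    for s in samples:
        parts.setdefault(s[attr], []).append(s[label])
    gain = base - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = -sum(len(p) / n * math.log2(len(p) / n)
                      for p in parts.values())
    return gain / split_info if split_info else 0.0
```

On a toy set where the analysis target perfectly determines the chart type, the gain ratio reaches its maximum of 1.0.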
The aesthetic evaluation optimization algorithm designs an evaluation index system based on Gestalt theory, comprising dimensions such as symmetry S_sym, color contrast S_col and information density S_den, and adopts the non-dominated sorting genetic algorithm II (NSGA-II) for multi-objective optimization. The fitness function is shown in the following formula:
F = ω1·S_sym + ω2·S_col + ω3·S_den;
where ω1, ω2, ω3 are weight coefficients. Through iterative optimization, the positions of chart elements, the color scheme, the spacing layout and the like are automatically adjusted, ensuring professionalism and attractiveness.
Optionally, learning user preference features through a visualization generative adversarial network, dynamically adjusting the basic visual chart according to the user preference features, and generating a personalized visual result corresponding to the natural language requirement, comprises:
based on the user's operation history log, extracting style feature vectors of historical visual works through topic modeling, wherein the style feature vectors comprise high-frequency color values, font-use preference and component layout-density features;
constructing a visualization generative adversarial network according to the style feature vectors;
injecting parameters output by a generator of the visualization generative adversarial network into the basic visual chart, and executing theme-color mapping, dynamic font replacement and responsive layout reconstruction to generate a target result;
and verifying, by a discriminator of the visualization generative adversarial network, the visual consistency of the generated style corresponding to the target result with the user's historical works, and generating the personalized visual result corresponding to the natural language requirement after the consistency check passes.
In the embodiment of the application, user preference feature extraction is performed first; for example, topic clustering is performed on the user's history operation logs (such as click and modification records) using LDA (Latent Dirichlet Allocation) or BERTopic, and features such as high-frequency color values, font preference and component layout density are extracted into a style feature vector V = [h1, h2, ..., hn], where hi represents the statistical weight of the i-th class of visual attribute. An attention mechanism (e.g., a Transformer) is introduced to dynamically weight recent operations (e.g., a preference for cold hues in the last 3 months), thereby quantifying the user's visual preferences and providing a learnable prior distribution for the GAN.
Construction of the visualization generative adversarial network includes the network architecture design. The architecture comprises a Generator and a Discriminator. The input to the Generator is random noise z plus the user style vector v. The Generator adopts the mapping network of V-GAN, mapping v into an intermediate latent space that controls chart style parameters (such as color palettes and font sizes); its output is a parameterized description of the chart. The input to the Discriminator is the features of a real chart or of a generated chart (extracted by ResNet-50). The loss function of the Discriminator combines the adversarial loss with a style-consistency loss. The training strategy of the visualization generative adversarial network uses transfer learning: the generator is pre-trained on a large-scale design data set.
Based on the visualization generative adversarial network, parameter injection and chart generation are performed. The primary and secondary colors of the basic chart are replaced according to the color-palette parameters output by the generator, realizing theme color-system mapping; for example, if the user prefers a cyan color system, the default red is adjusted to cyan. Based on the font preference vector, the font family with the highest matching degree is selected (if "Font A" accounts for 70% of the user's history, it is preferred). According to the layout-density features, the arrangement of components is adjusted; for example, a grid layout reduces the spacing between components to 5 px, while a card layout increases white space for a low-density preference.
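The injection of generator output into a base chart description can be sketched as a simple override of style fields. This is a hedged illustration; the dictionary keys (`palette`, `font_family`, `component_gap_px`) are hypothetical names, not the application's schema:

```python
def inject_style(base_chart, generator_params):
    """Hypothetical sketch: apply generator-output style parameters to a
    base chart description (theme-colour mapping, dynamic font replacement,
    layout-density adjustment). Returns a new dict; the base is untouched."""
    styled = dict(base_chart)
    if "palette" in generator_params:           # theme colour-system mapping
        styled["palette"] = generator_params["palette"]
    if "font_family" in generator_params:       # dynamic font replacement
        styled["font_family"] = generator_params["font_family"]
    if "component_gap_px" in generator_params:  # layout-density reconstruction
        styled["component_gap_px"] = generator_params["component_gap_px"]
    return styled
```

For example, a default red chart with a 12 px component gap becomes cyan with a 5 px gap when the generator emits those preferences, while unspecified fields keep their defaults.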
By way of example, the input to the visualization generative adversarial network is a basic histogram (default white-on-blue, equal-width layout), and its output is a personalized histogram (blue-green gradient background color, sans-serif typeface, asymmetric layout).
According to the application, parameters output by the generator of the visualization generative adversarial network are injected into the basic visual chart; theme-color mapping, dynamic font replacement and responsive layout reconstruction are executed; and after the target result is generated, consistency verification and result output are carried out. The verification process is a feature-similarity calculation: the style vectors of the generated chart and the user's historical charts are compared using cosine similarity, and if the cosine similarity is greater than or equal to a preset similarity threshold, the consistency check passes. For a low-similarity result below the preset similarity threshold, a semi-automatic correction process is triggered, for example a manual re-check mechanism in which the saturation of the color palette is adjusted manually.
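The cosine-similarity check can be sketched directly; the 0.85 threshold below is an illustrative assumption, not a value stated in the text:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length style vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def consistency_check(generated_vec, history_vec, threshold=0.85):
    """Return (passed, similarity). Results below the threshold would be
    routed to the semi-automatic correction / manual re-check flow."""
    s = cosine(generated_vec, history_vec)
    return s >= threshold, s
```

Identical style vectors yield similarity 1.0 and pass; orthogonal vectors yield 0.0 and are routed to correction.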
According to the application, after the consistency check passes, an explanatory report is attached to the verified personalized chart in the personalized visual result corresponding to the natural language requirement (for example, "this chart adopts the dark-blue color scheme and card layout you prefer").
The application combines topic modeling with time-series attention for the first time, accurately capturing the user's dynamic preferences; fine-grained style control is realized through parameter injection, avoiding the uncontrollability of GAN generation results; and a human-machine collaborative verification function, mixing automatic scoring with manual re-checking, balances generation efficiency and quality.
Specifically, the present application employs the visualization generative adversarial network (V-GAN) module in the above system to implement the above steps and assign personalized features to the visual results. Through the adversarial training of the generator and the discriminator, preference features in the user's operation history are learned, and visual style parameters that match the user's style, such as the color scheme and font types, are generated, realizing differentiated visual presentation, meeting the user's diverse aesthetic and usage requirements, and improving user experience.
The generator in the personalized style generation algorithm adopts a U-Net architecture, takes as input a user preference vector p (features such as theme style and layout form extracted from the operation history), and outputs style parameters in CSS/JSON format. The discriminator, based on the V-GAN structure, performs local discrimination on the generated style. During adversarial training, the goal of the generator is to minimize the formula:
L_G = E_{y∼p_data}[log D(y)] + E_{p∼P}[log(1 − D(G(p)))];
where L_G represents the loss function of the generator, measuring the degree of difference between the visual style generated by the generator and real styles; the training goal of the generator is to minimize this loss so that the generated styles come closer to the real ones. y ∼ p_data denotes visual style parameters sampled from the real data distribution p_data, i.e., actually existing style descriptions that meet real requirements. D(y) is the discriminator's judgment on a style parameter y sampled from the real distribution, output as a probability between 0 and 1; the closer to 1, the more the discriminator considers the style real. p ∼ P denotes a user preference vector sampled from the user-preference distribution P; this vector contains feature information such as theme style and layout form extracted from the user's operation history. G is the generator, and G(p) is the CSS/JSON-format visualization style parameter output by the generator with the user preference vector p as input. D(G(p)) is the discriminator's judgment on the generator's output style parameter G(p), also a probability between 0 and 1; the larger the value, the better for the generator, i.e., the discriminator mistakes the generated style for a real one.
The objective of the discriminator is to maximize the formula:
L_D = E_{y∼p_data}[log D(y)] + E_{p∼P}[log(1 − D(G(p)))];
where L_D represents the loss function of the discriminator; the training goal of the discriminator is to maximize this loss to improve its ability to distinguish real styles from generated ones. E denotes the mathematical expectation symbol, consistent with the generator formula. y ∼ p_data has the same meaning as in the generator formula, namely visual style parameters sampled from the real data distribution p_data. D(y) represents the discriminator's judgment on the real style parameter y, which the discriminator expects to be as close to 1 as possible, i.e., to accurately recognize real styles. p ∼ P represents the user preference vector sampled from the user-preference distribution P. G(p) represents the visualization style parameters generated by the generator based on the user preference vector p. D(G(p)) represents the discriminator's judgment on the generator's output style parameter G(p), which the discriminator expects to be as close to 0 as possible, i.e., to accurately recognize the generated style as fake; accordingly, 1 − D(G(p)) inside the logarithm is the term the discriminator drives toward 1 when maximizing the loss.
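Given discriminator outputs for a batch, the two objectives can be computed as below. This is a hedged sketch of the standard GAN losses the formulas describe, not the application's training code; outputs are assumed to be probabilities strictly between 0 and 1:

```python
import math

def generator_loss(d_on_generated):
    """Mean of log(1 - D(G(p))) over a batch; the generator minimises this,
    so it is rewarded when the discriminator scores its output near 1."""
    return sum(math.log(1.0 - d) for d in d_on_generated) / len(d_on_generated)

def discriminator_loss(d_on_real, d_on_generated):
    """Mean of log D(y) plus mean of log(1 - D(G(p))); the discriminator
    maximises this, pushing D(y) toward 1 and D(G(p)) toward 0."""
    real = sum(math.log(d) for d in d_on_real) / len(d_on_real)
    fake = sum(math.log(1.0 - d) for d in d_on_generated) / len(d_on_generated)
    return real + fake
```

A confident, correct discriminator (D(y) ≈ 1, D(G(p)) ≈ 0) yields a discriminator loss near its maximum of 0, while a confused one scores lower.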
Through alternating training, the style fit is improved, mode collapse is avoided, and differentiated visual presentation is realized.
Optionally, acquiring target data required by the data analysis instruction includes:
Constructing a self-optimizing data cleaning pipeline based on the reinforcement learning framework;
acquiring preset data required by the data analysis instruction, and cleaning the preset data through the self-optimizing data cleaning pipeline, wherein the data cleaning comprises filling missing values in numerical fields of the preset data by K-nearest-neighbor interpolation, and filling missing values in classification fields of the preset data with the in-window mode (most frequent category) of a sliding window;
performing data cleaning on the preset data, and then performing outlier joint detection;
and under the condition that the verification rate of the abnormal value joint detection is larger than a preset threshold value, determining that the verified data is the target data.
In the embodiment of the application, the reinforcement learning framework can optionally adopt a distributed reinforcement learning framework to construct an agent-based dynamic optimization model. The reward function of the model takes data cleaning efficiency and quality as the optimization targets (e.g., processing-speed improvement rate, data-integrity score). Pipeline dynamic tuning means that through interaction between the agent and the environment (the environment simulates data cleaning tasks), the cleaning strategy (such as interpolation-algorithm selection and window-size setting) is adjusted in real time. The core operations of the preset data cleaning flow are as follows. Numerical-field missing-value processing: based on K-nearest-neighbor interpolation, the K nearest neighbors of a sample in feature space are selected (default K = 5), and the missing value is filled by a weighted average, with weights calculated by inverse distance. Classification-field missing-value processing: sliding-window mode filling is adopted, the window size is dynamically adjusted (e.g., adaptively set to 50-200 records according to the data distribution characteristics), and the missing value is filled with the category occurring most frequently in the window. The outlier joint-detection mechanism is realized by a hybrid detection strategy that cross-validates a statistical method against a machine learning model. Reinforcement learning assists the optimization: a deep reinforcement learning framework dynamically optimizes the hyper-parameters of the detection model, improving the generalization ability of anomaly detection. The decision rules for confirming the target data include verification-rate calculation and dynamic threshold adjustment.
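The sliding-window mode fill for classification fields can be sketched as follows (a trailing window and `None` for missing entries are illustrative assumptions; the text's adaptive 50-200-record sizing is reduced to a fixed `window` parameter):

```python
from collections import Counter

def window_mode_fill(values, window=5):
    """Fill None entries in a categorical column with the most frequent
    category seen inside a trailing window of `window` records."""
    out = []
    for i, v in enumerate(values):
        if v is not None:
            out.append(v)
            continue
        seen = [x for x in out[max(0, i - window):i] if x is not None]
        # mode of the window, or leave missing if the window is empty
        out.append(Counter(seen).most_common(1)[0][0] if seen else None)
    return out
```

A leading missing value with an empty window stays unfilled, which a real pipeline would handle with a fallback strategy.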
The verification-rate calculation evaluates the anomaly detection effect based on the harmonic mean (F1 score) of recall and precision; for example, verification is determined to pass when F1 score > 0.85. In dynamic threshold adjustment, if the preset threshold does not meet the service requirement, the optimal threshold interval is derived inversely through a multi-objective optimization algorithm. The application realizes dynamic optimization of data cleaning through the reinforcement learning framework, directly drawing on its distributed training capacity and interactive parameter tuning, and combines statistical feature engineering with deep-reinforcement-learning strategy optimization, embodying the cross-application of machine learning and reinforcement learning.
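The F1-based pass criterion can be computed from the confusion counts directly (the 0.85 threshold follows the example in the text; function names are illustrative):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def verification_passes(tp, fp, fn, threshold=0.85):
    """Verification passes when F1 exceeds the preset threshold."""
    return f1_score(tp, fp, fn) > threshold
```

For example, 90 true positives with 5 false positives and 5 false negatives gives F1 ≈ 0.947, which passes; a 50/50/50 split gives F1 = 0.5, which does not.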
The application can adopt the data processing module in the system to realize the steps, including cleaning, conversion and preprocessing of the original data, and provide high-quality and reliable data for subsequent visual analysis. Different data characteristics are dynamically adapted through a self-optimizing data cleaning algorithm, missing values and abnormal values are efficiently processed, data processing metadata are output, real-time data quality monitoring of a virtual data assistant is supported, and the method is a key link for guaranteeing data analysis accuracy.
The self-optimizing data cleaning algorithm builds a reinforcement learning framework based on the proximal policy optimization (PPO) algorithm; the state space S comprises data features (missing rate, outlier proportion, etc.), the action candidate set is A = {a1, a2, ..., an} (such as KNN interpolation and IQR detection), and the reward function R comprehensively considers indexes such as cleaning time T, data integrity I and consistency C, as shown in the formula:
R = α·(I − I_0)/(I_max − I_0) + β·(C − C_0)/(C_max − C_0) + γ·(1 − T/T_max);
where R represents the reward-function value used to evaluate the merits of the current data cleaning strategy; α, β, γ are weight coefficients controlling the importance of each index (α + β + γ = 1); I represents the data integrity (0-100%) after treatment by the current cleaning strategy; I_0 represents the integrity of the raw data before cleaning; I_max represents the maximum data integrity theoretically achievable (typically set to 100%); C represents the data consistency score (0-100%) after treatment by the current cleaning strategy; C_0 represents the consistency score of the raw data before cleaning; C_max represents the maximum data consistency score theoretically achievable (typically set to 100%); T represents the actual execution time (seconds) of the current cleaning strategy; and T_max represents a preset maximum acceptable cleaning-time threshold.
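The reward computation can be sketched numerically. The default weights and baselines below are illustrative assumptions (any α, β, γ summing to 1 fit the text), with integrity and consistency gains normalized against their theoretical maxima and time counted as a saving against T_max:

```python
def reward(I, C, T, I0=85.0, C0=80.0, Imax=100.0, Cmax=100.0, Tmax=60.0,
           alpha=0.4, beta=0.4, gamma=0.2):
    """Reward for the PPO cleaning agent: weighted sum of normalised
    integrity gain, consistency gain, and time saving. Weights and
    baseline values here are illustrative; alpha+beta+gamma must be 1."""
    r_integrity = (I - I0) / (Imax - I0)       # integrity gain, in [0, 1]
    r_consistency = (C - C0) / (Cmax - C0)     # consistency gain, in [0, 1]
    r_time = 1.0 - T / Tmax                    # time saving, 1 = instant
    return alpha * r_integrity + beta * r_consistency + gamma * r_time
```

A perfect, instantaneous cleaning run (I = C = 100, T = 0) scores the maximum reward of 1.0; a slower, lower-quality strategy scores proportionally less.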
The algorithm parameters are dynamically adjusted through Bayesian optimization. For numerical-data missing values, the KNN interpolation formula is adopted:
x̂_i = Σ_{j=1}^{K} w_j · x_j;
where x̂_i represents the missing-value estimate to be filled; K represents the number of neighbors of the K-nearest-neighbor algorithm (a hyper-parameter determined by Bayesian optimization); x_j represents the corresponding feature value of the j-th neighbor sample; and w_j represents the distance-weighting coefficient of the j-th neighbor sample, calculated as:
w_j = (1/d(x_i, x_j)) / Σ_{k=1}^{K} (1/d(x_i, x_k));
where d(x_i, x_j) represents the Euclidean distance between sample x_i and the j-th neighbor sample x_j. The formula shows that the closer the distance, the greater the weighting coefficient and the greater the influence on the missing-value estimate.
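The distance-weighted KNN fill can be sketched directly from these formulas (the small epsilon guarding against a zero distance is an implementation assumption, as is the `(features, value)` sample shape):

```python
import math

def knn_impute(target_features, samples, k=5):
    """Estimate a missing value as the inverse-distance-weighted mean of
    the values of the k nearest complete samples.
    samples: list of (feature_vector, value) pairs."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(samples, key=lambda s: dist(target_features, s[0]))[:k]
    # inverse-distance weights; epsilon avoids division by zero on exact match
    weights = [1.0 / (dist(target_features, f) + 1e-9) for f, _ in nearest]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / total
```

An exact feature match dominates the weighted average, so the imputed value converges to that neighbor's value.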
Outlier detection combines the IQR and isolation forest algorithms, and the final result is determined through an integrated voting mechanism. Cleaning efficiency is improved by 80% over the traditional method, and additional metadata (such as the change in data volume and outlier proportion before and after cleaning) is output for real-time monitoring by the virtual data assistant.
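The voting idea can be sketched with two lightweight detectors. Note the second detector is a z-score rule standing in for the isolation forest (which needs a proper library implementation); requiring agreement between the two is one simple instance of integrated voting:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Tukey fences: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # rough quartile positions
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return {v for v in values if v < lo or v > hi}

def zscore_outliers(values, z=3.0):
    """Flag points more than z population standard deviations from the mean
    (stand-in for the isolation forest detector)."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return {v for v in values if sd and abs(v - mu) / sd > z}

def joint_detect(values):
    """Integrated vote: a point is anomalous only if both detectors agree."""
    return iqr_outliers(values) & zscore_outliers(values)
```

On a series of near-constant readings with one extreme spike, both detectors flag the spike and the vote confirms it; on unremarkable data the vote returns nothing.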
In a specific application scenario provided by the application, for example a risk-analysis scenario in the financial industry, a user from a bank's risk-control department inputs: "analyze the overdue-rate trend of Card A for each branch over the last half year, compare the overdue distribution across different customer credit ratings, and display it on a dashboard, marking branches with abrupt overdue-rate changes." In the user interaction module of the system of the application, the user inputs the requirement through natural language; the system records the operation and transmits the requirement to the large model interaction module. In the visual result display stage, the user can interactively mark the chart, for example selecting a certain branch's data area, and after the system recognizes the selection, the branch's detailed overdue data can be further displayed.
The large model interaction module extracts, at the semantic layer, entities such as "last half year", "each branch", "Card A overdue rate" and "customer credit rating"; the logic layer generates SQL instructions to acquire the credit data of "time, branch, overdue rate and credit rating" together with the time information; according to the credit data and the time information, a visual configuration is generated, showing the overdue-rate trend as a line graph and the overdue distribution by credit rating as a stacked histogram; and when the user subsequently revises the requirement, the task scheduling layer updates only the affected modules according to the dependency graph.
The data processing module comprises 15% of missing values and 8% of abnormal values in original data, a self-optimizing data cleaning algorithm is subjected to reinforcement learning decision, KNN interpolation (k=5) is adopted for the missing values of the overdue rates of the numerical values, mode filling is adopted for the missing values in the classified fields of the credit ratings, 32 abnormal data are identified and corrected through combination of IQR and an isolated forest algorithm, and the integrity of the cleaned data is improved from 85% to 98%, and the consistency is 96%.
The visual generation module is used for selecting a line graph and a stacked bar graph according to analysis targets of time sequence and measurement and multi-dimensional comparison by an intelligent chart recommendation algorithm, and adjusting chart layout by an aesthetic evaluation optimization algorithm through an NSGA-II algorithm to enable symmetry scores to be improved from 68 to 89 and color contrast to be improved from 72 to 85, so that a professional and attractive instrument panel is generated.
The virtual data assistant module of the application monitors that the overdue rate of a branch line is suddenly changed in 11 months by an abnormality detection algorithm, the abnormality score reaches 0.91 (threshold value 0.7), automatically invokes a large model to generate a root cause analysis report, indicates that a large number of secondary credit clients are added to the branch line in 11 months, and recommends that the credit auditing strength of new clients of the branch line is increased.
In another specific application scenario provided by the application, for example a patient data analysis scenario in the medical industry, a hospital manager inputs: "count the average hospitalization duration of patients in each department in 2023, analyze the differences in surgical success rate across age groups, generate a comparison report and highlight abnormal departments." The user interaction module receives the user's natural language requirement; after the visual report is displayed, the user can request through a voice command that the system enlarge the data area of a certain age range, and the system transmits the command to the relevant modules for processing. The large model interaction module parses the requirement and generates SQL instructions to acquire the data of department, AVG(hospitalization duration), age group and surgical success rate for patients with year = 2023; the visual configuration presents the departments' average hospitalization durations as a histogram for comparison and displays the surgical success rate of each age group. The data processing module processes the raw data containing 200,000 records; the cleaning algorithm dynamically selects an appropriate strategy, shortening the data cleaning time from 45 minutes with the traditional method to 8 minutes and improving integrity from 82% to 97%. The visual report generated by the visualization generation module, after aesthetic evaluation optimization, has reasonable information density, making it convenient to quickly grasp the key information.
In this scenario, the virtual data assistant module of the application detects that the success rate of orthopedic surgery is abnormally low in a certain age group, recommends a "comparative analysis of the underlying disease conditions of patients in this age group", and assists the hospital in analyzing the causes in depth.
In another specific application scenario provided by the application, such as an energy consumption monitoring scenario, a manufacturing enterprise uses the system to monitor plant energy consumption data. The virtual data assistant module finds that the per-unit-product power consumption of a certain production line suddenly rose by 40% between 2:00 and 4:00 on Tuesday, and the anomaly detection algorithm judges it to be an anomaly (anomaly score 0.92). The system rapidly generates a root cause analysis report, infers that equipment aging likely caused the increase in energy consumption, and recommends scheduling the equipment maintenance team for inspection while optimizing the production scheduling plan to avoid peak electricity periods. After the enterprise acted on the advice, the per-unit-product power consumption of the production line fell by 18% in the following week.
In summary, the present application provides natural-language-driven full-flow automated processing. Through the large model, it parses the user's natural language requirement, generates data processing instructions, plans a visual design scheme, and finally presents the visualization, forming an end-to-end automated flow that lowers the user's operation threshold, so that complex data analysis tasks can be completed without professional skills.
The application integrates a system of self-adaptive intelligent algorithms, covering a hierarchical analysis algorithm, a self-optimizing data cleaning algorithm, an intelligent chart recommendation algorithm, an aesthetic evaluation optimization algorithm, a personalized style generation algorithm, an anomaly detection and early warning algorithm, an intelligent recommendation algorithm, and the like. Through the principles, implementations, and mutual coordination of these algorithms, links such as data cleaning, chart recommendation, and style generation are dynamically optimized; compared with the traditional method, efficiency is improved by 60%-80%.
The application provides an active intelligent decision support mechanism. The virtual data assistant combines an anomaly detection algorithm with a knowledge graph to realize real-time data early warning, recommendation of potential analysis dimensions, and automatic generation of root cause analysis reports and image-text analysis reports, constructing a closed loop from data visualization to intelligent decision making that assists the user in quickly finding problems, locating causes, and formulating strategies.
The application provides a multi-modal interaction and personalized adaptation scheme. The system supports multi-modal input modes such as voice and annotation and, based on the user's historical behavior data, generates personalized visual styles through a visual generative adversarial network (V-GAN), meeting the requirements of differentiated users and improving user experience and analysis efficiency.
The application realizes incremental update based on a dependency graph together with multi-round dialogue optimization: in the large model interaction module, the nodes of the dependency graph affected by a correction instruction are marked, realizing an incremental update strategy of updating only the related modules; and semantic similarity between the historical dialogue and the current instruction is computed based on the Transformer architecture and the multi-head attention mechanism, realizing a multi-round dialogue optimization mechanism that accurately understands correction instructions.
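For illustration, the semantic-similarity matching between the historical dialogue and the current correction instruction might be sketched as follows; in the real scheme the embeddings would come from a Transformer encoder with multi-head attention, whereas the toy vectors below are assumptions of this sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_matching_turn(history_vecs, instruction_vec):
    """Index of the historical dialogue turn most similar to the correction instruction."""
    return max(range(len(history_vecs)),
               key=lambda i: cosine(history_vecs[i], instruction_vec))

# Toy 2-d embeddings: turn 0 is about chart style, turn 1 about data filtering.
history = [[1.0, 0.0], [0.0, 1.0]]
correction = [0.9, 0.1]          # hypothetical embedding of "make the bars blue"
idx = best_matching_turn(history, correction)  # matches turn 0
```

Only the dependency-graph nodes tied to the matched turn would then be re-executed, which is the incremental update described above.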
The scheme of the application can realize the following effects:
(1) The user threshold is reduced: non-professional users do not need to master complex data processing and visual design knowledge, and can obtain the required visual dashboards and reports simply by describing their requirements in natural language, greatly improving the usability of data analysis visualization.
(2) Efficiency is improved: compared with the traditional manual method of making dashboards and reports, the scheme responds quickly to user demands and automatically completes the whole process from data processing to visual generation, shortening work that originally took days to a few minutes or less, greatly improving the efficiency of data analysis and helping enterprises make decisions quickly.
(3) Flexibility is enhanced: the large model has strong understanding capability and can handle diversified user demands, whether simple single-indicator data visualization or complex multi-dimensional data analysis; the system can accurately understand them and generate corresponding visualization results, meeting the data analysis visualization needs of different users in different scenarios.
(4) Data value utilization is improved: more users can conveniently perform data analysis and visualization, which promotes the wide application of data in enterprises, mines more of the potential value behind the data, and provides powerful data support for enterprise development.
The foregoing describes various methods of embodiments of the present application. An apparatus for carrying out the above method is further provided below.
Referring to fig. 2, the embodiment of the application further provides a big data analysis visualization device based on a big model, which includes:
the receiving module 21 is configured to receive a natural language requirement input by a user, where the natural language requirement includes at least one of text, voice, and a chart interactive annotation instruction;
The first processing module 22 is configured to decompose the natural language requirement at the semantic layer, logic layer, and task layer by adopting a preset hierarchical analysis model, so as to generate an executable data analysis instruction;
an obtaining module 23, configured to obtain target data required by the data analysis instruction;
A second processing module 24, configured to generate a basic visual chart according to the data analysis instruction, the target data, a preset chart recommendation algorithm and an aesthetic evaluation optimization algorithm;
and the third processing module 25 is configured to learn user preference features through a visual generative adversarial network, dynamically adjust the basic visualization chart according to the user preference features, and generate a personalized visualization result corresponding to the natural language requirement.
Optionally, the big data analysis visualization device based on big model further includes:
The fourth processing module is used for monitoring, in real time through the virtual data assistant module, the original data stream associated with the personalized visualization result, calculating an anomaly score from the time-series predicted value and the actual value, triggering an anomaly early warning when the anomaly score satisfies the condition that N consecutive periods deviate from a preset value or the period-over-period growth rate exceeds a preset threshold, and generating an anomaly event record;
The fifth processing module is used for obtaining, when the anomaly early warning is detected, the weight of the anomaly event and the image-text report corresponding to the personalized visualization result, and generating a multi-modal decision report based on a multi-modal large model;
the updating module is used for collecting feedback data of the user according to the multi-mode decision report and updating the preset chart recommendation algorithm into a target algorithm;
And the generation module is used for regenerating a target visual result according to the target algorithm.
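By way of illustration only, the two early-warning triggers handled by the fourth processing module (N consecutive periods deviating from a preset value, or a period-over-period "ring ratio" growth rate above a preset threshold) might be sketched as follows; N, the deviation tolerance, and the growth limit are assumptions of this sketch.

```python
def should_warn(values, preset, n=3, deviation=0.2, growth_limit=0.3):
    """Return True if either early-warning condition holds for the series `values`."""
    # Trigger 1: the last n periods all deviate from the preset by more than `deviation`.
    recent = values[-n:]
    if len(recent) == n and all(abs(v - preset) > deviation for v in recent):
        return True
    # Trigger 2: the latest period-over-period growth rate exceeds `growth_limit`.
    if len(values) >= 2 and values[-2] != 0:
        growth = (values[-1] - values[-2]) / abs(values[-2])
        if growth > growth_limit:
            return True
    return False
```

A sudden 50% jump fires trigger 2, three persistently deviating periods fire trigger 1, and a flat series fires neither, after which the anomaly event record would be generated.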
Optionally, the fifth processing module includes:
the first processing unit is used for inputting, when the anomaly early warning is detected, the anomaly event record into a preset indicator association network to determine the weight of the anomaly event, wherein the indicator association network is constructed from text logs, time-series data, and image data through a time-series attention mechanism and a graph neural network;
A first determining unit, configured to map the weight of the abnormal event into a weight vector;
the second determining unit is used for determining text embedding characteristics and visual characteristics according to the image-text report;
The generation unit is used for concatenating the weight vector, the text embedding features, and the visual features into a multi-modal input sequence, inputting the sequence into a pre-trained multi-modal large model, and generating the multi-modal decision report.
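The splicing step performed by the generation unit is a plain concatenation of the three feature groups, which might be sketched as follows; the vector dimensions are illustrative assumptions.

```python
def build_multimodal_input(weight_vec, text_emb, visual_feat):
    """Concatenate anomaly-weight, text-embedding, and visual features
    into one input sequence for the multi-modal large model."""
    return list(weight_vec) + list(text_emb) + list(visual_feat)

# Hypothetical low-dimensional features for illustration.
sequence = build_multimodal_input([0.9], [0.1, 0.2], [0.3])
```

The real sequence would of course be far higher-dimensional; the point is only that the three modalities share one input sequence.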
Optionally, the first processing module 22 includes:
The second processing unit is used for extracting semantic layer features of the natural language requirement by adopting the BERT layer in the preset hierarchical analysis model combined with a conditional random field, and extracting a key entity set comprising the data time range, the analysis object, the analysis action type, and the visual form;
The third processing unit is used for performing logic layer conversion by adopting a graph traversal algorithm based on the domain knowledge graph in the preset hierarchical analysis model and the key entity set, to generate target pseudo code comprising data filtering conditions, aggregation dimensions, and output form;
And the fourth processing unit is used for performing task layer decomposition based on the module dependency graph in the preset hierarchical analysis model and the target pseudo code, to generate an executable target instruction corresponding to task generation or task correction.
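As an illustrative sketch only, the chain from key entity set to pseudo code (filtering conditions, aggregation dimensions, output form) and on to an executable instruction might look as follows; the entity names, the pseudo-code shape, and the generated SQL are assumptions of this sketch, with the BERT + CRF extraction itself stubbed out.

```python
def to_pseudo_code(entities):
    """Logic layer: turn the key entity set into filter / aggregate / output pseudo code."""
    return {
        "filter": f"year = {entities['time_range']}",
        "aggregate": f"AVG({entities['metric']}) BY {entities['dimension']}",
        "output": entities["chart_type"],
    }

def to_instruction(pseudo):
    """Task layer: turn the pseudo code into an executable SQL-like instruction."""
    expr, dim = pseudo["aggregate"].split(" BY ")
    return f"SELECT {dim}, {expr} FROM patients WHERE {pseudo['filter']}"

# Hypothetical entity set for the medical scenario described earlier.
entities = {"time_range": 2023, "metric": "stay_days",
            "dimension": "department", "chart_type": "histogram"}
pseudo = to_pseudo_code(entities)
sql = to_instruction(pseudo)
```

The output form ("histogram") is kept in the pseudo code so the downstream chart recommendation step can consume it.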
Optionally, the second processing module 24 includes:
The fifth processing unit is used for constructing a histogram or a heat map according to the data analysis instruction and the target data, based on a visual mapping knowledge base and a decision tree recommendation algorithm, wherein the visual mapping knowledge base comprises triple mapping relations established among indicator types, analysis targets, and chart types;
The sixth processing unit is used for optimizing the evaluation indices of the histogram or heat map based on a multi-objective genetic algorithm for aesthetic evaluation optimization, to determine an optimal solution set, wherein the evaluation indices comprise the symmetry, color contrast, and information density of the histogram or heat map;
And the seventh processing unit is used for iteratively adjusting the axis spacing, hue angle, and font size parameters of the histogram or heat map by utilizing the optimal solution set, to generate the basic visualization chart.
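As a deliberately simplified stand-in for the multi-objective genetic algorithm, the sketch below combines the three evaluation indices (symmetry, color contrast, information density) into one fitness score and improves the chart parameters (axis spacing, hue angle, font size) by seeded random search; the weights, parameter ranges, and fitness form are all assumptions of this sketch.

```python
import random

def fitness(params):
    """Toy aesthetic score: each index is best at a target value;
    distance from the target lowers the score."""
    symmetry = 1 - abs(params["axis_gap"] - 0.5)          # balanced axis spacing
    contrast = 1 - abs(params["hue_angle"] / 360 - 0.5)   # mid-wheel hue contrast
    density = 1 - abs(params["font_size"] - 12) / 12      # readable information density
    return 0.4 * symmetry + 0.3 * contrast + 0.3 * density

def optimize(params, rounds=200, seed=7):
    """Iteratively propose candidate parameter sets, keeping the best one."""
    rng = random.Random(seed)
    best = dict(params)
    for _ in range(rounds):
        cand = {"axis_gap": rng.uniform(0, 1),
                "hue_angle": rng.uniform(0, 360),
                "font_size": rng.uniform(6, 24)}
        if fitness(cand) > fitness(best):
            best = cand
    return best
```

A genuine multi-objective genetic algorithm would keep a Pareto front of non-dominated solutions rather than one scalar best, but the iterate-evaluate-adjust loop is the same shape.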
Optionally, the third processing module 25 includes:
an extraction unit, configured to extract style feature vectors of historical visual works through topic modeling based on the user's operation history log, wherein the style feature vectors comprise high-frequency color values, font usage preferences, and component layout density features;
a first construction unit, configured to construct a visual generative adversarial network according to the style feature vectors;
an eighth processing unit, configured to inject the parameters output by the generator of the visual generative adversarial network into the basic visualization chart, perform theme color mapping, dynamic font replacement, and responsive layout reconstruction, and generate a target result;
And a ninth processing unit, configured to verify, with the discriminator of the visual generative adversarial network, the consistency of the generated style of the target result with the user's historical works, and generate the personalized visualization result corresponding to the natural language requirement after the consistency check passes.
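The style-feature extraction performed by the extraction unit might, purely for illustration, be sketched as follows; the log field names and the three profile features are assumptions, and in the real scheme such features would feed the visual generative adversarial network.

```python
from collections import Counter

def style_profile(history_log):
    """Summarize a user's operation history into a style profile:
    high-frequency colors, preferred font, and average component layout density."""
    colors = Counter(entry["color"] for entry in history_log)
    fonts = Counter(entry["font"] for entry in history_log)
    return {
        "top_colors": [c for c, _ in colors.most_common(3)],
        "preferred_font": fonts.most_common(1)[0][0],
        "layout_density": sum(e["components"] for e in history_log) / len(history_log),
    }

# Hypothetical operation log entries for illustration.
log = [{"color": "#1f77b4", "font": "Arial", "components": 4},
       {"color": "#1f77b4", "font": "Arial", "components": 6},
       {"color": "#d62728", "font": "Serif", "components": 5}]
profile = style_profile(log)
```

The discriminator's consistency check then amounts to asking whether a generated chart plausibly belongs to the same profile.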
Optionally, the above-mentioned obtaining module 23 includes:
the second construction unit is used for constructing a self-optimizing data cleaning pipeline based on the reinforcement learning framework;
a tenth processing unit, configured to obtain the preset data required by the data analysis instruction and clean it through the self-optimizing data cleaning pipeline, where the data cleaning includes filling missing values in numeric fields of the preset data by K-nearest-neighbor interpolation and filling missing values in categorical fields of the preset data by a sliding window;
a detection unit, configured to perform joint outlier detection on the preset data after the data cleaning;
and a third determining unit, configured to determine the data that passes verification as the target data when the verification pass rate of the joint outlier detection is greater than a preset threshold.
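For illustration only, the two filling strategies named above (K-nearest-neighbor interpolation for numeric fields, a sliding window for categorical fields) might be sketched as follows; k, the window length, and the tie-breaking rules are assumptions of this sketch.

```python
def knn_fill(values, k=2):
    """Fill None in a numeric column with the mean of the k nearest (by index) known values."""
    filled = list(values)
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is None:
            nearest = sorted(known, key=lambda kv: abs(kv[0] - i))[:k]
            filled[i] = sum(v for _, v in nearest) / len(nearest)
    return filled

def window_fill(labels, window=3):
    """Fill None in a categorical column with the most frequent label
    observed in the preceding sliding window."""
    filled = list(labels)
    for i, v in enumerate(labels):
        if v is None:
            recent = [x for x in filled[max(0, i - window):i] if x is not None]
            filled[i] = max(set(recent), key=recent.count) if recent else None
    return filled
```

A production pipeline would measure distance over all feature columns rather than row index, and the reinforcement learning framework above would choose between such strategies per dataset.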
The device in this embodiment corresponds to the method described above, and the implementations in each of the above embodiments are applicable to this device embodiment, achieving the same technical effects. The device provided by the embodiment of the application can realize all the method steps of the method embodiment and achieve the same technical effects; the parts and beneficial effects identical to those of the method embodiment are not described in detail here.
The embodiment of the application further provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the program implements the processes of the above embodiment of the big data analysis visualization method based on the big model and can achieve the same technical effects, which are not repeated here to avoid repetition. The computer readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the application further provides a computer program product comprising computer instructions; when executed by a processor, the computer instructions implement the processes of the above embodiment of the big data analysis visualization method based on the big model and can achieve the same technical effects, which are not repeated here to avoid repetition.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or by means of hardware, though in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may make many further forms without departing from the spirit of the present application and the scope protected by the claims, all of which fall within the protection of the present application.