CN119169643B - A method for analyzing and judging the rationality of architecture diagrams based on multimodal feature fusion - Google Patents
- Publication number
- CN119169643B CN119169643B CN202411679083.1A CN202411679083A CN119169643B CN 119169643 B CN119169643 B CN 119169643B CN 202411679083 A CN202411679083 A CN 202411679083A CN 119169643 B CN119169643 B CN 119169643B
- Authority
- CN
- China
- Prior art keywords
- architecture diagram
- text
- feature
- architecture
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
- G06V10/811—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/146—Aligning or centring of the image pick-up or image-field
- G06V30/147—Determination of region of interest
Abstract
The invention relates to the field of intelligent architecture-diagram design rationality judgment, and discloses an architecture-diagram rationality analysis and judgment method based on multimodal feature fusion. The method acquires an architecture diagram to be analyzed; recognizes and extracts the text part of the diagram with an OCR module, recording the position of each piece of text; identifies a set of region-of-interest feature maps for the image elements through an R-CNN model, recording the position of each element; establishes the correspondence between the OCR text information and the position and logic information of the R-CNN image elements, generating an expanded architecture-diagram text part; performs word segmentation and word-granularity semantic coding on the expanded text to obtain a set of word-granularity feature vectors; generates comprehensive feature vectors of the same dimension; calculates semantic matching coefficients between the feature vectors; and judges whether the design of the architecture diagram to be analyzed meets the rationality requirements of the overall plan. The method can effectively evaluate whether the design and function of an architecture diagram meet the overall planning requirements, improving the accuracy and efficiency of analysis and review.
Description
Technical Field
The invention relates to the technical field of intelligent architecture-diagram design rationality judgment, and in particular to an architecture-diagram rationality analysis and judgment method based on multimodal feature fusion.
Background
Architecture diagrams are widely used as a key visualization tool in many fields, including engineering, construction, information technology, project management, and business. Combining graphics and text, these diagrams describe in detail the components of a system, organization, process, or project and their relationships to one another, providing a high-level overview that helps engineers, designers, and other stakeholders understand and communicate the design and implementation of complex systems more intuitively and deeply.
In conventional approaches, judging the rationality of an architecture-diagram design relies primarily on manual analysis and understanding: a professional must carefully read the individual components, lines, and arrows in the diagram and infer the relationships and roles among them. This approach has significant limitations. First, manual judgment is susceptible to individual subjectivity and experience, which may lead to inconsistent and biased results. Second, for large-scale and complex architecture diagrams, manual judgment demands a great deal of time and effort and is inefficient. This limitation is particularly evident when dealing with diagrams that contain large amounts of textual information and graphical elements.
With the development of information technology, using computer technology to judge the rationality of architecture-diagram designs intelligently has become a trend. Conventional automatic judgment methods generally rely on rules and templates, and struggle with complex and variable architecture diagrams. In recent years, with the rapid development of artificial intelligence and deep learning, multimodal fusion techniques based on large models offer a new solution for judging the rationality of architecture-diagram designs. Multimodal fusion can combine text and image features for comprehensive analysis and processing, markedly improving the accuracy and efficiency of judgment.
The multimodal-fusion-based technique for judging architecture-diagram design rationality can map text and image features into the same feature space using a pre-trained large model (such as CLIP) and capture the complex semantic relations between them. The technique not only addresses the subjectivity and inefficiency of traditional methods, but can also handle complex and variable architecture diagrams and provide more reliable judgment results.
Therefore, there is an urgent need for an optimized multimodal-fusion-based scheme for judging the design rationality of architecture diagrams, one that overcomes the limitations of traditional methods and achieves efficient and accurate judgment and review.
Disclosure of Invention
The invention aims to solve the problem that traditional automatic judgment methods generally depend on rules and templates and are difficult to apply to complex and variable architecture diagrams.
The invention adopts the following technical scheme for realizing the purposes:
A method for analyzing and judging the rationality of an architecture diagram based on multimodal feature fusion comprises the following steps:
S1, acquiring an architecture diagram to be analyzed, obtaining its image data as the input for subsequent processing;
S2, performing text recognition on the architecture diagram with an OCR module to obtain the architecture-diagram text part, recording the position of each piece of text information;
S3, processing the architecture diagram with an R-CNN model to obtain a set of region-of-interest feature maps for the image elements, recording the position of each image element;
S4, establishing the correspondence between the OCR text information and the position and logic information of the R-CNN image elements, and recording the associations between them;
S5, converting the positions and logical relations of the image elements into text descriptions, and combining them with the architecture-diagram text part to generate an expanded architecture-diagram text part;
S6, performing word-granularity semantic coding on the expanded text part to obtain a set of architecture-diagram text-description word-granularity feature vectors;
S7, performing word-granularity semantic coding on the architecture-design rationality-judgment text to obtain a set of rationality-judgment text feature vectors;
S8, performing multimodal feature fusion with a CLIP model to generate a comprehensive feature vector;
S9, matching the comprehensive feature vector against the rationality-judgment text feature vectors and calculating their semantic matching coefficient;
S10, comparing the semantic matching coefficient with a preset threshold to determine whether the architecture diagram to be analyzed meets the overall planning requirements.
Further, establishing the correspondence between the OCR text information and the position and logic information of the R-CNN image elements comprises the following steps:
checking whether each text box recognized by the OCR module overlaps any image-element bounding box identified by the R-CNN model, using the intersection-over-union (IoU) as the metric;
if the IoU of a text box and an image-element bounding box exceeds a preset threshold, recording the association between them;
and detecting containment relations between large and small bounding boxes, recording the hierarchical relations, thereby capturing the hierarchical structure and module relations in the architecture diagram.
Further, extracting the position information of text and image elements with the OCR module and the R-CNN model specifically comprises the following steps:
performing character recognition on the architecture diagram with the OCR module, extracting all text parts of the diagram, and recording the position of each piece of text information;
processing the architecture diagram with the R-CNN model, identifying and acquiring the set of region-of-interest feature maps of the image elements, and recording the position of each image element;
and converting the position and logic information of the image elements into text descriptions according to the correspondence established between the OCR module and the R-CNN model, combining the descriptions with the architecture-diagram text part to generate the expanded architecture-diagram text part.
Further, extracting local semantic feature vectors of the image-element regions of interest from their feature maps specifically comprises the following steps:
upsampling the region-of-interest feature map to obtain an upsampled region-of-interest feature map;
performing pointwise (1x1) convolutional encoding on the upsampled feature map to obtain a channel-modulated upsampled feature map;
and performing two-dimensional convolutional encoding on the channel-modulated upsampled feature map to obtain the local semantic feature vectors of the image-element regions of interest.
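The upsampling and pointwise (1x1) convolution steps above amount to nearest-neighbour repetition followed by per-pixel channel mixing; a minimal NumPy sketch with toy shapes and weights (all dimensions and values here are illustrative assumptions, not the patent's parameters):

```python
import numpy as np

def upsample_nearest(fmap, scale=2):
    # fmap: (C, H, W) -> nearest-neighbour upsample to (C, H*scale, W*scale)
    return fmap.repeat(scale, axis=1).repeat(scale, axis=2)

def pointwise_conv(fmap, weights):
    # weights: (C_out, C_in); a 1x1 convolution is the same channel-mixing
    # matrix applied independently at every spatial location
    c, h, w = fmap.shape
    flat = fmap.reshape(c, h * w)         # (C_in, H*W)
    out = weights @ flat                  # (C_out, H*W)
    return out.reshape(weights.shape[0], h, w)

roi = np.ones((3, 4, 4))                  # toy 3-channel ROI feature map
up = upsample_nearest(roi)                # (3, 8, 8)
w = np.eye(3) * 2.0                       # toy channel-modulation weights
modulated = pointwise_conv(up, w)         # channel-modulated upsampled map
```

The subsequent two-dimensional convolutional encoding would then mix spatial neighbourhoods of this channel-modulated map to produce the local semantic feature vectors.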
Further, performing word-granularity semantic coding on the architecture-diagram text part to obtain the set of architecture-diagram text-description word-granularity feature vectors specifically comprises:
after word segmentation of the architecture-diagram text part, obtaining the set of word-granularity feature vectors through a semantic encoder with a word-embedding layer contained in the large model.
Performing word-granularity semantic coding on the architecture-design rationality-judgment text likewise comprises: after word segmentation of the judgment text, obtaining its set of word-granularity feature vectors through the same semantic encoder with a word-embedding layer contained in the large model.
Further, obtaining the set of word-granularity feature vectors of the architecture-diagram text description through the semantic encoder with a word-embedding layer specifically comprises the following steps:
performing word segmentation on the architecture-diagram text part and the rationality-judgment text to convert each into a sequence of words;
mapping each word in the sequence to a word-embedding vector with the encoder's embedding layer, obtaining a sequence of word-embedding vectors;
and performing transformer-based global-context semantic encoding on the sequence of word-embedding vectors with the encoder's converter, obtaining the set of architecture-diagram text-description word-granularity feature vectors.
Further, determining whether the architecture diagram to be analyzed meets the overall planning requirements, based on comparing the semantic matching coefficient between the feature sequences with a preset threshold, comprises: in response to the coefficient being greater than the preset threshold, determining that the architecture diagram meets the overall planning requirements.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a multimodal-fusion-based method for judging the rationality of architecture-diagram designs. It uses the OCR (optical character recognition) capability of a large model to identify text information in architecture diagrams, including but not limited to component names, labels, and notes. Graphic elements in the diagram, such as blocks, lines, and arrows, are identified by computer-vision techniques, and their positions and logical and pointing relations are understood. The method performs multimodal fusion of the labeled text feature vectors and the image-element feature vectors to generate a comprehensive feature vector, matches it against the architecture-design rationality-judgment feature vectors, calculates the semantic matching coefficient, and judges whether the architecture diagram to be analyzed meets the overall planning requirements by comparing the coefficient with a preset threshold. The method automates comprehensive analysis of architecture diagrams and overcomes the subjectivity and inefficiency of traditional manual analysis.
Drawings
FIG. 1 is a flow chart of a method for judging architecture-diagram design rationality based on feature fusion according to an embodiment of the application;
FIG. 2 is a system-architecture diagram for judging architecture-diagram design rationality based on feature fusion according to an embodiment of the application;
FIG. 3 is a block diagram of a system for judging architecture-diagram design rationality based on feature fusion according to an embodiment of the application;
FIG. 4 is a diagram of the departmental informatization construction architecture of a production supervision department in a certain region.
Reference numerals: 310, image acquisition module; 320, character recognition module; 330, region-of-interest extraction module; 335, position and logic information processing module; 337, position description generation module; 350, semantic coding module; 360, comprehensive feature vector generation module; 370, semantic matching coefficient calculation module; 380, judgment result generation module.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1 and FIG. 2, a method for judging the rationality of an architecture diagram based on multimodal fusion comprises the following steps:
S1, acquiring an architecture diagram to be analyzed;
The image data of the architecture diagram to be analyzed is obtained as the input for subsequent processing, such as the departmental informatization construction architecture diagram of a production supervision department in a certain region shown in FIG. 4.
S2, performing character recognition on the architecture diagram with an OCR module to obtain the architecture-diagram text part and record the position of each piece of text information;
Text recognition of the architecture diagram can be handled automatically by the OCR module, avoiding the tedious process of manually identifying text information piece by piece and saving time and labor. OCR modules are usually trained on large amounts of data and have high accuracy and recognition capability; compared with manual recognition, they can recognize the text content of the diagram faster and more accurately. The architecture-diagram text part often contains key information such as component names, labels, and notes, which supports the subsequent semantic understanding and comprehensive analysis of the diagram. The position of each piece of text information in the diagram is recorded, including the upper-left and lower-right corner coordinates of the text box (e.g., x1, y1, x2, y2).
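As a sketch of how the recorded text positions could look in code, the helper below flattens word-level OCR output into (text, x1, y1, x2, y2) records. The dictionary layout mirrors pytesseract's `image_to_data` output, but the field names and the helper itself are illustrative assumptions, not part of the patent.

```python
def collect_text_boxes(ocr_data, min_conf=0.0):
    """Turn word-level OCR output (a pytesseract image_to_data style dict,
    assumed layout) into (text, x1, y1, x2, y2) records, skipping blanks."""
    boxes = []
    for i, word in enumerate(ocr_data["text"]):
        if not word.strip():
            continue                      # skip empty OCR cells
        if float(ocr_data["conf"][i]) < min_conf:
            continue                      # drop low-confidence words
        x1, y1 = ocr_data["left"][i], ocr_data["top"][i]
        x2 = x1 + ocr_data["width"][i]    # lower-right corner from width/height
        y2 = y1 + ocr_data["height"][i]
        boxes.append((word, x1, y1, x2, y2))
    return boxes
```

Each record then carries exactly the (x1, y1, x2, y2) corner coordinates the step above calls for.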
S3, identifying and acquiring the set of region-of-interest feature maps of the image elements with an R-CNN model, and recording the position of each image element;
An architecture diagram contains various graphic elements, such as blocks, lines, and arrows, which carry interrelated position, logic, and pointing relations. To capture the hidden correlations between these elements, the R-CNN model processes the architecture diagram to obtain the set of region-of-interest feature maps of its image elements. R-CNN is a deep-learning model for object detection that can identify objects of interest in different areas of an image and extract their features. In architecture-diagram parsing, it helps identify image elements such as blocks, lines, and arrows. The model can automatically learn and capture each element's local region of interest and implicit feature information, which helps the subsequent steps understand the layout and connections of the elements more accurately, and thus the content and structure of the diagram more comprehensively, providing important information for judging whether the diagram meets the overall planning requirements. The position of each image element is recorded, including its bounding-box coordinates (e.g., x1, y1, x2, y2).
S4, establishing the correspondence between the OCR text information and the position and logic information of the R-CNN image elements;
For each text box identified by the OCR module, it is checked whether it overlaps any image-element bounding box identified by the R-CNN model, using the intersection-over-union (IoU, Intersection over Union) as the metric; a higher IoU indicates a greater degree of overlap between the two boxes. If the IoU of a text box and an image-element bounding box exceeds a preset threshold (e.g., 0.5), the text information and the image element are considered to correspond, and the association between them is recorded. If a small bounding box lies completely within a large bounding box, the small box is considered to be contained by the large one. The hierarchical relation is then recorded: the relation between a large bounding box and the small bounding boxes it contains. For example, if a large box contains four small boxes, the content corresponding to the large box is considered to include the content of the four small boxes. In this way, the hierarchical and modular relations in the architecture diagram can be captured.
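The IoU test and the containment check described above can be sketched as plain functions; the (x1, y1, x2, y2) tuple format and the 0.5 default threshold follow the example in the text, while the function names are illustrative:

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2); returns intersection-over-union
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def contains(outer, inner):
    # True when `inner` lies completely inside `outer` (hierarchy capture)
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def associate(text_boxes, element_boxes, threshold=0.5):
    # record (text_idx, element_idx) pairs whose IoU exceeds the threshold
    return [(i, j) for i, t in enumerate(text_boxes)
            for j, e in enumerate(element_boxes) if iou(t, e) > threshold]
```

For example, two unit-offset 2x2 boxes overlap in a single unit square, giving IoU 1/7, which would fall below the 0.5 threshold and produce no association.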
S5, converting the positions and logical relations of the image elements into text descriptions;
The position and logic information of the image elements is converted into text descriptions according to the correspondence established between the OCR module and the R-CNN model. For example: the "service center" includes 3 modules, namely "flow and rule management", "user and authority management", and "system and operation management". These position and logic descriptions are merged into the corresponding architecture-diagram text part to generate the expanded architecture-diagram text part. In this way, the spatial, positional, and logical information of the image elements is converted into text descriptions that are easier to understand and process, providing richer information for the subsequent semantic analysis and comprehensive judgment.
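A minimal sketch of turning recorded containment relations into the kind of sentence given in the example ('"service center" includes 3 modules ...'). The data layout (`labels` and `children` maps keyed by element id) is a hypothetical choice for illustration, not the patent's representation:

```python
def describe_hierarchy(elements, labels, children):
    """elements: element ids; labels: id -> OCR text; children: id -> contained ids.
    Emits one sentence per parent element that contains sub-modules."""
    lines = []
    for parent in elements:
        kids = children.get(parent, [])
        if kids:
            names = ", ".join(f'"{labels[k]}"' for k in kids)
            lines.append(f'"{labels[parent]}" includes {len(kids)} modules: {names}.')
    return lines
```

The resulting sentences are appended to the OCR text part to form the expanded architecture-diagram text part.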
S6, performing word segmentation and semantic coding on the architecture-diagram text part (labeled text plus position and logical-relation descriptions) to obtain the set of architecture-diagram text-description word-granularity feature vectors;
To capture the textual semantic information in the architecture diagram, including important semantics such as component names, labels, and notes, the text part is word-segmented and then passed through a semantic encoder with a word-embedding layer contained in a large model to obtain the set of word-granularity feature vectors. Word segmentation converts the expanded text part into a more structured, processable form and helps identify key words and phrases, so that the meaning and context of the text can be better understood. The specific method comprises the following steps:
performing word segmentation on the architecture-diagram text part to convert it into a sequence of words;
mapping each word in the sequence to a word-embedding vector with the embedding layer of the semantic encoder, obtaining a sequence of word-embedding vectors;
and performing transformer-based global-context semantic encoding on the sequence of word-embedding vectors with the encoder's converter, obtaining the set of architecture-diagram text-description word-granularity feature vectors.
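The segmentation-then-embedding steps can be illustrated with a toy lookup table. A real system would use a trained segmenter and the large model's own embedding layer, so the whitespace tokenizer, the vocabulary, and the 3-dimensional table below are all stand-in assumptions:

```python
import numpy as np

def tokenize(text):
    # toy word segmentation: whitespace split (real systems use a trained segmenter)
    return text.lower().split()

def embed_sequence(words, vocab, table):
    # map each word to its embedding row; unknown words get a zero vector
    dim = table.shape[1]
    return np.stack([table[vocab[w]] if w in vocab else np.zeros(dim)
                     for w in words])

vocab = {"service": 0, "center": 1, "module": 2}
table = np.arange(9, dtype=float).reshape(3, 3)   # toy 3-d embedding table
seq = embed_sequence(tokenize("Service center module"), vocab, table)
```

The resulting sequence of word-embedding vectors is what the transformer-based encoder would then contextualize into the final word-granularity feature vectors.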
S7, performing word-granularity semantic coding on the architecture-design rationality-judgment text to obtain its set of feature vectors;
To capture the semantic information in the rationality-judgment text, which includes important content such as functional requirements, design standards, and evaluation indicators, the judgment text is likewise word-segmented and passed through the semantic encoder with a word-embedding layer contained in the large model to obtain its set of word-granularity feature vectors. The specific method comprises the following steps:
performing word segmentation on the rationality-judgment text to convert it into a sequence of words;
mapping each word in the sequence to a word-embedding vector with the encoder's embedding layer, obtaining a sequence of word-embedding vectors;
and performing transformer-based global-context semantic encoding on the sequence of word-embedding vectors with the encoder's converter, generating the set of word-granularity feature vectors of the rationality-judgment text.
The rationality-judgment text consists of specific judgment rules. For example, in a certain project, to ensure the consistency of the architecture-diagram design with the overall plan, the rules include the following aspects:
1. Consistency between the architecture diagram and the text description: every module, component, and element in the diagram must be consistent with its corresponding text description.
2. Consistency between the architecture diagram and the tasks: the diagram should clearly embody the positioning of the planned tasks and remain consistent with the main tasks.
3. Interface relations: describe the interfaces between the system and other relevant systems or platforms, ensuring that the interfaces are complete and meet the relevant standards and control rules.
4. Completeness of the overall framework: check that content such as the common application support platform, data convergence and sharing, and unified portal channels is clearly presented.
S8, carrying out multi-mode feature fusion by using the CLIP model to generate a comprehensive feature vector;
And converting the image elements into feature vectors by using a CLIP pre-training model, and fusing the architecture icon text feature vectors and the image element feature vectors to generate comprehensive feature vectors. The specific method comprises the following steps:
Feature extraction: the annotated text feature vectors and the image element feature vectors are extracted using the text encoder and the image encoder of the CLIP model, respectively.
Feature space alignment: the CLIP model maps the text and image feature vectors into the same multi-modal feature space, so that text and image features with the same semantics are close together in that space. In this way, the CLIP model is able to capture complex semantic relationships between text and images.
Feature fusion: the text feature vectors and the image feature vectors are weighted and fused to generate the comprehensive feature vector. A self-attention mechanism weights and fuses the annotated text feature vectors and the image element feature vectors, dynamically assigning weights according to the content of the feature vectors and capturing the complex relationships between text and images. Specifically, the annotated text feature vectors and the image element feature vectors are input into a self-attention model, and the weighted fused feature vector is obtained by computing the attention score between each feature vector and every other feature vector. Finally, the fused feature vector is further processed by a multi-layer perceptron (MLP) to improve the richness and discrimination capability of the feature representation. The multi-layer perceptron consists of several fully connected layers and applies a nonlinear activation function to the feature vectors, enhancing the expressive capability of the features.
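The fusion step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the CLIP text and image encoders are replaced by random placeholder vectors, and `fuse_features`, `w1`, and `w2` are hypothetical names (in practice the MLP weights would be learned).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_features(text_feats, image_feats, w1, w2):
    """Self-attention-weighted fusion of text and image feature vectors,
    followed by a two-layer MLP, pooled into one comprehensive vector."""
    tokens = np.vstack([text_feats, image_feats])          # (n, d) stacked tokens
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])  # pairwise attention scores
    attn = softmax(scores, axis=-1)                        # dynamic weights per token
    fused = attn @ tokens                                  # weighted fusion
    hidden = np.maximum(fused @ w1, 0.0)                   # fully connected + ReLU
    out = hidden @ w2                                      # second fully connected layer
    return out.mean(axis=0)                                # pooled comprehensive vector

# Toy example: 2 placeholder "text" vectors and 3 "image" vectors, dimension 4.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(2, 4))
image_feats = rng.normal(size=(3, 4))
w1 = rng.normal(size=(4, 8))
w2 = rng.normal(size=(8, 4))
vec = fuse_features(text_feats, image_feats, w1, w2)
```

In a real pipeline the placeholder vectors would come from the CLIP encoders, which already place both modalities in a shared space, so the attention scores are semantically meaningful.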
S9, matching the comprehensive feature vector with the architecture diagram design rationality judgment text feature vector, and calculating their semantic matching coefficient;
In the step of matching the comprehensive feature vector with the architecture diagram design rationality judgment feature vector to calculate their semantic matching coefficient, it must first be ensured that the two feature vectors lie in the same vector space. The specific method comprises the following steps:
The two feature vector sets are standardized so that they are comparable on the same scale. Standardization can be carried out using the Z-score method, which scales the feature vectors to the same scale, eliminating magnitude differences among different features and ensuring the accuracy of the matching result.
The semantic matching degree between the comprehensive feature vectors and the architecture diagram design rationality judgment feature vectors is evaluated by computing the cosine similarity between them. Cosine similarity is a commonly used index for measuring the similarity between two vectors, calculated as the ratio of the inner product of the two vectors to the product of their norms. Specifically, for each pair of comprehensive feature vector and architecture diagram design rationality judgment feature vector, the cosine similarity value is calculated; its range is between -1 and 1, and a value closer to 1 indicates higher semantic similarity between the two vectors.
The calculated semantic matching coefficient is then compared with a preset threshold to determine the degree of matching between the comprehensive feature vector of the architecture diagram and the architecture diagram design rationality judgment feature vector. If the semantic matching coefficient is above the preset threshold, the comprehensive feature vector is considered to match the architecture diagram design rationality judgment feature vector semantically, indicating that the design and function of the architecture diagram meet the overall planning requirements; if it is below the preset threshold, a mismatch exists in certain aspects and further analysis and adjustment are needed.
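The standardization and matching step above can be sketched with toy data. The function names and the `THRESHOLD` value are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def zscore(vectors):
    """Z-score standardization: scale each feature to zero mean and unit variance,
    eliminating magnitude differences between features."""
    mean = vectors.mean(axis=0)
    std = vectors.std(axis=0)
    return (vectors - mean) / np.where(std == 0, 1.0, std)  # guard against zero std

def cosine_similarity(a, b):
    """Ratio of the inner product to the product of the vector norms; in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

THRESHOLD = 0.8  # assumed preset threshold

# Toy feature vectors.
a = np.array([1.0, 2.0, 3.0])
standardized = zscore(np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]))
sim_same = cosine_similarity(a, 2 * a)  # parallel vectors: similarity near 1
sim_opp = cosine_similarity(a, -a)      # opposite vectors: similarity near -1
matched = sim_same >= THRESHOLD         # semantic match decision
```

Cosine similarity depends only on direction, which is why the standardization step matters: it keeps one large-magnitude feature from dominating the direction of every vector.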
S10, determining whether the architecture diagram to be analyzed meets the overall planning requirements based on the comparison between the semantic matching coefficient of the comprehensive feature vector and the architecture design rationality judgment text (i.e., the review rule text) feature vector and a preset threshold;
In this step, statistics and analysis must first be carried out on all the semantic matching coefficients. The specific steps are as follows:
For each pair of comprehensive feature vector and architecture diagram design rationality judgment feature vector, the semantic matching coefficient is compared with the preset threshold, and the number and proportion of feature pairs whose matching coefficients exceed the threshold are counted. The preset threshold is determined according to the actual application scenario and requirements; generally, a critical value that can effectively distinguish matching from non-matching is selected to ensure the accuracy and reliability of the evaluation result.
A comprehensive evaluation is then carried out based on the statistics of the matching coefficients. Matching pairs above the preset threshold are considered semantically highly consistent, indicating that the corresponding parts of the architecture diagram meet the requirements of the review rules; matching pairs below the preset threshold are considered semantically divergent, indicating that parts of the architecture diagram may not conform to, or deviate from, the review rules. By comprehensively analyzing the matching results, it can be determined whether the architecture diagram as a whole conforms to the planning requirements and which parts need further optimization and adjustment.
Finally, the evaluation results are summarized into a detailed review report. The review report should include overall match statistics, the parts that meet the planning requirements, the parts that do not meet them together with the specific reasons, improvement suggestions, and so on. In this way, the overall planning compliance of the architecture diagram can be clarified, targeted improvement measures can be provided, and effective adjustment and optimization by designers in subsequent work is facilitated. Through such a systematic review process, the design and implementation of the architecture diagram can better conform to the intended plan, providing strong support for the successful implementation of the project.
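The statistics-and-report step above can be sketched as a small aggregation routine. `review_report`, its field names, and the toy coefficients and labels below are hypothetical:

```python
def review_report(match_coeffs, labels, threshold=0.8):
    """Compare each semantic matching coefficient against the threshold and
    summarize the overall match ratio plus compliant and non-compliant parts."""
    compliant = [lab for lab, c in zip(labels, match_coeffs) if c >= threshold]
    non_compliant = [lab for lab, c in zip(labels, match_coeffs) if c < threshold]
    return {
        "total_pairs": len(match_coeffs),
        "matched": len(compliant),
        "match_ratio": len(compliant) / len(match_coeffs),
        "compliant_parts": compliant,           # parts meeting the review rules
        "non_compliant_parts": non_compliant,   # parts needing further adjustment
    }

# Toy data: three feature pairs with their matching coefficients.
report = review_report(
    [0.92, 0.75, 0.88],
    ["unified portal", "data sharing", "support platform"],
    threshold=0.8,
)
```

A production report would additionally attach the specific review rule violated and an improvement suggestion to each non-compliant part, as the text describes.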
The invention has the following beneficial effects:
Automatic processing: text information in the architecture diagram is automatically identified by the OCR module, avoiding the tedious process of manual item-by-item identification and saving time and labor costs.
Accuracy and efficiency: text and image element information in the architecture diagram is quickly and accurately identified by leveraging the high accuracy and recognition capability of a large model trained on a large amount of data.
Multi-modal fusion: the annotated text feature vectors and the image element feature vectors of the architecture diagram are effectively fused through multi-modal fusion to generate feature vectors containing comprehensive semantic information.
Comprehensive evaluation: techniques such as the self-attention mechanism and the multi-layer perceptron enhance the richness and discrimination capability of the feature representation, ensuring that the design and function of the architecture diagram meet the overall planning requirements.
Semantic matching: by computing the semantic matching coefficient between the comprehensive feature vector and the architecture diagram design rationality judgment feature vector and comparing it with a preset threshold, the design rationality of the architecture diagram is systematically evaluated.
Comprehensive analysis: element semantic features in the architecture diagram are captured at multiple scales and levels and systematically evaluated, and a detailed review report is provided, including the parts that do and do not meet the planning requirements along with specific reasons and improvement suggestions.
Scientific support: deep support for system design and planning ensures that the design and function of the architecture diagram better conform to the intended plan, providing a strong guarantee for successful project implementation.
As shown in fig. 3, the present invention further provides a system for judging the rationality of architecture diagram design based on feature fusion, comprising:
The image acquisition module 310, configured to acquire image data of the architecture diagram to be analyzed as input for subsequent processing.
The character recognition module 320, configured to perform character recognition on the architecture diagram to be analyzed using the OCR module, extract all text parts of the architecture diagram, including component names, labels, and annotations, and record the location of each piece of text information.
The region of interest extraction module 330, configured to process the architecture diagram to be analyzed using the R-CNN model, identify and acquire a set of feature maps of the regions of interest of image elements, such as boxes, lines, and arrows, and record the position of each image element.
The position and logic information processing module 335, configured to establish the correspondence between the text information from the OCR module and the position and logic information of the R-CNN model image elements. For each text box identified by the OCR module, it is checked whether it overlaps with any image element bounding box identified by the R-CNN model, using the Intersection over Union (IoU) as the measure. If the IoU value of a text box and an image element bounding box exceeds a preset threshold, the text information and the image element are considered to correspond, and the association between them is recorded.
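The IoU-based association test described above can be sketched as follows. `iou`, `IOU_THRESHOLD`, and the sample boxes are illustrative assumptions, with boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])          # intersection rectangle
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

IOU_THRESHOLD = 0.5  # assumed preset threshold

# An OCR text box and an R-CNN element box that only partially overlap.
text_box = (0, 0, 10, 10)
element_box = (5, 5, 15, 15)
associated = iou(text_box, element_box) >= IOU_THRESHOLD
```

Here the overlap is 25 / 175 of the union, well below the threshold, so the text box and the element would not be associated.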
The position description generating module 337 converts the position and logic information of the image elements into text descriptions according to the correspondence established between the OCR module and the R-CNN model. For example: 10 modules are included under the "service center", and the "data center" is located in the middle left position of the figure. The position and logic information descriptions are merged into the corresponding architecture diagram text part to generate an extended architecture diagram text part.
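One possible way to generate positional phrases like the "middle left" example above is a coarse three-by-three grid over the diagram; `position_description` and the grid scheme are illustrative assumptions, not the patented method:

```python
def position_description(name, cx, cy, width, height):
    """Map an element's center point (cx, cy) within a diagram of the given
    width and height to a coarse textual position such as 'middle left'."""
    horiz = ["left", "center", "right"][min(int(3 * cx / width), 2)]
    vert = ["top", "middle", "bottom"][min(int(3 * cy / height), 2)]
    return f'the "{name}" is located in the {vert} {horiz} position of the figure'

# An element centered at (100, 500) in a 1000x1000 diagram.
desc = position_description("data center", 100, 500, 1000, 1000)
```

The generated sentence can then be appended to the architecture diagram text part so that spatial layout becomes visible to the downstream text encoder.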
The semantic coding module 350, configured to perform word segmentation and semantic coding on the extended architecture diagram text part and the architecture diagram design rationality judgment part to obtain a word-granularity feature vector set of the architecture diagram description and a word-granularity feature vector set of the architecture diagram design rationality judgment description.
The comprehensive feature vector generation module 360, configured to fuse the architecture diagram annotation text feature vector set with the image element region of interest feature vector set to generate the comprehensive feature vector. It specifically comprises three steps: feature extraction, feature space alignment, and feature fusion. The richness and discrimination capability of the feature representation are improved by processing the fused feature vectors with a self-attention mechanism and a multi-layer perceptron (MLP).
The semantic matching coefficient calculating module 370, configured to match the comprehensive feature vector with the architecture diagram design rationality judgment feature vector and calculate their semantic matching coefficient. It ensures that the comprehensive feature vector and the architecture diagram design rationality judgment feature vector lie in the same feature space so that they can be matched effectively, computes the cosine similarity of each pair of feature vectors, and evaluates their semantic matching degree.
The judging result generating module 380, configured to determine whether the architecture diagram to be analyzed meets the overall planning requirements based on the comparison between the semantic matching coefficients of the feature sequences and the preset threshold. It carries out statistics and analysis on all semantic matching coefficients, comprehensively evaluates the overall matching degree of the architecture diagram, and generates a detailed review report including overall match statistics, the parts that meet the planning requirements, the parts that do not, specific reasons, improvement suggestions, and so on.
The present invention is not limited to the preferred embodiments; the scope of patent protection is defined by the claims, and all equivalent structural changes made using the specification and drawings are included within the scope of the invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411679083.1A CN119169643B (en) | 2024-11-22 | 2024-11-22 | A method for analyzing and judging the rationality of architecture diagrams based on multimodal feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN119169643A CN119169643A (en) | 2024-12-20 |
CN119169643B true CN119169643B (en) | 2025-04-01 |
Family
ID=93884353
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119398954A (en) * | 2025-01-03 | 2025-02-07 | 上海微创软件股份有限公司 | Financial data management system based on large language model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116229482A (en) * | 2023-02-03 | 2023-06-06 | 华北水利水电大学 | Visual multimodal text detection, recognition and error correction methods in network public opinion analysis |
CN117972359A (en) * | 2024-03-28 | 2024-05-03 | 北京尚博信科技有限公司 | Intelligent data analysis method based on multi-mode data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||