
CN120336505A - Test question management method based on multimodal adaptive similarity learning - Google Patents

Test question management method based on multimodal adaptive similarity learning

Info

Publication number
CN120336505A
Authority
CN
China
Prior art keywords
test question
word
similarity
module
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510434754.6A
Other languages
Chinese (zh)
Inventor
刘雄华
张杰
杨洋
陈德
何顶新
闵捷
姚珊珊
李娟
刘海
刘婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Technology and Business University
Original Assignee
Wuhan Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Technology and Business University filed Critical Wuhan Technology and Business University
Priority to CN202510434754.6A
Publication of CN120336505A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Educational Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Mathematical Analysis (AREA)
  • Animal Behavior & Ethology (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a test question management method based on multimodal adaptive similarity learning, relating to the technical field of data management. The method collects an original test question set, generates an evolution test question set through a preset large language model, generates a variation test question set using data enhancement techniques, and fuses the three sets into a test question ecological library, enriching the diversity and coverage of test question resources. The test question ecological library is then fed into a preset semantic similarity model and a preset syntax similarity model to obtain a semantic similarity score and a syntax similarity score for each test question, which helps measure question similarity comprehensively and improves screening accuracy. A multimodal similarity score obtained from the semantic and syntax similarity scores is compared with a preset threshold, and each test question is imported into either an auditing module or a question bank storage module according to the comparison result, enabling intelligent distribution of test questions and safeguarding the timeliness and quality of the question bank.

Description

Test question management method based on multimodal adaptive similarity learning
Technical Field
The application relates to the technical field of data management, and in particular to a test question management method based on multimodal adaptive similarity learning.
Background
With the rapid development of artificial intelligence and big data technology, the intelligent transformation of the education field is advancing rapidly, and intelligent question banks play a key role in this process. By automating the management of test question data, intelligent question banks not only optimize the distribution of educational resources but also promote personalized teaching, playing an important role in improving teaching quality, ensuring educational fairness, and meeting diverse educational demands.
Although intelligent question bank management technology has made remarkable progress, challenges remain in accurately evaluating question similarity and optimizing the resource warehousing process when facing massive question data sets. Therefore, to ensure the timeliness and accuracy of question bank resources and reduce knowledge point redundancy, it is necessary to perform multimodal similarity analysis on the questions and refine the question warehousing process.
Disclosure of Invention
The main aim of the application is to provide a test question management method based on multimodal adaptive similarity learning, intended to solve the technical problems that, in existing test question management, question similarity is difficult to evaluate accurately, which affects the update timeliness and the quality of questions in the question bank.
In order to achieve the above purpose, the present application provides a test question management method based on multi-mode adaptive similarity learning, the method comprising:
collecting an original test question set, generating an evolution test question set through a preset large language model, generating a variation test question set by adopting a data enhancement technology, and performing fusion processing on the original test question set, the evolution test question set and the variation test question set to form a test question ecological library;
Respectively inputting the test question ecological library into a preset semantic similarity model and a preset syntax similarity model to obtain semantic similarity scores and syntax similarity scores of all the test questions in the test question ecological library;
And obtaining a multi-modal similarity score according to the semantic similarity score and the syntax similarity score, comparing the multi-modal similarity score with a preset threshold value, and respectively importing each test question into an auditing module or a question bank storage module according to the comparison result.
In an embodiment, the step of collecting an original test question set, generating an evolution test question set by a preset large language model, generating a variation test question set by a data enhancement technology, and performing fusion processing on the original test question set, the evolution test question set and the variation test question set to form a test question ecological library includes:
Setting a data acquisition range, and acquiring an original test question set in a data source based on the data acquisition range;
Constructing a prompting template containing semantic constraint conditions, and guiding a preset large language model to generate an evolution test question set based on an original test question set through the prompting template;
Expanding and mutating the original test question set by a data enhancement mode of synonym replacement, sentence pattern recombination and knowledge point expansion to generate a variation test question set;
and fusing the original test question set, the evolution test question set and the variation test question set to obtain fused test question data, and performing deduplication processing on the fused test question data to form a standardized test question ecological library.
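The fusion and deduplication step described above can be illustrated with a minimal Python sketch; the normalization rule used as the duplicate key here is an illustrative assumption, not the application's exact procedure:

```python
# Hypothetical sketch: fuse the original, evolution and variation question
# sets, then drop near-identical items by a normalized-text key.
import re

def normalize(text: str) -> str:
    """Collapse case, whitespace and punctuation so trivial variants collide."""
    return re.sub(r"[^\w\u4e00-\u9fff]+", "", text.lower())

def build_ecological_library(ori, evo, mut):
    """Fuse the three question sets and keep the first copy of each duplicate."""
    seen, library = set(), []
    for question in [*ori, *evo, *mut]:
        key = normalize(question)
        if key and key not in seen:
            seen.add(key)
            library.append(question)
    return library
```

A production system would likely also apply the semantic deduplication described later, rather than relying on surface normalization alone.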
In an embodiment, the step of inputting the test question ecological base into a preset semantic similarity model to obtain the semantic similarity score of each test question in the test question ecological base includes:
inputting the test question ecological library into a preset semantic similarity model, wherein the preset semantic similarity model comprises a SemGloVe module, a preset BERT module, a graph neural network module and a similarity scoring module;
in the SemGloVe module, establishing inter-vocabulary semantic associations of the test question text through word co-occurrence matrix analysis, and extracting word-level similarity features in combination with an attention mechanism to construct a word similarity matrix;
Integrating word similarity matrixes through a preset BERT module to perform multi-level semantic characterization, and generating word embedding vectors with semantic information;
Constructing an adjacency matrix according to the word embedding vector through the graph neural network module, and generating sentence vectors based on adjacency node information aggregation in the adjacency matrix;
And in the similarity scoring module, sequentially processing the sentence vectors through a fully connected layer and a Softmax layer to obtain the semantic similarity score of each test question.
In one embodiment, in the SemGloVe module, the step of establishing inter-vocabulary semantic association of the test question text through word co-occurrence matrix analysis and extracting word-level similarity features in combination with an attention mechanism to construct a word similarity matrix includes:
Acquiring the test question texts in the test question ecological library, obtaining the frequency and co-occurrence relations of each word in the test question texts through a word co-occurrence matrix, and constructing a global word co-occurrence count matrix;
averaging the attention weights of the sub-word (byte-level) encodings under each word to obtain the attention weight corresponding to each word;
based on the attention weight corresponding to each word, determining semantic associations between words through the Division distance function to obtain the word similarity matrix.
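As a rough illustration of the co-occurrence analysis in the SemGloVe module, the sketch below builds a windowed word co-occurrence count matrix and derives a pairwise word similarity. The window size and the cosine measure are assumptions, since the embodiment does not specify the exact form of the Division distance function:

```python
# Illustrative sketch of a global word co-occurrence count matrix over a
# sliding window, plus a cosine similarity between co-occurrence profiles.
import math
from collections import defaultdict

def cooccurrence_matrix(tokenized_texts, window=2):
    """Count how often each word pair appears within `window` positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in tokenized_texts:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[w][tokens[j]] += 1
    return counts

def word_similarity(counts, w1, w2):
    """Cosine similarity between two words' co-occurrence profiles."""
    keys = set(counts[w1]) | set(counts[w2])
    dot = sum(counts[w1][k] * counts[w2][k] for k in keys)
    n1 = math.sqrt(sum(v * v for v in counts[w1].values()))
    n2 = math.sqrt(sum(v * v for v in counts[w2].values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Two words sharing the same co-occurrence neighbors score close to 1.0, which is the behavior the word similarity matrix relies on.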
In an embodiment, the step of integrating the word similarity matrix through the preset BERT module to perform multi-level semantic representation and generating the word embedding vector with semantic information includes:
Integrating the word similarity matrix into a multi-head attention mechanism of the BERT model to calculate corresponding attention weights through input expression vectors;
And integrating the attention weights corresponding to the attention heads to obtain attention output, and performing linear transformation on the attention output to obtain word embedding vectors with semantic information.
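One plausible reading of folding the word similarity matrix into the BERT module's multi-head attention is to add it as a bias on the attention logits; the embodiment does not fix the exact integration, so the additive form below is an assumption:

```python
# Minimal numpy sketch: scaled dot-product attention with a word-similarity
# matrix added to the attention logits before the softmax.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_biased_attention(Q, K, V, word_sim):
    """Q, K, V: (n_words, d); word_sim: (n_words, n_words) similarity bias."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + word_sim  # bias logits with similarity
    weights = softmax(logits, axis=-1)
    return weights @ V, weights
```

In a full multi-head layer this would run per head, with the per-head outputs concatenated and linearly transformed into the word embedding vectors, as the step above describes.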
In one embodiment, the step of constructing, by the graph neural network module, an adjacency matrix according to the word embedding vector, and aggregating and generating sentence vectors based on adjacency node information in the adjacency matrix includes:
constructing an adjacency matrix according to the word embedding vector, and constructing a graph structure representation according to the adjacency matrix;
Adding corresponding relative position codes to each node in the graph structure representation to obtain a graph structure with position information;
performing graph convolution operation based on the graph structure with the position information to aggregate adjacent node information of each node and obtain updated node embedding;
and aggregating the updated nodes for embedding to generate sentence vectors.
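The graph-module steps above can be sketched as follows; building the adjacency matrix by thresholding cosine similarity between word embeddings, and mean-pooling the updated node embeddings into a sentence vector, are illustrative assumptions:

```python
# Sketch: adjacency matrix from word embeddings, one symmetric-normalized
# graph-convolution step, then mean-pooling nodes into a sentence vector.
import numpy as np

def gcn_sentence_vector(X, threshold=0.5):
    """X: (n_words, dim) word embeddings -> (dim,) sentence vector."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = unit @ unit.T
    A = (cos > threshold).astype(float)            # adjacency from similarity
    np.fill_diagonal(A, 0.0)
    A_hat = A + np.eye(len(X))                     # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X, 0)  # one GCN step + ReLU
    return H.mean(axis=0)                          # aggregate nodes -> sentence
```

A learned weight matrix and the relative position encodings mentioned in the embodiment are omitted here for brevity.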
In an embodiment, the step of inputting the test question ecological base into a preset syntax similarity model to obtain the syntax similarity score of each test question in the test question ecological base includes:
inputting the test question ecological library into a preset syntax similarity model, wherein the preset syntax similarity model comprises a Chinese word segmentation module, a part-of-speech judgment and terminology module, and a PT tree kernel module;
segmenting, through the Chinese word segmentation module, the test question text into word sequences according to the key words and context in the test question information;
labeling, through the part-of-speech judgment and terminology module, the part of speech of each word in the word sequence, and expressing each labeled word as terms to obtain a plurality of term triples;
in the PT tree kernel module, constructing a parse tree (CPT) based on each term triple, calculating node similarity through the PT kernel, and obtaining the syntax similarity score according to the similarity calculation result.
In an embodiment, the Chinese word segmentation module is constructed based on a CNN-BiGRU-CRF composite neural network model, and the step of segmenting the test question text into word sequences according to the key words and the context in the test question information by the Chinese word segmentation module includes:
mapping the text sequence into a word vector matrix through the CNN layer and the embedding layer to obtain the word vector matrix corresponding to the test question text;
capturing context information in the test question text through the BiGRU layer, and converting the word vector matrix into a sequence representation;
And obtaining the dependency relationship of each label in the sequence representation through a CRF layer so as to optimize the prediction result of the sequence labeling task and obtain the word sequence.
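The CRF layer's role of enforcing label dependencies comes down to decoding the best tag path over a transition matrix. The sketch below is a pure-numpy Viterbi decoder over a B/M/E/S segmentation tag set, which is an assumed scheme since the embodiment does not name one:

```python
# Viterbi decoding sketch for a CRF output layer: combine per-position
# emission scores with tag-transition scores and recover the best tag path.
import numpy as np

TAGS = ["B", "M", "E", "S"]  # assumed word-segmentation tag scheme

def viterbi(emissions, transitions):
    """emissions: (T, n_tags) scores; transitions: (n_tags, n_tags) scores."""
    T, n = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]  # (prev, next)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [TAGS[i] for i in reversed(path)]
```

In a trained CNN-BiGRU-CRF model the emissions come from the BiGRU layer and the transition matrix is learned; here both are supplied directly.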
In an embodiment, the step of labeling, through the part-of-speech judgment and terminology module, the part of speech of each word in the word sequence, and expressing the labeled words as terms to obtain a plurality of term triples includes:
Performing part-of-speech tagging on each word in the word sequence by adopting a natural language processing tool kit to obtain a word sequence after part-of-speech tagging;
Confirming a plurality of key terms based on the word sequence after part-of-speech tagging, and performing concept mapping on each key term to obtain a term list after the concept mapping;
based on the term list after concept mapping, constructing semantic vector representation of each key term through a dynamic vector space model;
Calculating the semantic similarity among the key terms, and integrating the key terms according to the semantic similarity among the key terms to form a joint term set, wherein the joint term set consists of a plurality of term triples.
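One assumed reading of the joint-term-set step is that each key term is grouped with its most similar terms (by cosine over the term vectors) into a triple; treating a "term triple" as (term, nearest neighbor, second-nearest neighbor) is illustrative, not the application's definition:

```python
# Sketch: group key terms into triples by cosine similarity of term vectors.
import numpy as np

def term_triples(terms, vectors):
    """terms: list of n strings; vectors: (n, dim) semantic vectors."""
    V = np.asarray(vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    sim = V @ V.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    triples = []
    for i, t in enumerate(terms):
        nn = np.argsort(sim[i])[::-1][:2]   # two most similar terms
        triples.append((t, terms[nn[0]], terms[nn[1]]))
    return triples
```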
In an embodiment, the step of obtaining a multi-modal similarity score according to the semantic similarity score and the syntax similarity score, comparing the multi-modal similarity score with a preset threshold, and respectively importing each test question into an auditing module or a question bank storage module according to the comparison result includes:
Nonlinear fusion is carried out on the semantic similarity score and the syntactic similarity score based on a dynamic weighting mechanism, and a multi-modal similarity score is generated;
Comparing the multi-modal similarity score with a preset threshold;
If the multi-modal similarity score is higher than the preset threshold, importing the test question into the auditing module;
and if the multi-modal similarity score is not higher than the preset threshold, importing the test question into the question bank storage module.
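The fusion and routing steps above can be sketched as follows; the sigmoid gating form of the dynamic weight is an illustrative assumption, as the embodiment only states that the fusion is a nonlinear dynamic weighting:

```python
# Sketch: dynamically weighted nonlinear fusion of the semantic and syntax
# similarity scores, followed by threshold-based routing.
import math

def multimodal_score(sem, syn, gate_bias=0.0):
    """Weight leans toward whichever score is larger (an assumed gating)."""
    alpha = 1.0 / (1.0 + math.exp(-(sem - syn + gate_bias)))  # dynamic weight
    return alpha * sem + (1.0 - alpha) * syn

def route(sem, syn, threshold=0.85):
    """Return which module receives the test question."""
    score = multimodal_score(sem, syn)
    return "audit" if score > threshold else "store"
```

A high fused score indicates a likely duplicate and routes the question to manual auditing; otherwise it enters the question bank directly.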
The application discloses a test question management method based on multimodal adaptive similarity learning. The method collects an original test question set, generates an evolution test question set through a preset large language model, generates a variation test question set through data enhancement technology, and performs fusion processing on the three sets to form a test question ecological library; the library is then input into a preset semantic similarity model and a preset syntax similarity model, respectively, to obtain the semantic similarity score and syntax similarity score of each test question; finally, a multi-modal similarity score is obtained from the two scores and compared with a preset threshold, and each test question is imported into an auditing module or a question bank storage module according to the comparison result. By fusing the original, evolution and variation test question sets into a test question ecological library, the application enriches the diversity and coverage of test question resources; by evaluating semantic and syntactic similarity with dedicated models, it measures question similarity comprehensively and improves screening accuracy; and by comparing the multi-modal similarity score with a preset threshold, it realizes intelligent distribution of test questions and ensures the timeliness and quality of the question bank.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a first embodiment of a test question management method based on multi-mode adaptive similarity learning according to the present application;
FIG. 2 is a schematic diagram of an original test question set collection process;
FIG. 3 is a schematic diagram of an adaptive similarity fusion process;
FIG. 4 is a schematic flow chart of a second embodiment of the test question management method based on multi-mode adaptive similarity learning according to the present application;
FIG. 5 is a schematic diagram of a module structure of the semantic similarity model SemBert-GCN;
FIG. 6 is a schematic flow chart of a third embodiment of a test question management method based on multi-modal adaptive similarity learning according to the present application;
FIG. 7 is a schematic block diagram of a syntactic similarity model TE-PTK;
FIG. 8 is a full-flow schematic diagram of the test question management method based on multi-mode adaptive similarity learning according to the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
Referring to fig. 1, fig. 1 is a flow chart of a first embodiment of a test question management method based on multi-mode adaptive similarity learning, in this embodiment, the method includes steps S10 to S30:
step S10, collecting an original test question set, generating an evolution test question set through a preset large language model, generating a variation test question set by adopting a data enhancement technology, and carrying out fusion treatment on the original test question set, the evolution test question set and the variation test question set to form a test question ecological base.
It should be noted that, in the scenario of building an intelligent question bank, the execution body of this embodiment may be a computing device with data processing, network communication and program execution capabilities, such as a tablet computer, a personal computer, a mobile phone or a question bank server, or another electronic device capable of implementing the same functions.
It should be understood that the original test question set may be public test question data collected from the network; for example, original test question data are obtained from structured data sources such as official university websites and MOOC platforms through automated crawler technology, using specific technical means such as API calls and web-page parsing (BeautifulSoup/Selenium).
It will be appreciated that a data collection range may be preset to ensure that data collection stays within a legal, ethical and efficient framework. For example, to ensure legitimacy, explicit authorization from the data owner or the relevant regulatory agency must be obtained. Meanwhile, the robots.txt protocol should be strictly adhered to, focusing on web pages with public information and avoiding sensitive or proprietary data. The storage and use of data should ensure security and transparency and be limited to predefined purposes only.
Specifically, referring to fig. 2, fig. 2 is a schematic diagram of the original test question set collection process. As shown in FIG. 2, when collecting the original test question set, web page crawling technology can be used to gather an artificial intelligence test question corpus. In this process, a scripted system is adopted to acquire links to the relevant pages, and Selenium's WebDriver together with Python HTML-parsing libraries (Requests and BeautifulSoup) are used to parse the web page contents and organize them into a structured data frame. After test question information extraction is completed, the information is compiled into a single data frame and stored as a CSV file of about 10 MB. Taking the server load into account, the data acquisition rate is set to one request every 5-10 seconds.
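The collection pipeline just described can be sketched with standard-library tools; the `li.question` markup, the CSV layout, and the `QuestionParser` helper are all illustrative assumptions about the source pages, not details from the application:

```python
# Sketch of the crawl-parse-store pipeline: fetch pages, extract question
# items, and persist them as CSV, throttled to one request every 5-10 s.
import csv
import random
import time
from html.parser import HTMLParser
from urllib.request import urlopen

class QuestionParser(HTMLParser):
    """Collects the text of <li class="question"> elements (assumed markup)."""
    def __init__(self):
        super().__init__()
        self.in_q = False
        self.questions = []
    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "question") in attrs:
            self.in_q = True
            self.questions.append("")
    def handle_endtag(self, tag):
        if tag == "li":
            self.in_q = False
    def handle_data(self, data):
        if self.in_q:
            self.questions[-1] += data.strip()

def parse_questions(html: str):
    parser = QuestionParser()
    parser.feed(html)
    return parser.questions

def crawl(urls, out_csv="questions.csv"):
    rows = []
    for url in urls:
        with urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        rows += [(url, q) for q in parse_questions(html)]
        time.sleep(random.uniform(5, 10))  # throttle, respecting server load
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text"])
        writer.writerows(rows)
```

In practice Selenium would replace `urlopen` for JavaScript-rendered pages, as the text notes.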
Then, the collected original test question data can be processed to ensure the quality and uniformity of the generated corpus. A standardized terminology system can be introduced to normalize core entities such as program names, department names and course codes, eliminating confusion caused by naming differences and enhancing data consistency and comparability. In addition, a series of data-consistency verification processes can be applied: through cross-validation and careful comparison against the original website information, irregular and conflicting entries in the corpus are identified and corrected, ensuring the accuracy and completeness of the collected original test question data.
Furthermore, the test questions can be finely classified and labeled with multiple labels according to dimensions such as their nature, type, field and specific knowledge points, forming an original test question set Set_ori with "gene coding".
It should be noted that the preset large language model may be constructed from general LLMs such as GPT-4 and PaLM, or from a special model fine-tuned on education-domain corpora. A set of refined prompting templates can be designed to guide the LLMs in mining the original test questions and their known information, so as to form an evolution test question set Set_evo.
By way of example, the specific requirements of the prompt template may be: 1) objectively and accurately answer the known information about the given test question; 2) ensure that the answer is refined and complete, no more than four sentences; 3) generate new test questions containing the important semantic information; 4) do not change the knowledge point features covered by the original test question when generating new questions. The term "objectively and accurately" is intended to reduce hallucinations and ambiguity in the LLMs; requirement 3 is intended to keep the semantic information of the questions generated by the LLMs uncorrupted; and the answer should contain no irrelevant information.
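A prompt template in the spirit of the four constraints listed above might look like the following; the wording is illustrative, not the application's actual template:

```python
# Hypothetical prompt template encoding the four constraints described above.
PROMPT_TEMPLATE = """You are an exam-question designer.
Given the test question below, first answer its known information objectively
and accurately, in at most four sentences. Then generate a new test question
that keeps the important semantic information and covers exactly the same
knowledge points as the original. Do not add irrelevant information.

Original question: {question}"""

def build_prompt(question: str) -> str:
    """Fill the template with one original test question."""
    return PROMPT_TEMPLATE.format(question=question)
```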
The data enhancement technique can include operations such as synonym substitution, sentence pattern recombination and conditional variation. Controllably modifying the original test question set with these different enhancement means enriches its data diversity and yields a variation test question set Set_mut.
In the data enhancement process, the method can not only adjust the difficulty gradient, change conditions and simulate situations for questions under the same knowledge point to generate hierarchical variants, but also create new question scenarios by introducing cross-disciplinary elements or combining current hot topics, further widening question coverage. In addition, synonym and paraphrase replacement strategies can be used to modify sentence expression and increase sentence diversity, while sentence swaps and random word-order swaps change the word order and sentence structure, enriching the presentation of test questions.
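A minimal mutation sketch for the synonym-replacement strategy is shown below; the toy synonym table is illustrative, whereas a production system would draw on a curated thesaurus as the text describes:

```python
# Sketch: synonym-replacement mutation over a test question. Words found in
# the (hypothetical) synonym table are swapped; all others are kept as-is.
import random

SYNONYMS = {
    "compute": ["calculate", "evaluate"],
    "maximum": ["largest", "peak"],
}

def mutate(question: str, rng=None):
    """Replace each known word with a randomly chosen synonym."""
    rng = rng or random.Random(0)
    words = question.split()
    out = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(out)
```

Sentence-pattern recombination and conditional variation would be layered on top of this in the same controllable fashion.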
It should be appreciated that the knowledge of the test questions can be mined in depth using a large language model as the "natural choice" engine. The data augmentation technology is introduced to simulate genetic variation, so that the diversity of test question data can be enriched. The original test question set, the evolution test question set generated by the large model and the variation test question set after data enhancement are fused, so that a test question ecological library with a structure can be formed.
Further, the test question ecological base can be standardized by data cleaning and filtering technology. This process involves identifying and eliminating text fragments that fit particular patterns, such as removing HTML tags, eliminating non-text characters (e.g., special symbols and emoticons), trimming excess spaces and line breaks, and normalizing the text encoding format. In addition, proper processing of numbers and special formats is included, as well as stop word filtering.
The method comprises the steps of firstly setting a data acquisition range, acquiring an original test question set in a data source based on the data acquisition range, then constructing a prompting template containing semantic constraint conditions, guiding a preset large language model through the prompting template to generate an evolution test question set based on the original test question set, expanding and mutating the original test question set through a data enhancement mode of synonym replacement, sentence pattern recombination and knowledge point expansion to generate a variation test question set, and finally fusing the original test question set, the evolution test question set and the variation test question set to obtain fused test question data, and performing deduplication processing on the fused test question data to form a standardized test question ecological base.
And S20, respectively inputting the test question ecological base into a preset semantic similarity model and a preset syntax similarity model to obtain the semantic similarity score and the syntax similarity score of each test question in the test question ecological base.
It should be noted that, the test question ecological library may be respectively put into a preset semantic similarity model SemBert-GCN and a preset syntax similarity model TE-PTK, so as to obtain the similarity of the test questions on the semantic and syntax structure levels.
It should be understood that the semantic similarity model SemBert-GCN is used for obtaining the similarity score S_g of the test question at the semantic level, and the syntactic similarity model TE-PTK is used for obtaining the similarity score S_s of the test question at the syntactic level.
Wherein SemBert-GCN may include a SemGloVe module, a preset BERT module and a graph convolutional network (GCN) module. The SemGloVe module takes the standardized and structured test question ecological library as input and maps the vocabulary to a high-dimensional vector space to capture global semantic relations among the words; the preset BERT module generates a score matrix containing rich semantic features and token embeddings by integrating the word similarity matrix. The GCN module takes the score matrix and the semantics-carrying token embeddings to obtain vector representations of similar test questions, and the vector representations are then subjected to pooling and activation to obtain the sentence-level semantic similarity score S_g.
The TE-PTK model can comprise a Chinese word segmentation module, a part-of-speech judgment and terminology module and a PT tree kernel module. The Chinese word segmentation module takes a multi-attribute-label test question information sequence as input, adopts a CNN-BiGRU-CRF network model, and cuts the gapless character sequence into a word sequence with definite meaning and boundaries according to the key words and context in the test question information. The part-of-speech judgment and terminology module can automatically assign a corresponding grammatical function label to each vocabulary item by utilizing word segmentation tools such as Stanford CoreNLP, so as to realize accurate recognition of the part of speech. In order to solve the ambiguity problem of text terms, the terms can be conceptualized by adopting the CN-Probase knowledge graph developed by the Knowledge Works laboratory of Fudan University, and terms carrying useless information can be screened out, so that the clarity and accuracy of the terms in sentences are ensured. The method can use terms as basic semantic units to construct a constituency parse tree of the short text, enhance the semantic expressiveness of the words, and calculate the number of common substructures between two trees by using the PT kernel, so as to obtain the syntactic similarity score S_s of the test question text.
And step S30, obtaining a multi-modal similarity score according to the semantic similarity score and the syntactic similarity score, comparing the multi-modal similarity score with a preset threshold value, and respectively importing each test question into an auditing module or a question library storage module according to a comparison result.
It should be understood that after the semantic similarity score S_g and the syntactic similarity score S_s are obtained, the contributions of the different similarity scores can be dynamically adjusted by using the weight coefficients α, β, δ based on an adaptive equalization policy, and an interaction term φ(S_g, S_s) is introduced to capture the interplay between the different similarity scores, so as to realize the effective fusion of the semantic similarity and the syntactic similarity and thereby calculate the multi-modal similarity score Score of the test question. Referring to fig. 3, fig. 3 is a schematic diagram of an adaptive similarity fusion process, and based on fig. 3, the calculation formula of the multi-modal similarity score may be:
Score = Softmax(α·σ(S_g) + β·σ(S_s) + δ·φ(S_g, S_s))    (1)
Where α, β, δ are weight coefficients for balancing the importance of the different similarity scores and the interaction term, and satisfy α + β + δ = 1; σ(x) is the Sigmoid function, used to limit each score to between 0 and 1. The interaction term φ(S_g, S_s), which captures the interplay between the scores of different modalities, can be defined as φ(S_g, S_s) = Tanh(μ1·S_g + μ2·S_s), where μ1, μ2 are parameters for adjusting the interaction. By means of the Tanh function, the formula captures not only the individual contribution of each modality's similarity but also the complex relationships between the similarities of the different modalities.
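By way of illustration, the weighted fusion with a Tanh interaction term can be sketched in a few lines (a minimal sketch only; the weight values α = β = 0.4, δ = 0.2 and the interaction parameters μ1 = μ2 = 0.5 are arbitrary illustrative choices, not values prescribed by the method, and the Softmax is applied across a batch of candidate questions):

```python
import math

def fuse_scores(sg, ss, alpha=0.4, beta=0.4, delta=0.2, mu1=0.5, mu2=0.5):
    """Weighted fusion of the semantic score sg and syntactic score ss:
    alpha*sigmoid(sg) + beta*sigmoid(ss) + delta*tanh(mu1*sg + mu2*ss)."""
    assert abs(alpha + beta + delta - 1.0) < 1e-9  # weights must sum to 1

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    interaction = math.tanh(mu1 * sg + mu2 * ss)   # phi(S_g, S_s)
    return alpha * sigmoid(sg) + beta * sigmoid(ss) + delta * interaction

def softmax(xs):
    """Normalize the fused scores across a batch of candidate questions."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

raw = [fuse_scores(0.9, 0.8), fuse_scores(0.2, 0.1), fuse_scores(0.5, 0.6)]
scores = softmax(raw)
```

Because the Softmax normalizes across the batch, the fused scores always sum to 1, and the most similar candidate keeps the largest score after fusion.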
It should be noted that the preset threshold may be a preset similarity threshold μ. Based on the multi-mode similarity Score, the dynamic threshold value distribution module can be used for effectively screening the warehousing qualification of the test questions. If the multi-modal similarity Score of the test question exceeds the preset similarity threshold μ, the test question will be accurately imported into the test question auditing module for further evaluation or "extinction". Otherwise, the data is included in the question bank for storage.
In a specific implementation, the semantic similarity score and the syntactic similarity score are subjected to nonlinear fusion based on a dynamic weighting mechanism to generate a comprehensive similarity score, the comprehensive similarity score is compared with a preset threshold, the test questions are imported into an auditing module if the comprehensive similarity score is higher than the preset threshold, and the test questions are imported into a question bank storage module if the comprehensive similarity score is not higher than the preset threshold.
According to the method, the device and the system, the test question ecological library is built by fusing the original test question set, the evolution test question set and the variation test question set, diversity and coverage of test question resources are enriched, semantic similarity and syntactic similarity of the test questions are respectively estimated by adopting a semantic similarity model and a syntactic similarity model, similarity of the test questions is comprehensively measured, screening accuracy is improved, intelligent distribution of the test questions is achieved through comparison of multi-mode similarity scores and a preset threshold value, and timeliness and quality of the test question library are guaranteed.
In the second embodiment of the present application, the same or similar content as in the first embodiment of the present application may be referred to the above description, and will not be repeated. On this basis, please refer to fig. 4, fig. 4 is a flowchart illustrating a second embodiment of the test question management method based on multi-mode adaptive similarity learning according to the present application.
In this embodiment, in order to illustrate a specific process of obtaining a semantic similarity score through a preset semantic similarity model, the step of inputting the test question ecological base into the preset semantic similarity model to obtain the semantic similarity score of each test question in the test question ecological base includes steps S201 to S205:
Step S201, inputting the test question ecological base into a preset semantic similarity model, wherein the preset semantic similarity model comprises a SemGloVe module, a preset BERT module, a graph neural network module and a similarity scoring module.
It should be understood that the structure of the preset semantic similarity model SemBert-GCN of the present embodiment may be described herein in conjunction with fig. 5, and fig. 5 is a schematic diagram of the module structure of the semantic similarity model SemBert-GCN.
Step S202, in the SemGloVe module, inter-vocabulary semantic associations of the test question text are established through word co-occurrence matrix analysis, and word-level similarity features are extracted in combination with an attention mechanism to construct a word similarity matrix.
It should be understood that the SemGloVe module takes the standardized "test question ecological library" as input and captures vocabulary frequencies and the co-occurrence relations between words by analyzing a global Word-to-Word co-occurrence count matrix X. In order to understand the relations between words more intuitively, the original BPE-to-BPE attention weights are converted into Word-to-Word attention weights; the conversion averages and aggregates the attention weights of all BPE tokens belonging to the same word, thereby obtaining the attention weight AW ∈ R^{K×K} of each word with respect to the other words, and the distance Div(w_i, w_j) between a target word w_i and a context word w_j is determined by using a Division distance function.
The data processing procedure of the SemGloVe module can be also described herein with reference to fig. 5, and the data processing procedure of the SemGloVe module is as follows:
First, semGloVe takes a standardized "test question ecological library" as input, and captures the vocabulary frequency and the co-occurrence relation between them by analyzing the global Word-to-Word co-occurrence count matrix X. Its term X i,j represents the total number of occurrences of a Word w j e V in the context of a Word w i e V, where V is the vocabulary of the training corpus, for the separation of global Word-to-Word co-occurrence counts w i and w j (i.e. their distance between texts), semGloVe specifies that X i,j is:
Where P_i and P_j are the positions of w_i and w_j in the context; the closer a word is to w_i, the greater the weight obtained.
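By way of illustration, the distance-weighted co-occurrence counting can be sketched as follows (the reciprocal-distance weighting 1/|p_i − p_j| per co-occurring pair is assumed from the statement that closer words obtain greater weight, and the window size is an illustrative choice):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Distance-weighted Word-to-Word co-occurrence: each pair (w_i, w_j)
    inside the window adds 1/|i - j|, so closer context words obtain a
    larger weight."""
    X = defaultdict(float)
    for i, w_i in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[(w_i, tokens[j])] += 1.0 / abs(i - j)
    return X

# toy corpus of one sentence; "the" occurs twice, at distances 1 and 2 from "model"
X = cooccurrence_counts("the model scores the test question".split())
```
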
Next, for a given word sequence W = {w_1, …, w_K}, BERT converts each word (or fragment of a word) into BPE (Byte Pair Encoding) tokens, because BERT, when processing text, breaks words down into multiple BPE units or byte pairs for encoding in order to capture finer semantic information. In order to understand the relationships between words more intuitively, the SemGloVe module converts the original BPE-to-BPE attention weights into Word-to-Word attention weights; the conversion averages and aggregates the attention weights of all BPE tokens belonging to the same word, thereby obtaining the attention weight of each word with respect to the other words. For the generated Word-to-Word attention weights AW ∈ R^{K×K} and the local window context w_j of word w_i, j ∈ [i−S, i+S], the attention weight AW_{i,j} from word w_i to word w_j can be expressed as:

AW_{i,j} = (1 / (m·n)) Σ_{k=1}^{m} Σ_{l=1}^{n} AT(k, l)    (3)
Where m and n are the numbers of subwords of w_i and w_j, respectively, and AT(k, l) represents the attention weight from BPE token t_k to t_l. The AW_{i,j} are arranged in descending order and the first s words are selected as the context words C(w_i) of w_i, so as to exclude semantically meaningless words.
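By way of illustration, the BPE-to-Word aggregation step can be sketched as follows (the attention matrix and word-to-subword spans are toy values; a real model would take them from BERT's attention heads):

```python
def word_attention(at, word_spans):
    """Convert a BPE-level attention matrix `at` into Word-to-Word weights
    AW[i][j] by averaging over all m*n subword pairs of words i and j."""
    K = len(word_spans)
    aw = [[0.0] * K for _ in range(K)]
    for i, span_i in enumerate(word_spans):
        for j, span_j in enumerate(word_spans):
            total = sum(at[k][l] for k in span_i for l in span_j)
            aw[i][j] = total / (len(span_i) * len(span_j))
    return aw

# toy example: word 0 covers BPE tokens {0, 1}, word 1 covers BPE token {2}
at = [[0.5, 0.3, 0.2],
      [0.1, 0.6, 0.3],
      [0.4, 0.4, 0.2]]
aw = word_attention(at, [[0, 1], [2]])
```
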
Finally, the distance Div (w i,wj) between the target word w i and the context word w j is determined by a Division distance function.
In a specific implementation, the SemGloVe module can acquire test question texts in the test question ecological library, acquire the vocabulary frequency and the vocabulary co-occurrence relation of each vocabulary in the test question texts through the word co-occurrence matrix, construct a global word co-occurrence count matrix, average aggregate attention weights corresponding to codes of each byte under each vocabulary to acquire the attention weights corresponding to each vocabulary, and accordingly determine semantic association among the vocabularies through a Division distance function based on the attention weights corresponding to each vocabulary to acquire a word similarity matrix.
Step S203, integrating the word similarity matrix through a preset BERT module to perform multi-level semantic characterization, and generating word embedding vectors with semantic information.
It should be understood that the preset BERT module may be a fine-tuned BERT model: by fine-tuning a baseline BERT model, the word similarity matrix S constructed from the original population S_1 and the evolutionary population S_2 generated by the LLMs is integrated into the multi-head attention mechanism of BERT. With the similarity matrix S as additional input information, the BERT model can evaluate the semantic relationships among words more accurately in its self-attention layers, thereby improving the model's understanding of complex semantic structures. The multi-head attention mechanism of BERT generates an output vector MultiHead(Θ, K, Ψ) by linearly transforming the query (Θ), key (K) and value (Ψ) vectors, applying scaled dot-product attention, and then splicing and linearly transforming the results of the multiple "heads" once more. Since the model injects the word similarity matrix S to compute a Hadamard product, the BERT attention using the scaled dot-product calculation can be denoted as Attention(Θ, K, Ψ).
The data processing procedure of the preset BERT module can be described here in conjunction with fig. 5, and the data processing procedure of the preset BERT module is as follows:
Firstly, by fine-tuning a baseline BERT model, the word similarity matrix S = {p_{1,1}, …, p_{i,j}, …, p_{l,l}} constructed from the original population S_1 = {p_1, …, p_i, …, p_l} and the evolutionary population S_2 = {p_1, …, p_i, …, p_l} generated by the LLMs is integrated into the multi-head attention mechanism of BERT. With the similarity matrix S as additional input information, the BERT model can evaluate the semantic relations among words more accurately in its self-attention layers, thereby improving the model's understanding of complex semantic structures. The multi-head attention mechanism of BERT works by linearly transforming the query (Θ), key (K) and value (Ψ) vectors and then applying scaled dot-product attention; finally, the results of the multiple "heads" are spliced and linearly transformed once more to generate the output vector, which can be expressed as:
MultiHead(Θ, K, Ψ) = Concat(head_1, …, head_h)W^O    (5)
Wherein head_i = Attention(ΘW_i^Θ, KW_i^K, ΨW_i^Ψ); W_i^Θ, W_i^K and W_i^Ψ are the parameter matrices corresponding to the query, key and value of the i-th attention head, respectively, and W^O is the weight matrix applied when the attention heads are spliced.
Finally, since the model injects the word similarity matrix S to compute a Hadamard product, so that the model focuses more on word pairs with higher similarity in the sentence pair, the BERT attention is expressed as:
Scores=ΘKT*S+MASK (6)
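By way of illustration, injecting a word similarity matrix into scaled dot-product attention via a Hadamard product can be sketched as follows (identity matrices stand in for the learned query and key projections, so this is a toy of the mechanism rather than the trained model; the explicit row-wise softmax and 1/√d scaling are standard attention steps):

```python
import numpy as np

def sim_injected_attention(Q, K, S, mask=None):
    """Scaled dot-product attention whose raw scores are modulated by a
    word similarity matrix S through a Hadamard product, before a
    row-wise softmax."""
    d = Q.shape[-1]
    scores = (Q @ K.T) * S / np.sqrt(d)   # Hadamard product with S
    if mask is not None:
        scores = scores + mask            # additive mask on padded positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

Q = np.eye(3)        # identity stands in for the learned query projection
K = np.eye(3)        # identity stands in for the learned key projection
S = np.ones((3, 3))  # uniform similarity leaves the scores unchanged
W = sim_injected_attention(Q, K, S)
```

Each row of the resulting attention matrix sums to 1, and higher entries in S amplify the attention paid to the corresponding word pairs.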
In a specific implementation, the preset BERT module can integrate the word similarity matrix into the multi-head attention mechanism of the BERT model to calculate the corresponding attention weights from the input representation vectors, integrate the attention weights corresponding to all attention heads to obtain the attention output, and perform a linear transformation on the attention output to obtain word embedding vectors carrying semantic information.
And S204, constructing an adjacency matrix according to the word embedding vector through the graph neural network module, and generating sentence vectors based on the adjacency node information aggregation in the adjacency matrix.
It should be appreciated that the word embedding h_i of each token carrying semantic information obtained by BERT is taken as input and passed to the subsequent GCN model. Unlike the standard GCN model, the score matrix Scores is used as the adjacency matrix A_{i,j}, each token serves as a node in the GCN, and relative position codes are added to the GCN so that the relative position information of the tokens can be learned. Based on the adjacency matrix A_{i,j}, for a given node i, the GCN gathers the relevant semantic information that its context words carry in A_{i,j} and represents it by computing the output of node i. After processing by the GCN module, the vector h_i' of each token is obtained, and the sentence vector h_s is obtained after average pooling.
The data processing procedure of the graph convolutional network (GCN) module can be described herein with reference to fig. 5, and the data processing procedure of the GCN module is as follows:
Firstly, the word embedding h_i of each token carrying semantic information obtained by BERT is taken as input and passed into the subsequent GCN model. Unlike the standard GCN model, the score matrix Scores is used as the adjacency matrix A_{i,j}, each token serves as a node in the GCN, and relative position codes are added to the GCN so that the relative position information of the tokens can be learned. Based on the adjacency matrix A_{i,j}, for a given node i, the GCN gathers the relevant semantic information that its context words carry in A_{i,j} and represents it by computing the output of node i:

h_i' = ReLU(Σ_j A_{i,j} W h_j + b)    (7)
Secondly, after processing by the GCN module, the vector h_i' of each token is obtained, and the vector of the sentence is obtained after average pooling:

h_s = (1/m) Σ_{i=1}^{m} h_i'    (8)
Where m is the total number of tokens, i.e. the length of the sentence sequence.
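By way of illustration, one graph-convolution aggregation step followed by average pooling can be sketched as follows (a generic single GCN layer with ReLU and no bias is assumed for simplicity; the method's layer, with its relative position codes, is more elaborate):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: every node aggregates its neighbours'
    token vectors weighted by the (score-matrix) adjacency A, then a
    linear map and ReLU are applied."""
    return np.maximum(A @ H @ W, 0.0)

def sentence_vector(H):
    """Average pooling over the m token vectors to obtain h_s."""
    return H.mean(axis=0)

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 tokens, dim 2
A = np.full((3, 3), 1.0 / 3.0)                      # toy adjacency from attention scores
W = np.eye(2)                                       # identity in place of learned weights
hs = sentence_vector(gcn_layer(H, A, W))
```
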
In a specific implementation, the GCN module can construct an adjacency matrix according to the word embedding vector and construct a graph structure representation according to the adjacency matrix, add corresponding relative position codes to nodes in the graph structure representation to obtain a graph structure with position information, perform graph convolution operation based on the graph structure with the position information to aggregate adjacent node information of the nodes to obtain updated node embeddings, aggregate the updated node embeddings and generate sentence vectors.
Step S205, in the similarity scoring module, the sentence vectors are processed through a fully connected layer and a Softmax layer in sequence, so as to obtain the semantic similarity score of each test question.
In a specific implementation, the sentence vector h_s obtained after pooling is processed through a fully connected layer and then through a Softmax layer, finally yielding the sentence semantic similarity score S_g.
In the embodiment, the semantic similarity model captures global semantic relations of words, semantic association among words and semantic information of sentences layer by layer through a SemGloVe module, a post-fine-tuning BERT module and a GCN module, so that a semantic similarity score is finally output. The method is favorable for accurately evaluating the semantic similarity of the test questions, capturing deep semantic relations, screening out high-quality and non-repeated test questions, improving the quality of a question bank and optimizing intelligent group paper.
In the third embodiment of the present application, the same or similar contents as those of the first, second or the first embodiment can be referred to the description above, and the description is omitted. On this basis, please refer to fig. 6, fig. 6 is a flowchart illustrating a third embodiment of the test question management method based on multi-mode adaptive similarity learning according to the present application.
In this embodiment, in order to illustrate a specific process of obtaining a syntax similarity score through a preset syntax similarity model, the step of inputting the test question ecological base into the preset syntax similarity model to obtain the syntax similarity score of each test question in the test question ecological base includes steps S206-S209:
Step S206, inputting the test question ecological base into a preset syntax similarity model, wherein the preset syntax similarity model comprises a Chinese word segmentation module, a part-of-speech judgment and terminology module and a PT tree kernel module.
It should be understood that, here, the structure of the preset syntax similarity model TE-PTK of the present embodiment may be described in conjunction with fig. 7, and fig. 7 is a schematic diagram of the module structure of the syntax similarity model TE-PTK.
And S207, segmenting the test question text into word sequences according to the key words and the context in the test question information through the Chinese word segmentation module.
It should be noted that the Chinese word segmentation module adopts a CNN-BiGRU-CRF network model to execute the word segmentation task on the test question information sequence, and mainly comprises a CNN layer, a BiGRU layer and a CRF layer. The module takes the multi-attribute-label test question information sequence as input, performs Chinese word segmentation with the CNN-BiGRU-CRF network model, and cuts the gapless character sequence into a word sequence with definite meaning and boundaries according to the key words and context in the test question information.
The data processing procedure of the chinese word segmentation module can be described herein with reference to fig. 7, and the data processing procedure of the chinese word segmentation module is as follows:
Firstly, the CNN layer, combined with the embedding layer, maps the text sequence into a word vector matrix C ∈ R^{k×d}, where each word vector x_i ∈ R^d, k is the sentence length and d is the word vector dimension; the adaptive learning capability of the convolutional feature extractor is used to effectively identify and capture the deep features E = [h_1, h_2, …, h_n] in the text data.
Secondly, the Bi-GRU layer adopts a bidirectional gated recurrent unit network to accurately capture the temporal dependencies of the text, including long-term dependencies and context information. Through Bi-GRU layer processing, the deep feature vector E = [h_1, h_2, …, h_n] in the text data is converted into a sequence representation M = {m_1, …, m_i, …, m_n}, where m_i is the i-th input character vector of the CRF layer.
Finally, the CRF layer optimizes the prediction result of the sequence labeling task by considering the dependency relationships among the labels in the sequence. Let the tag sequence be Y = {y_1, y_2, …, y_n}, and let Y(M) be the set of possible tag sequences for M; the CRF conditional probability P(Y|M) is expressed as:

P(Y|M) = exp(Σ_{t=1}^{n} φ(y_{t−1}, y_t, M)) / Σ_{Y'∈Y(M)} exp(Σ_{t=1}^{n} φ(y'_{t−1}, y'_t, M))    (9)

Wherein the function φ(y_{t−1}, y_t, M) calculates the score of the input tag sequence Y as φ(y_{t−1}, y_t, M) = A_{y_{t−1}, y_t} + P_{t, y_t}; A is the label transition score matrix, A_{y_i, y_{i+1}} is the score for going from label y_i to y_{i+1}, and the larger its value, the greater the transition probability; P ∈ R^{n×k} is the score matrix output by the Bi-GRU network, and P_{i, y_i} is the score of assigning label y_i to the i-th character. The corresponding loss function, obtained by maximum likelihood estimation through the CRF, can be expressed as:

Loss = −Σ_{i=1}^{N} log P(Y_i | M_i)    (10)
Wherein N is the number of the trained marked sentences, and Y i is the real tag sequence of the ith sentence.
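By way of illustration, the CRF conditional probability can be sketched with toy scores (brute-force enumeration of all tag paths replaces the forward algorithm, and the emission and transition scores are arbitrary values rather than the output of a trained CNN-BiGRU network):

```python
import math
from itertools import product

import numpy as np

def path_score(emissions, transitions, tags):
    """Score of one tag path: emission scores P[t, y_t] plus transition
    scores A[y_{t-1}, y_t]."""
    s = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        s += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return s

def crf_log_prob(emissions, transitions, tags):
    """log P(Y|M): the path score minus the log-partition over all tag
    paths (brute force; real CRFs use the forward algorithm)."""
    n, k = emissions.shape
    log_z = np.logaddexp.reduce([path_score(emissions, transitions, list(p))
                                 for p in product(range(k), repeat=n)])
    return path_score(emissions, transitions, tags) - log_z

E = np.array([[0.5, 1.0], [0.2, 0.3]])  # emission scores: n=2 characters, k=2 tags
A = np.array([[0.1, 0.4], [0.3, 0.2]])  # tag-to-tag transition scores
total = sum(math.exp(crf_log_prob(E, A, [y0, y1]))
            for y0 in range(2) for y1 in range(2))
```

Since the partition function normalizes over all paths, the probabilities of the four possible tag sequences sum to 1, and the path with the larger combined emission and transition score receives the larger probability.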
In the specific implementation, a Chinese word segmentation module maps a text sequence into a word vector matrix through a CNN layer and an embedding layer to obtain a word vector matrix corresponding to a test question text, captures context information in the test question text through a BIGRU layer, converts the word vector matrix into a sequence representation, and obtains the dependency relationship of each label in the sequence representation through a CRF layer to optimize the prediction result of a sequence labeling task to obtain a word sequence.
And step S208, marking the part of speech of each word in the word sequence through the part-of-speech judgment and terminology module, and performing terminology representation on each marked word to obtain a plurality of term triples.
It should be understood that, after an accurate word segmentation result is obtained, the part-of-speech judgment and terminology module may add linguistic annotations such as part-of-speech tags to the text through the Stanford CoreNLP toolkit, and define concept classes and instance classes to distinguish noun terms. Given that short texts mainly consist of terms, semantic vectors are constructed through a dynamic vector space model: term sets are extracted from the short texts, each term is expressed in the form of a triple (term, part-of-speech tag, concept), and the term sets are combined to form a joint term set. A semantic vector for each term is constructed based on term similarity, while the smooth inverse frequency is adopted as the attention weight to emphasize the different contributions of different terms to the meaning of the short text.
The data processing procedure of the part-of-speech judgment and terminology module can be described herein with reference to fig. 7, and the data processing procedure of the part-of-speech judgment and terminology module is as follows:
Firstly, after a relatively accurate word segmentation result is obtained from the Chinese word segmentation module, the part-of-speech judgment and terminology module can use the natural language processing toolkit Stanford CoreNLP to determine the part of speech of each word, and can generate linguistic annotations for the text by calling the corenlp library, including part-of-speech tagging, boundary recognition of sentences and tokens, named entity recognition, quotation attribution and relationship recognition, etc. To explicitly distinguish the parts of speech of noun terms, type(t) is defined to divide terms into concept classes and instance classes:
type(t) = instance, if freq(e)/|E_t| ≥ freq(c)/|C_t|; otherwise type(t) = concept    (11)

Where E_t / C_t is the set of instances/concepts of term t, freq(e) / freq(c) is the frequency of the instance/concept, and |·| is the number of terms in the set.
Next, considering that short texts mainly consist of terms, it is reasonable and effective to use terms to express the meaning of a test question. Aiming at the sparsity problem of short texts, a dynamic vector space model is introduced to construct the semantic vector of a short text. The term sets T_1 and T_2 are extracted from the two short texts S_1 and S_2, respectively, each term being represented in the term set in the form of a triple (term, part-of-speech tag, concept), and then the semantic vectors of S_1 and S_2 are constructed. The term sets T_1 and T_2 are first combined to form the joint term set T, and then a semantic vector dimension is constructed for each term in T based on term similarity. Taking S_1 as an example, the i-th dimension of the semantic vector is calculated as follows:
semantic_vector_i = sim(vector_1, vector_2) · W_1 · W_2    (14)
If the term does not belong to T_1, the semantic similarity between its vector and each item in T_1 is calculated, and the score of the item in T_1 with the highest similarity to the vector is selected; if the term belongs to T_1, the score is 1. Here sim(vector_1, vector_2) is the cosine similarity function of the two vectors.
Finally, since different terms contribute differently to the short text (for example, a stop word contributes little to its overall meaning), the attention weight W emphasizes a term's contribution with the smooth inverse frequency (SIF), which can be expressed by the following formula:

W = a / (a + L(node))    (15)

Wherein a is the smoothing parameter and L(node) is the word frequency of the node.
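By way of illustration, the per-dimension construction with cosine similarity and SIF attention weights can be sketched as follows (the smoothing parameter a = 1e-3 is an assumed default, and applying both terms' weights multiplicatively follows the semantic_vector_i formula above; taking the best cosine match also covers the in-set case, since a term matches itself with similarity 1):

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 for a zero vector)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def sif_weight(freq, a=1e-3):
    """W = a / (a + L(node)): frequent terms (e.g. stop words) get small weights."""
    return a / (a + freq)

def semantic_dim(term_vec, own_term_vecs, w1, w2):
    """One dimension of a short text's semantic vector: the best cosine
    match against the text's own terms, scaled by the two SIF weights."""
    sim = max(cosine(term_vec, v) for v in own_term_vecs)
    return sim * w1 * w2
```
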
In a specific implementation, a part-of-speech judgment and terminology module carries out part-of-speech labeling on each word in a word sequence by adopting a natural language processing tool kit to obtain a word sequence after part-of-speech labeling, confirms a plurality of key terms based on the word sequence after part-of-speech labeling, carries out conceptual mapping on each key term to obtain a term list after conceptual mapping, constructs semantic vector representation of each key term through a dynamic vector space model based on the term list after conceptual mapping, calculates semantic similarity among each key term, integrates each key term according to the semantic similarity among each key term to form a joint term set, and the joint term set consists of a plurality of term triples.
In step S209, in the PT tree kernel building module, a constituency parse tree CPT is constructed based on each term triple, node similarity is calculated through the PT kernel, and the syntactic similarity score is obtained according to the similarity calculation result.
It should be understood that the PT tree kernel building module may accurately build the CPT of a short text by using terms as the basic semantic units, calculate the similarity of corresponding nodes in the CPTs by using the PT tree kernel (PTK), and accumulate and normalize the similarity scores of all nodes to obtain a score S_s that comprehensively reflects the similarity of the syntactic structures of the two sentences.
The data processing procedure of the PT tree kernel building module can also be described herein with reference to fig. 7, and is as follows:
First, a constituency parse tree (CPT) can be constructed by the PT tree kernel building module; the CPT reveals the phrase structures in a short text and their hierarchical syntactic relationships, and the tree structure effectively exposes the structural information of the text. Since a single word may not be sufficient to convey complete semantics and a short text may not fully follow standard written grammar, traditional word segmentation methods may not guarantee the accuracy of the CPT. Therefore, the CPT of a short text can be constructed more accurately by using terms as the basic semantic units.
Next, the syntactic similarity is calculated using the PT kernel. Similarly to other tree kernel calculation methods, the PT tree kernel function PTK between the trees T_1 and T_2 is defined as:

PTK(T_1, T_2) = Σ_{node_1 ∈ N_{T_1}} Σ_{node_2 ∈ N_{T_2}} Δ(node_1, node_2)    (16)
Wherein, T_1 and T_2 are the CPTs of S_1 and S_2 respectively, N_{T_1} and N_{T_2} are the node sets of T_1 and T_2, and Δ(node_1, node_2) is the number of common fragments rooted at node_1 and node_2, which is the core of the tree kernel. Evaluating the common PTs rooted at nodes node_1 and node_2 requires selecting shared subsets of the two nodes' children; in view of the importance of their order in the syntactic structure, these are generated using a subsequence kernel method.
In the PT kernel, Δ(node_1, node_2) is given by the following formula:

Δ(node_1, node_2) = α(β² + Σ_{p=1}^{p_min} Δ_p(c_{node_1}, c_{node_2}))    (17)

Where α and β are attenuation factors, α penalizing the height of the tree and β the length of the subsequence; c_{node_1} and c_{node_2} are the ordered child subsequences of node_1 and node_2; p_min returns the minimum sequence length between c_{node_1} and c_{node_2}; and Δ_p calculates the number of common subtrees rooted in subsequences of length p. To avoid introducing too much noise, semantic information is integrated into the tree kernel: when node_1 and node_2 are both leaf nodes, the similarity is calculated using equation (14) so that semantic information is taken into account.
To better understand the above formula, $\Delta_p$ is expressed as a recursive function:

$$\Delta_p(n_1 a, n_2 b) = \Delta(a, b) \sum_{i=1}^{|n_1|} \sum_{r=1}^{|n_2|} \beta^{|n_1|-i+|n_2|-r}\, \Delta_{p-1}(n_1[1:i],\, n_2[1:r])$$

where $n_1 a$ and $n_2 b$ are the ordered child sequences $c_{node_1}$ and $c_{node_2}$, with $a$ and $b$ their last elements and $n_1$, $n_2$ the remaining prefixes; $n_1[1:i]$ and $n_2[1:r]$ denote the subsequences from 1 to $i$ in $n_1$ and from 1 to $r$ in $n_2$, respectively; and $\Delta_{p-1}$ is calculated recursively, stopping when a leaf node is reached.
It will be appreciated that to calculate the syntactic similarity of two test questions, a CPT is first constructed for each sentence, and the similarity of corresponding nodes in the trees is then calculated using equation (17). The similarity scores of all nodes are then accumulated and normalized to obtain a composite score $S_s$ reflecting the syntactic structural similarity of the two sentences.
In a specific implementation, the PT tree core building module builds a corresponding syntax parse tree CPT for each test question sentence based on its term triples, calculates the similarity of corresponding nodes across the parse trees through a preset common-fragment counting formula, and normalizes the node similarities to obtain the syntactic similarity score between test questions.
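The PT kernel computation described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the decay values for α and β, the gap penalty, and the plain label match at leaf nodes (where the patent instead plugs in the semantic similarity of its equation (14)) are all assumptions made for the example.

```python
import itertools
import math
from dataclasses import dataclass, field

# Illustrative decay factors; the patent does not fix concrete values.
ALPHA = 0.4  # attenuation for tree height
BETA = 0.4   # attenuation for subsequence length

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def delta(n1, n2):
    """Number of common fragments rooted at n1 and n2 (cf. equation (17))."""
    if n1.label != n2.label:
        return 0.0
    if not n1.children and not n2.children:
        # Leaf nodes: the patent integrates semantic similarity here;
        # a plain label match is used as a stand-in.
        return ALPHA * BETA ** 2
    p_min = min(len(n1.children), len(n2.children))
    return ALPHA * (BETA ** 2 + sum(
        delta_p(p, n1.children, n2.children) for p in range(1, p_min + 1)))

def delta_p(p, c1, c2):
    """Common subtrees rooted in child subsequences of length p."""
    total = 0.0
    for I in itertools.combinations(range(len(c1)), p):
        for J in itertools.combinations(range(len(c2)), p):
            prod = 1.0
            for i, j in zip(I, J):
                d = delta(c1[i], c2[j])
                if d == 0.0:
                    prod = 0.0
                    break
                prod *= d
            if prod:
                # Penalize gaps inside either subsequence.
                span = (I[-1] - I[0] - p + 1) + (J[-1] - J[0] - p + 1)
                total += (BETA ** span) * prod
    return total

def collect(tree):
    nodes, stack = [], [tree]
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    return nodes

def ptk(t1, t2):
    """Equation (16): sum delta over all node pairs of the two CPTs."""
    return sum(delta(a, b) for a in collect(t1) for b in collect(t2))

def syntactic_score(t1, t2):
    """Normalized kernel value in [0, 1], used as the score S_s."""
    return ptk(t1, t2) / math.sqrt(ptk(t1, t1) * ptk(t2, t2))
```

On two toy parse trees, `syntactic_score` returns 1.0 for identical trees and a value strictly between 0 and 1 for trees that share structure but differ in a leaf, matching the normalization step described above.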
In this embodiment, the syntactic similarity model processes the test question information layer by layer through the Chinese word segmentation module, the part-of-speech judgment and terminology module, and the PT tree core building module, moving from word sequences to terms and then to similarity computation over the syntactic structure, and finally outputs a syntactic similarity score. This helps focus on the syntactic structure of the test questions, accurately evaluate structural similarity, identify potential structural problems, enhance the expressive diversity of the question bank, and support personalized recommendation.
In addition, the overall flow of the present application is described below with reference to fig. 8, which is a schematic diagram of the overall flow of the test question management method based on multimodal adaptive similarity learning of the present application.
As can be seen from fig. 8, the whole process can be divided into an initial construction stage, an evolution stage, a screening stage and an elimination stage.
In the initial construction stage, artificial-intelligence test question corpus information is collected from open-source network resources, and the test question information is given clear gene codes through fine-grained classification and multi-label annotation, yielding the original test question set.
In the evolution stage, an evolved test question set is generated through a preset large language model, and a variant test question set is generated with data enhancement techniques; the original test question set, the evolved test question set and the variant test question set are then fused to form the test question ecological library.
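The variant-generation and fusion steps of the evolution stage can be sketched as below. The synonym table, the single-word mutation policy, and the order-preserving deduplication are illustrative assumptions; a real system would draw replacements from a thesaurus, a domain glossary, or the large language model itself.

```python
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "describe": ["explain", "outline"],
    "algorithm": ["method", "procedure"],
}

def mutate(question, table, rng):
    """Synonym-replacement mutation: swap one known word for a synonym."""
    words = question.split()
    hits = [i for i, w in enumerate(words) if w in table]
    if not hits:
        return question
    i = rng.choice(hits)
    words[i] = rng.choice(table[words[i]])
    return " ".join(words)

def build_ecological_library(original, evolved, mutated):
    """Fuse the three sets and deduplicate while preserving order."""
    seen, library = set(), []
    for q in original + evolved + mutated:
        if q not in seen:
            seen.add(q)
            library.append(q)
    return library
```

For example, `mutate("describe the gradient descent algorithm", SYNONYMS, random.Random(42))` yields a paraphrased variant, and `build_ecological_library` merges it with the original and evolved sets without duplicates.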
Next, the test question data in the test question ecological library undergoes text preprocessing and is fed into the preset semantic similarity model SemBert-GCN and the preset syntactic similarity model TE-PTK, respectively. In SemBert-GCN, the data passes sequentially through the SemGloVe module, the preset BERT module and the graph neural network (GCN) module, which outputs the semantic similarity score $S_g$; in TE-PTK, it passes sequentially through the Chinese word segmentation module, the part-of-speech judgment and terminology module, and the PT tree core building module, which outputs the syntactic similarity score $S_s$.
The semantic similarity score $S_g$ and the syntactic similarity score $S_s$ are input into the adaptive similarity fusion module ASIM, which applies an adaptive equalization strategy to fuse the multimodal features of the test questions and obtain the multimodal similarity score.
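The fusion step can be sketched as follows. The patent does not disclose the exact form of its adaptive equalization strategy here, so the softmax-style weighting below, including the temperature parameter `tau`, is purely an illustrative stand-in for a nonlinear, dynamically weighted fusion.

```python
import math

def asim_fuse(s_g, s_s, tau=0.5):
    """Adaptively fuse the semantic score S_g and syntactic score S_s.

    A softmax over the two scores shifts weight toward the larger
    (more confident) signal; this specific rule is an assumption,
    not the patent's disclosed formula.
    """
    wg = math.exp(s_g / tau)
    ws = math.exp(s_s / tau)
    w = wg / (wg + ws)
    return w * s_g + (1 - w) * s_s
```

With equal inputs the fused score equals them; with unequal inputs it lies between the two but above their plain average, reflecting the adaptive weighting.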
In the screening stage, the dynamic threshold allocation module screens test questions for admission to the question bank based on the multimodal similarity score obtained in the preceding stage. If the multimodal similarity score of a test question exceeds the preset similarity threshold $\mu$, the test question is routed to the test question review module for further evaluation or deletion; otherwise, it is included in the question bank.
Through this overall flow, the method deeply fuses the semantic and syntactic information of test questions, accurately measures their similarity, automatically screens out repeated and low-quality test questions, and ensures the freshness and quality of the question bank content, thereby achieving precise screening and quality improvement of test question resources, optimizing the structure of the question bank and better meeting the high standards of intelligent education development.
It should be noted that the foregoing examples are only intended to aid understanding of the present application and do not limit the test question management method based on multimodal adaptive similarity learning of the present application; simple transformations in further forms made on the basis of this technical idea remain within the protection scope of the present application.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises that element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments. The foregoing are only some embodiments of the present application and do not limit its scope; all equivalent structural changes made using the description and drawings of the present application under its technical concept, or direct/indirect applications in other related technical fields, are likewise included within the scope of the present application.

Claims (10)

1. A test question management method based on multimodal adaptive similarity learning, characterized in that the method comprises:
collecting an original test question set, generating an evolved test question set through a preset large language model, generating a variant test question set using data enhancement techniques, and fusing the original test question set, the evolved test question set and the variant test question set to form a test question ecological library;
inputting the test question ecological library into a preset semantic similarity model and a preset syntactic similarity model respectively, to obtain a semantic similarity score and a syntactic similarity score for each test question in the test question ecological library;
obtaining a multimodal similarity score from the semantic similarity score and the syntactic similarity score, comparing the multimodal similarity score with a preset threshold, and importing each test question into a review module or a question bank storage module according to the comparison result.

2. The method according to claim 1, characterized in that the step of collecting the original test question set, generating the evolved test question set through the preset large language model, generating the variant test question set using data enhancement techniques, and fusing the original, evolved and variant test question sets to form the test question ecological library comprises:
setting a data collection scope, and collecting the original test question set from data sources based on the data collection scope;
constructing a prompt template containing semantic constraints, and using the prompt template to guide the preset large language model to generate the evolved test question set based on the original test question set;
expanding and mutating the original test question set through the data enhancement methods of synonym replacement, sentence-pattern reorganization and knowledge point expansion, to generate the variant test question set;
fusing the original, evolved and variant test question sets to obtain fused test question data, and deduplicating the fused test question data to form a standardized test question ecological library.

3. The method according to claim 1, characterized in that the step of inputting the test question ecological library into the preset semantic similarity model to obtain the semantic similarity score of each test question comprises:
inputting the test question ecological library into the preset semantic similarity model, which comprises a SemGloVe module, a preset BERT module, a graph neural network module and a similarity scoring module;
in the SemGloVe module, establishing inter-word semantic associations of the test question text through word co-occurrence matrix analysis, and extracting word-level similarity features in combination with an attention mechanism to construct a word similarity matrix;
integrating the word similarity matrix through the preset BERT module for multi-level semantic representation, generating word embedding vectors carrying semantic information;
constructing an adjacency matrix from the word embedding vectors through the graph neural network module, and generating sentence vectors by aggregating adjacent-node information in the adjacency matrix;
in the similarity scoring module, processing the sentence vectors sequentially through a fully connected layer and a Softmax layer to obtain the semantic similarity score of each test question.

4. The method according to claim 3, characterized in that the step of establishing inter-word semantic associations through word co-occurrence matrix analysis and extracting word-level similarity features with an attention mechanism to construct the word similarity matrix comprises:
obtaining the test question text from the test question ecological library, deriving the word frequency and word co-occurrence relationships of each word through the word co-occurrence matrix, and constructing a global word co-occurrence count matrix;
averaging and aggregating the attention weights of the byte-pair encodings under each word to obtain the attention weight of each word;
determining the semantic associations between words through a Division distance function based on the attention weights of the words, obtaining the word similarity matrix.

5. The method according to claim 3, characterized in that the step of integrating the word similarity matrix through the preset BERT module for multi-level semantic representation to generate word embedding vectors carrying semantic information comprises:
integrating the word similarity matrix into the multi-head attention mechanism of the BERT model, so as to compute the corresponding attention weights from the input representation vectors;
integrating the attention weights of all attention heads to obtain the attention output, and linearly transforming the attention output to obtain word embedding vectors carrying semantic information.

6. The method according to claim 3, characterized in that the step of constructing the adjacency matrix from the word embedding vectors through the graph neural network module and generating sentence vectors by aggregating adjacent-node information comprises:
constructing an adjacency matrix from the word embedding vectors, and building a graph structure representation from the adjacency matrix;
adding corresponding relative position encodings to each node in the graph structure representation, obtaining a graph structure with position information;
performing graph convolution on the graph structure with position information to aggregate the adjacent-node information of each node, obtaining updated node embeddings;
aggregating the updated node embeddings to generate the sentence vector.

7. The method according to claim 1, characterized in that the step of inputting the test question ecological library into the preset syntactic similarity model to obtain the syntactic similarity score of each test question comprises:
inputting the test question ecological library into the preset syntactic similarity model, which comprises a Chinese word segmentation module, a part-of-speech judgment and terminology module, and a PT tree core building module;
through the Chinese word segmentation module, segmenting the test question text into word sequences according to the key words and context in the test question information;
through the part-of-speech judgment and terminology module, tagging each word in the word sequence with its part of speech and representing the tagged words as terms, obtaining a number of term triples;
in the PT tree core building module, constructing a syntax parse tree CPT based on the term triples, calculating node similarity through the PT kernel, and obtaining the syntactic similarity score from the similarity calculation results.

8. The method according to claim 7, characterized in that the Chinese word segmentation module is built on a CNN-BiGRU-CRF composite neural network model, and the step of segmenting the test question text into word sequences according to the key words and context in the test question information comprises:
mapping the text sequence into a word vector matrix through a CNN layer combined with an embedding layer, obtaining the word vector matrix corresponding to the test question text;
capturing contextual information in the test question text through a BiGRU layer, converting the word vector matrix into a sequence representation;
capturing the dependencies between labels in the sequence representation through a CRF layer to optimize the predictions of the sequence labeling task, obtaining the word sequence.

9. The method according to claim 7, characterized in that the step of tagging each word in the word sequence with its part of speech and representing the tagged words as terms to obtain term triples comprises:
performing part-of-speech tagging on each word in the word sequence with a natural language processing toolkit, obtaining the part-of-speech-tagged word sequence;
identifying a number of key terms from the part-of-speech-tagged word sequence, and performing concept mapping on each key term to obtain a concept-mapped term list;
constructing semantic vector representations of the key terms through a dynamic vector space model based on the concept-mapped term list;
calculating the semantic similarity between the key terms, and integrating the key terms according to these similarities to form a joint term set composed of a number of term triples.

10. The method according to claim 1, characterized in that the step of obtaining the multimodal similarity score from the semantic similarity score and the syntactic similarity score, comparing it with the preset threshold, and importing each test question into the review module or the question bank storage module according to the comparison result comprises:
performing nonlinear fusion of the semantic similarity score and the syntactic similarity score based on a dynamic weighting mechanism, generating the multimodal similarity score;
comparing the multimodal similarity score with the preset threshold;
if the multimodal similarity score is higher than the preset threshold, importing the test question into the review module;
if the multimodal similarity score is not higher than the preset threshold, importing the test question into the question bank storage module.
CN202510434754.6A 2025-04-08 2025-04-08 Test question management method based on multimodal adaptive similarity learning Pending CN120336505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510434754.6A CN120336505A (en) 2025-04-08 2025-04-08 Test question management method based on multimodal adaptive similarity learning

Publications (1)

Publication Number Publication Date
CN120336505A true CN120336505A (en) 2025-07-18

Family

ID=96364498


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121210953A (en) * 2025-11-27 2025-12-26 杭州元语智能科技有限公司 Evaluation data synthesis system integrating multi-model collaborative questions and multi-strategy filtering



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination