
CN120336505A - Test question management method based on multimodal adaptive similarity learning - Google Patents

Test question management method based on multimodal adaptive similarity learning

Info

Publication number
CN120336505A
Authority
CN
China
Prior art keywords
test question
word
similarity
module
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510434754.6A
Other languages
Chinese (zh)
Inventor
刘雄华
张杰
杨洋
陈德
何顶新
闵捷
姚珊珊
李娟
刘海
刘婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Technology and Business University
Original Assignee
Wuhan Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Technology and Business University filed Critical Wuhan Technology and Business University
Priority to CN202510434754.6A
Publication of CN120336505A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Educational Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Mathematical Analysis (AREA)
  • Animal Behavior & Ethology (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a test question management method based on multimodal adaptive similarity learning, relating to the technical field of data management. The method collects an original test question set, generates an evolution test question set through a preset large language model, generates a variation test question set using data enhancement techniques, and fuses the three sets into a test question ecological library, enriching the diversity and coverage of test question resources. The test question ecological library is then fed into a preset semantic similarity model and a preset syntax similarity model to obtain a semantic similarity score and a syntax similarity score for each test question, which helps measure question similarity comprehensively and improves screening accuracy. A multimodal similarity score obtained from the semantic and syntax similarity scores is compared with a preset threshold, and each test question is imported into either an auditing module or a question bank storage module according to the comparison result, enabling intelligent distribution of test questions and safeguarding the timeliness and quality of the question bank.

Description

Test question management method based on multimodal adaptive similarity learning
Technical Field
The application relates to the technical field of data management, and in particular to a test question management method based on multimodal adaptive similarity learning.
Background
With the rapid development of artificial intelligence and big data technology, the intelligent transformation of the education field is advancing rapidly, and intelligent question banks play a key role in this process. By automating the management of test question data, intelligent question banks not only optimize the distribution of educational resources but also promote personalized teaching, playing an important role in improving teaching quality, ensuring educational fairness, and meeting diverse educational demands.
Although intelligent question bank management technology has made remarkable progress, challenges remain in accurately evaluating question similarity and optimizing the resource warehousing process when facing massive question data sets. Therefore, to ensure the timeliness and accuracy of question bank resources and reduce knowledge point redundancy, it is necessary to perform multimodal similarity analysis on the questions and refine the question warehousing process.
Disclosure of Invention
The main aim of the application is to provide a test question management method based on multimodal adaptive similarity learning, intended to solve the technical problems that, in existing test question management, question similarity is difficult to evaluate accurately, which affects the update timeliness and the quality of questions in the question bank.
In order to achieve the above purpose, the present application provides a test question management method based on multi-mode adaptive similarity learning, the method comprising:
collecting an original test question set, generating an evolution test question set through a preset large language model, generating a variation test question set by adopting a data enhancement technology, and performing fusion processing on the original test question set, the evolution test question set and the variation test question set to form a test question ecological library;
Respectively inputting the test question ecological library into a preset semantic similarity model and a preset syntax similarity model to obtain semantic similarity scores and syntax similarity scores of all the test questions in the test question ecological library;
And obtaining a multi-modal similarity score according to the semantic similarity score and the syntax similarity score, comparing the multi-modal similarity score with a preset threshold value, and respectively importing each test question into an auditing module or a question bank storage module according to the comparison result.
In an embodiment, the step of collecting an original test question set, generating an evolution test question set by a preset large language model, generating a variation test question set by a data enhancement technology, and performing fusion processing on the original test question set, the evolution test question set and the variation test question set to form a test question ecological library includes:
Setting a data acquisition range, and acquiring an original test question set in a data source based on the data acquisition range;
Constructing a prompting template containing semantic constraint conditions, and guiding a preset large language model to generate an evolution test question set based on an original test question set through the prompting template;
Expanding and mutating the original test question set by a data enhancement mode of synonym replacement, sentence pattern recombination and knowledge point expansion to generate a variation test question set;
and fusing the original test question set, the evolution test question set and the variation test question set to obtain fused test question data, and performing deduplication processing on the fused test question data to form a standardized test question ecological library.
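The fusion and deduplication step described above can be illustrated with a minimal Python sketch; the normalization rule used as the duplicate key here is an illustrative assumption, not the application's exact procedure:

```python
# Hypothetical sketch: fuse the original, evolution and variation question
# sets, then drop near-identical items by a normalized-text key.
import re

def normalize(text: str) -> str:
    """Collapse case, whitespace and punctuation so trivial variants collide."""
    return re.sub(r"[^\w\u4e00-\u9fff]+", "", text.lower())

def build_ecological_library(ori, evo, mut):
    """Fuse the three question sets and keep the first copy of each duplicate."""
    seen, library = set(), []
    for question in [*ori, *evo, *mut]:
        key = normalize(question)
        if key and key not in seen:
            seen.add(key)
            library.append(question)
    return library
```

A production system would likely also apply the semantic deduplication described later, rather than relying on surface normalization alone.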
In an embodiment, the step of inputting the test question ecological base into a preset semantic similarity model to obtain the semantic similarity score of each test question in the test question ecological base includes:
inputting the test question ecological library into a preset semantic similarity model, wherein the preset semantic similarity model comprises a SemGloVe module, a preset BERT module, a graph neural network module and a similarity scoring module;
in the SemGloVe module, establishing inter-vocabulary semantic associations of the test question text through word co-occurrence matrix analysis, and extracting word-level similarity features in combination with an attention mechanism to construct a word similarity matrix;
Integrating word similarity matrixes through a preset BERT module to perform multi-level semantic characterization, and generating word embedding vectors with semantic information;
Constructing an adjacency matrix according to the word embedding vector through the graph neural network module, and generating sentence vectors based on adjacency node information aggregation in the adjacency matrix;
And in the similarity scoring module, sequentially processing the sentence vectors through a fully connected layer and a Softmax layer to obtain the semantic similarity score of each test question.
In one embodiment, in the SemGloVe module, the step of establishing inter-vocabulary semantic association of the test question text through word co-occurrence matrix analysis and extracting word-level similarity features in combination with an attention mechanism to construct a word similarity matrix includes:
Acquiring the test question texts in the test question ecological library, obtaining the frequency and co-occurrence relations of each word in the test question texts through a word co-occurrence matrix, and constructing a global word co-occurrence count matrix;
averaging the attention weights of the sub-word (byte-level) encodings under each word to obtain the attention weight corresponding to each word;
based on the attention weight corresponding to each word, determining semantic associations between words through the Division distance function to obtain the word similarity matrix.
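As a rough illustration of the co-occurrence analysis in the SemGloVe module, the sketch below builds a windowed word co-occurrence count matrix and derives a pairwise word similarity. The window size and the cosine measure are assumptions, since the embodiment does not specify the exact form of the Division distance function:

```python
# Illustrative sketch of a global word co-occurrence count matrix over a
# sliding window, plus a cosine similarity between co-occurrence profiles.
import math
from collections import defaultdict

def cooccurrence_matrix(tokenized_texts, window=2):
    """Count how often each word pair appears within `window` positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in tokenized_texts:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[w][tokens[j]] += 1
    return counts

def word_similarity(counts, w1, w2):
    """Cosine similarity between two words' co-occurrence profiles."""
    keys = set(counts[w1]) | set(counts[w2])
    dot = sum(counts[w1][k] * counts[w2][k] for k in keys)
    n1 = math.sqrt(sum(v * v for v in counts[w1].values()))
    n2 = math.sqrt(sum(v * v for v in counts[w2].values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Two words sharing the same co-occurrence neighbors score close to 1.0, which is the behavior the word similarity matrix relies on.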
In an embodiment, the step of integrating the word similarity matrix through the preset BERT module to perform multi-level semantic representation and generating the word embedding vector with semantic information includes:
Integrating the word similarity matrix into a multi-head attention mechanism of the BERT model to calculate corresponding attention weights through input expression vectors;
And integrating the attention weights corresponding to the attention heads to obtain attention output, and performing linear transformation on the attention output to obtain word embedding vectors with semantic information.
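One plausible reading of folding the word similarity matrix into the BERT module's multi-head attention is to add it as a bias on the attention logits; the embodiment does not fix the exact integration, so the additive form below is an assumption:

```python
# Minimal numpy sketch: scaled dot-product attention with a word-similarity
# matrix added to the attention logits before the softmax.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_biased_attention(Q, K, V, word_sim):
    """Q, K, V: (n_words, d); word_sim: (n_words, n_words) similarity bias."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + word_sim  # bias logits with similarity
    weights = softmax(logits, axis=-1)
    return weights @ V, weights
```

In a full multi-head layer this would run per head, with the per-head outputs concatenated and linearly transformed into the word embedding vectors, as the step above describes.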
In one embodiment, the step of constructing, by the graph neural network module, an adjacency matrix according to the word embedding vector, and aggregating and generating sentence vectors based on adjacency node information in the adjacency matrix includes:
constructing an adjacency matrix according to the word embedding vector, and constructing a graph structure representation according to the adjacency matrix;
Adding corresponding relative position codes to each node in the graph structure representation to obtain a graph structure with position information;
performing graph convolution operation based on the graph structure with the position information to aggregate adjacent node information of each node and obtain updated node embedding;
and aggregating the updated nodes for embedding to generate sentence vectors.
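The graph-module steps above can be sketched as follows; building the adjacency matrix by thresholding cosine similarity between word embeddings, and mean-pooling the updated node embeddings into a sentence vector, are illustrative assumptions:

```python
# Sketch: adjacency matrix from word embeddings, one symmetric-normalized
# graph-convolution step, then mean-pooling nodes into a sentence vector.
import numpy as np

def gcn_sentence_vector(X, threshold=0.5):
    """X: (n_words, dim) word embeddings -> (dim,) sentence vector."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = unit @ unit.T
    A = (cos > threshold).astype(float)            # adjacency from similarity
    np.fill_diagonal(A, 0.0)
    A_hat = A + np.eye(len(X))                     # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X, 0)  # one GCN step + ReLU
    return H.mean(axis=0)                          # aggregate nodes -> sentence
```

A learned weight matrix and the relative position encodings mentioned in the embodiment are omitted here for brevity.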
In an embodiment, the step of inputting the test question ecological base into a preset syntax similarity model to obtain the syntax similarity score of each test question in the test question ecological base includes:
inputting the test question ecological library into a preset syntax similarity model, wherein the preset syntax similarity model comprises a Chinese word segmentation module, a part-of-speech judgment and terminology module, and a PT tree kernel module;
segmenting, through the Chinese word segmentation module, the test question text into word sequences according to the key words and context in the test question information;
labeling, through the part-of-speech judgment and terminology module, the part of speech of each word in the word sequence, and expressing each labeled word as terms to obtain a plurality of term triples;
in the PT tree kernel module, constructing a parse tree (CPT) based on each term triple, calculating node similarity through the PT kernel, and obtaining the syntax similarity score according to the similarity calculation result.
In an embodiment, the Chinese word segmentation module is constructed based on a CNN-BiGRU-CRF composite neural network model, and the step of segmenting the test question text into word sequences according to the key words and the context in the test question information by the Chinese word segmentation module includes:
mapping the text sequence into a word vector matrix through the CNN layer and the embedding layer to obtain the word vector matrix corresponding to the test question text;
capturing context information in the test question text through the BiGRU layer, and converting the word vector matrix into a sequence representation;
And obtaining the dependency relationship of each label in the sequence representation through a CRF layer so as to optimize the prediction result of the sequence labeling task and obtain the word sequence.
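The CRF layer's role of enforcing label dependencies comes down to decoding the best tag path over a transition matrix. The sketch below is a pure-numpy Viterbi decoder over a B/M/E/S segmentation tag set, which is an assumed scheme since the embodiment does not name one:

```python
# Viterbi decoding sketch for a CRF output layer: combine per-position
# emission scores with tag-transition scores and recover the best tag path.
import numpy as np

TAGS = ["B", "M", "E", "S"]  # assumed word-segmentation tag scheme

def viterbi(emissions, transitions):
    """emissions: (T, n_tags) scores; transitions: (n_tags, n_tags) scores."""
    T, n = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]  # (prev, next)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [TAGS[i] for i in reversed(path)]
```

In a trained CNN-BiGRU-CRF model the emissions come from the BiGRU layer and the transition matrix is learned; here both are supplied directly.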
In an embodiment, the step of labeling, through the part-of-speech judgment and terminology module, the part of speech of each word in the word sequence, and expressing the labeled words as terms to obtain a plurality of term triples includes:
Performing part-of-speech tagging on each word in the word sequence by adopting a natural language processing tool kit to obtain a word sequence after part-of-speech tagging;
Confirming a plurality of key terms based on the word sequence after part-of-speech tagging, and performing concept mapping on each key term to obtain a term list after the concept mapping;
based on the term list after concept mapping, constructing semantic vector representation of each key term through a dynamic vector space model;
Calculating the semantic similarity among the key terms, and integrating the key terms according to the semantic similarity among the key terms to form a joint term set, wherein the joint term set consists of a plurality of term triples.
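One assumed reading of the joint-term-set step is that each key term is grouped with its most similar terms (by cosine over the term vectors) into a triple; treating a "term triple" as (term, nearest neighbor, second-nearest neighbor) is illustrative, not the application's definition:

```python
# Sketch: group key terms into triples by cosine similarity of term vectors.
import numpy as np

def term_triples(terms, vectors):
    """terms: list of n strings; vectors: (n, dim) semantic vectors."""
    V = np.asarray(vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    sim = V @ V.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-similarity
    triples = []
    for i, t in enumerate(terms):
        nn = np.argsort(sim[i])[::-1][:2]   # two most similar terms
        triples.append((t, terms[nn[0]], terms[nn[1]]))
    return triples
```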
In an embodiment, the step of obtaining a multi-modal similarity score according to the semantic similarity score and the syntax similarity score, comparing the multi-modal similarity score with a preset threshold, and respectively importing each test question into an auditing module or a question bank storage module according to the comparison result includes:
Nonlinear fusion is carried out on the semantic similarity score and the syntactic similarity score based on a dynamic weighting mechanism, and a multi-modal similarity score is generated;
Comparing the multi-modal similarity score with a preset threshold;
If the multi-modal similarity score is higher than the preset threshold, importing the test question into the auditing module;
and if the multi-modal similarity score is not higher than the preset threshold, importing the test question into the question bank storage module.
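The fusion and routing steps above can be sketched as follows; the sigmoid gating form of the dynamic weight is an illustrative assumption, as the embodiment only states that the fusion is a nonlinear dynamic weighting:

```python
# Sketch: dynamically weighted nonlinear fusion of the semantic and syntax
# similarity scores, followed by threshold-based routing.
import math

def multimodal_score(sem, syn, gate_bias=0.0):
    """Weight leans toward whichever score is larger (an assumed gating)."""
    alpha = 1.0 / (1.0 + math.exp(-(sem - syn + gate_bias)))  # dynamic weight
    return alpha * sem + (1.0 - alpha) * syn

def route(sem, syn, threshold=0.85):
    """Return which module receives the test question."""
    score = multimodal_score(sem, syn)
    return "audit" if score > threshold else "store"
```

A high fused score indicates a likely duplicate and routes the question to manual auditing; otherwise it enters the question bank directly.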
The application discloses a test question management method based on multimodal adaptive similarity learning. The method collects an original test question set, generates an evolution test question set through a preset large language model, generates a variation test question set through data enhancement technology, and performs fusion processing on the three sets to form a test question ecological library; the library is then input into a preset semantic similarity model and a preset syntax similarity model, respectively, to obtain the semantic similarity score and syntax similarity score of each test question; finally, a multi-modal similarity score is obtained from the two scores and compared with a preset threshold, and each test question is imported into an auditing module or a question bank storage module according to the comparison result. By fusing the original, evolution and variation test question sets into a test question ecological library, the application enriches the diversity and coverage of test question resources; by evaluating semantic and syntactic similarity with dedicated models, it measures question similarity comprehensively and improves screening accuracy; and by comparing the multi-modal similarity score with a preset threshold, it realizes intelligent distribution of test questions and ensures the timeliness and quality of the question bank.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a first embodiment of a test question management method based on multi-mode adaptive similarity learning according to the present application;
FIG. 2 is a schematic diagram of an original test question set collection process;
FIG. 3 is a schematic diagram of an adaptive similarity fusion process;
FIG. 4 is a schematic flow chart of a second embodiment of the test question management method based on multi-mode adaptive similarity learning according to the present application;
FIG. 5 is a schematic diagram of a module structure of the semantic similarity model SemBert-GCN;
FIG. 6 is a schematic flow chart of a third embodiment of a test question management method based on multi-modal adaptive similarity learning according to the present application;
FIG. 7 is a schematic block diagram of a syntactic similarity model TE-PTK;
FIG. 8 is a full-flow schematic diagram of the test question management method based on multi-mode adaptive similarity learning according to the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
Referring to fig. 1, fig. 1 is a flow chart of a first embodiment of a test question management method based on multi-mode adaptive similarity learning, in this embodiment, the method includes steps S10 to S30:
step S10, collecting an original test question set, generating an evolution test question set through a preset large language model, generating a variation test question set by adopting a data enhancement technology, and carrying out fusion treatment on the original test question set, the evolution test question set and the variation test question set to form a test question ecological base.
It should be noted that, in the scenario of building an intelligent question bank, the execution body of this embodiment may be a computing device with data processing, network communication and program execution capabilities, such as a tablet computer, a personal computer, a mobile phone or a question bank server, or another electronic device capable of implementing the same functions.
It should be understood that the original test question set may be public test question data collected from the network; for example, original test question data are obtained from structured data sources such as official university websites and MOOC platforms through automated crawler technology, using specific technical means such as API calls and web-page parsing (BeautifulSoup/Selenium).
It will be appreciated that a data collection range may be preset to ensure that data collection stays within a legal, ethical and efficient framework. For example, to ensure legitimacy, explicit authorization from the data owner or the relevant regulatory agency must be obtained. Meanwhile, the robots.txt protocol should be strictly adhered to, focusing on web pages with public information and avoiding sensitive or proprietary data. The storage and use of data should ensure security and transparency and be limited to predefined purposes only.
Specifically, referring to fig. 2, fig. 2 is a schematic diagram of the original test question set collection process. As shown in FIG. 2, when collecting the original test question set, web page crawling technology can be used to gather an artificial intelligence test question corpus. In this process, a scripted system is adopted to acquire links to the relevant pages, and Selenium's WebDriver together with Python HTML-parsing libraries (Requests and BeautifulSoup) are used to parse the web page contents and organize them into a structured data frame. After test question information extraction is completed, the information is compiled into a single data frame and stored as a CSV file of about 10 MB. Taking the server load into account, the data acquisition rate is set to one request every 5-10 seconds.
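The collection pipeline just described can be sketched with standard-library tools; the `li.question` markup, the CSV layout, and the `QuestionParser` helper are all illustrative assumptions about the source pages, not details from the application:

```python
# Sketch of the crawl-parse-store pipeline: fetch pages, extract question
# items, and persist them as CSV, throttled to one request every 5-10 s.
import csv
import random
import time
from html.parser import HTMLParser
from urllib.request import urlopen

class QuestionParser(HTMLParser):
    """Collects the text of <li class="question"> elements (assumed markup)."""
    def __init__(self):
        super().__init__()
        self.in_q = False
        self.questions = []
    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "question") in attrs:
            self.in_q = True
            self.questions.append("")
    def handle_endtag(self, tag):
        if tag == "li":
            self.in_q = False
    def handle_data(self, data):
        if self.in_q:
            self.questions[-1] += data.strip()

def parse_questions(html: str):
    parser = QuestionParser()
    parser.feed(html)
    return parser.questions

def crawl(urls, out_csv="questions.csv"):
    rows = []
    for url in urls:
        with urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        rows += [(url, q) for q in parse_questions(html)]
        time.sleep(random.uniform(5, 10))  # throttle, respecting server load
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text"])
        writer.writerows(rows)
```

In practice Selenium would replace `urlopen` for JavaScript-rendered pages, as the text notes.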
Then, the collected original test question data can be processed to ensure the quality and uniformity of the generated corpus. A standardized terminology system can be introduced to normalize core entities such as program names, department names and course codes, eliminating confusion caused by naming differences and enhancing data consistency and comparability. In addition, a series of data-consistency verification processes can be applied: through cross-validation and careful comparison against the original website information, irregular and conflicting entries in the corpus are identified and corrected, ensuring the accuracy and completeness of the collected original test question data.
Furthermore, the test questions can be finely classified and labeled with multiple labels according to dimensions such as their nature, type, field and specific knowledge points, forming an original test question set Set_ori with "gene coding".
It should be noted that the preset large language model may be constructed from general LLMs such as GPT-4 and PaLM, or from a special model fine-tuned on education-domain corpora. A set of refined prompting templates can be designed to guide the LLMs in mining the original test questions and their known information, so as to form an evolution test question set Set_evo.
By way of example, the specific requirements of the prompt template may be: 1) objectively and accurately answer the known information about the given test question; 2) ensure that the answer is refined and complete, no more than four sentences; 3) generate new test questions containing the important semantic information; 4) do not change the knowledge point features covered by the original test question when generating new questions. The term "objectively and accurately" is intended to reduce hallucinations and ambiguity in the LLMs; requirement 3 is intended to keep the semantic information of the questions generated by the LLMs uncorrupted; and the answer should contain no irrelevant information.
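A prompt template in the spirit of the four constraints listed above might look like the following; the wording is illustrative, not the application's actual template:

```python
# Hypothetical prompt template encoding the four constraints described above.
PROMPT_TEMPLATE = """You are an exam-question designer.
Given the test question below, first answer its known information objectively
and accurately, in at most four sentences. Then generate a new test question
that keeps the important semantic information and covers exactly the same
knowledge points as the original. Do not add irrelevant information.

Original question: {question}"""

def build_prompt(question: str) -> str:
    """Fill the template with one original test question."""
    return PROMPT_TEMPLATE.format(question=question)
```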
The data enhancement technique can include operations such as synonym substitution, sentence pattern recombination and conditional variation. Controllably modifying the original test question set with these different enhancement means enriches its data diversity and yields a variation test question set Set_mut.
In the data enhancement process, the method can not only adjust the difficulty gradient, change conditions and simulate situations for questions under the same knowledge point to generate hierarchical variants, but also create new question scenarios by introducing cross-disciplinary elements or combining current hot topics, further widening question coverage. In addition, synonym and paraphrase replacement strategies can be used to modify sentence expression and increase sentence diversity, while sentence swaps and random word-order swaps change the word order and sentence structure, enriching the presentation of test questions.
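A minimal mutation sketch for the synonym-replacement strategy is shown below; the toy synonym table is illustrative, whereas a production system would draw on a curated thesaurus as the text describes:

```python
# Sketch: synonym-replacement mutation over a test question. Words found in
# the (hypothetical) synonym table are swapped; all others are kept as-is.
import random

SYNONYMS = {
    "compute": ["calculate", "evaluate"],
    "maximum": ["largest", "peak"],
}

def mutate(question: str, rng=None):
    """Replace each known word with a randomly chosen synonym."""
    rng = rng or random.Random(0)
    words = question.split()
    out = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(out)
```

Sentence-pattern recombination and conditional variation would be layered on top of this in the same controllable fashion.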
It should be appreciated that the knowledge of the test questions can be mined in depth using a large language model as the "natural choice" engine. The data augmentation technology is introduced to simulate genetic variation, so that the diversity of test question data can be enriched. The original test question set, the evolution test question set generated by the large model and the variation test question set after data enhancement are fused, so that a test question ecological library with a structure can be formed.
Further, the test question ecological base can be standardized by data cleaning and filtering technology. This process involves identifying and eliminating text fragments that fit particular patterns, such as removing HTML tags, eliminating non-text characters (e.g., special symbols and emoticons), trimming excess spaces and line breaks, and normalizing the text encoding format. In addition, proper processing of numbers and special formats is included, as well as stop word filtering.
The method comprises the steps of firstly setting a data acquisition range, acquiring an original test question set in a data source based on the data acquisition range, then constructing a prompting template containing semantic constraint conditions, guiding a preset large language model through the prompting template to generate an evolution test question set based on the original test question set, expanding and mutating the original test question set through a data enhancement mode of synonym replacement, sentence pattern recombination and knowledge point expansion to generate a variation test question set, and finally fusing the original test question set, the evolution test question set and the variation test question set to obtain fused test question data, and performing deduplication processing on the fused test question data to form a standardized test question ecological base.
And S20, respectively inputting the test question ecological base into a preset semantic similarity model and a preset syntax similarity model to obtain the semantic similarity score and the syntax similarity score of each test question in the test question ecological base.
It should be noted that, the test question ecological library may be respectively put into a preset semantic similarity model SemBert-GCN and a preset syntax similarity model TE-PTK, so as to obtain the similarity of the test questions on the semantic and syntax structure levels.
It should be understood that the semantic similarity model SemBert-GCN is used for obtaining the similarity score S_g of the test question at the semantic level, and the syntactic similarity model TE-PTK is used for obtaining the similarity score S_s of the test question at the syntactic level.
Wherein SemBert-GCN may include a SemGloVe module, a preset BERT module and a graph convolutional network (GCN) module. The SemGloVe module takes the standardized and structured test question ecological library as input and maps the vocabulary to a high-dimensional vector space to capture global semantic relations among the words; the preset BERT module generates a score matrix containing rich semantic features and token embeddings by integrating the word similarity matrix. The GCN module takes the score matrix and the semantics-carrying token embeddings to obtain vector representations of similar test questions, and the vector representations are then subjected to pooling and activation to obtain the sentence-level semantic similarity score S_g.
The TE-PTK model can comprise a Chinese word segmentation module, a part-of-speech judgment and terminology module and a PT tree kernel module. The Chinese word segmentation module takes a multi-attribute-label test question information sequence as input, adopts a CNN-BiGRU-CRF network model, and cuts the gapless character sequence into a word sequence with definite meaning and boundaries according to the key words and context in the test question information. The part-of-speech judgment and terminology module can automatically assign a corresponding grammatical function label to each vocabulary item by utilizing word segmentation tools such as Stanford CoreNLP, so as to realize accurate recognition of the part of speech. In order to solve the ambiguity problem of text terms, the terms can be conceptualized by adopting the CN-Probase knowledge graph developed by the Knowledge Works laboratory of Fudan University, and terms carrying useless information can be screened out, so that the clarity and accuracy of the terms in sentences are ensured. The method can use terms as basic semantic units to construct a constituency parse tree of the short text, enhance the semantic expressiveness of the words, and calculate the number of common substructures between two trees by using the PT kernel, so as to obtain the syntactic similarity score S_s of the test question text.
And step S30, obtaining a multi-modal similarity score according to the semantic similarity score and the syntactic similarity score, comparing the multi-modal similarity score with a preset threshold value, and respectively importing each test question into an auditing module or a question library storage module according to a comparison result.
It should be understood that after the semantic similarity score S_g and the syntactic similarity score S_s are obtained, the contributions of the different similarity scores can be dynamically adjusted by using the weight coefficients α, β, δ based on an adaptive equalization policy, and an interaction term φ(S_g, S_s) is introduced to capture the interplay between the different similarity scores, so as to realize the effective fusion of the semantic similarity and the syntactic similarity and thereby calculate the multi-modal similarity score Score of the test question. Referring to fig. 3, fig. 3 is a schematic diagram of an adaptive similarity fusion process, and based on fig. 3, the calculation formula of the multi-modal similarity score may be:
Score = Softmax(α·σ(S_g) + β·σ(S_s) + δ·φ(S_g, S_s))    (1)
Where α, β, δ are weight coefficients for balancing the importance of the different similarity scores and the interaction term, and satisfy α + β + δ = 1; σ(x) is the Sigmoid function, used to limit each score to between 0 and 1. The interaction term φ(S_g, S_s), which captures the interplay between the scores of different modalities, can be defined as φ(S_g, S_s) = Tanh(μ1·S_g + μ2·S_s), where μ1, μ2 are parameters for adjusting the interaction. By means of the Tanh function, the formula captures not only the individual contribution of each modality's similarity but also the complex relationships between the similarities of the different modalities.
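By way of illustration, the weighted fusion with a Tanh interaction term can be sketched in a few lines (a minimal sketch only; the weight values α = β = 0.4, δ = 0.2 and the interaction parameters μ1 = μ2 = 0.5 are arbitrary illustrative choices, not values prescribed by the method, and the Softmax is applied across a batch of candidate questions):

```python
import math

def fuse_scores(sg, ss, alpha=0.4, beta=0.4, delta=0.2, mu1=0.5, mu2=0.5):
    """Weighted fusion of the semantic score sg and syntactic score ss:
    alpha*sigmoid(sg) + beta*sigmoid(ss) + delta*tanh(mu1*sg + mu2*ss)."""
    assert abs(alpha + beta + delta - 1.0) < 1e-9  # weights must sum to 1

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    interaction = math.tanh(mu1 * sg + mu2 * ss)   # phi(S_g, S_s)
    return alpha * sigmoid(sg) + beta * sigmoid(ss) + delta * interaction

def softmax(xs):
    """Normalize the fused scores across a batch of candidate questions."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

raw = [fuse_scores(0.9, 0.8), fuse_scores(0.2, 0.1), fuse_scores(0.5, 0.6)]
scores = softmax(raw)
```

Because the Softmax normalizes across the batch, the fused scores always sum to 1, and the most similar candidate keeps the largest score after fusion.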
It should be noted that the preset threshold may be a preset similarity threshold μ. Based on the multi-mode similarity Score, the dynamic threshold value distribution module can be used for effectively screening the warehousing qualification of the test questions. If the multi-modal similarity Score of the test question exceeds the preset similarity threshold μ, the test question will be accurately imported into the test question auditing module for further evaluation or "extinction". Otherwise, the data is included in the question bank for storage.
In a specific implementation, the semantic similarity score and the syntactic similarity score are subjected to nonlinear fusion based on a dynamic weighting mechanism to generate a comprehensive similarity score, the comprehensive similarity score is compared with a preset threshold, the test questions are imported into an auditing module if the comprehensive similarity score is higher than the preset threshold, and the test questions are imported into a question bank storage module if the comprehensive similarity score is not higher than the preset threshold.
According to the method, the device and the system, the test question ecological library is built by fusing the original test question set, the evolution test question set and the variation test question set, diversity and coverage of test question resources are enriched, semantic similarity and syntactic similarity of the test questions are respectively estimated by adopting a semantic similarity model and a syntactic similarity model, similarity of the test questions is comprehensively measured, screening accuracy is improved, intelligent distribution of the test questions is achieved through comparison of multi-mode similarity scores and a preset threshold value, and timeliness and quality of the test question library are guaranteed.
In the second embodiment of the present application, the same or similar content as in the first embodiment of the present application may be referred to the above description, and will not be repeated. On this basis, please refer to fig. 4, fig. 4 is a flowchart illustrating a second embodiment of the test question management method based on multi-mode adaptive similarity learning according to the present application.
In this embodiment, in order to illustrate a specific process of obtaining a semantic similarity score through a preset semantic similarity model, the step of inputting the test question ecological base into the preset semantic similarity model to obtain the semantic similarity score of each test question in the test question ecological base includes steps S201 to S205:
Step S201, inputting the test question ecological base into a preset semantic similarity model, wherein the preset semantic similarity model comprises a SemGloVe module, a preset BERT module, a graph neural network module and a similarity scoring module.
It should be understood that the structure of the preset semantic similarity model SemBert-GCN of the present embodiment may be described herein in conjunction with fig. 5, and fig. 5 is a schematic diagram of the module structure of the semantic similarity model SemBert-GCN.
Step S202, in the SemGloVe module, inter-vocabulary semantic associations of the test question text are established through word co-occurrence matrix analysis, and word-level similarity features are extracted in combination with an attention mechanism to construct a word similarity matrix.
It should be understood that the SemGloVe module takes the standardized "test question ecological library" as input and captures vocabulary frequencies and the co-occurrence relations between words by analyzing a global Word-to-Word co-occurrence count matrix X. In order to understand the relations between words more intuitively, the original BPE-to-BPE attention weights are converted into Word-to-Word attention weights; the conversion averages and aggregates the attention weights of all BPE tokens belonging to the same word, thereby obtaining the attention weight AW ∈ R^{K×K} of each word with respect to the other words, and the distance Div(w_i, w_j) between a target word w_i and a context word w_j is determined by using a Division distance function.
The data processing procedure of the SemGloVe module can be also described herein with reference to fig. 5, and the data processing procedure of the SemGloVe module is as follows:
First, semGloVe takes a standardized "test question ecological library" as input, and captures the vocabulary frequency and the co-occurrence relation between them by analyzing the global Word-to-Word co-occurrence count matrix X. Its term X i,j represents the total number of occurrences of a Word w j e V in the context of a Word w i e V, where V is the vocabulary of the training corpus, for the separation of global Word-to-Word co-occurrence counts w i and w j (i.e. their distance between texts), semGloVe specifies that X i,j is:
Where P_i and P_j are the positions of w_i and w_j in the context; the closer a word is to w_i, the greater the weight obtained.
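By way of illustration, the distance-weighted co-occurrence counting can be sketched as follows (the reciprocal-distance weighting 1/|p_i − p_j| per co-occurring pair is assumed from the statement that closer words obtain greater weight, and the window size is an illustrative choice):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Distance-weighted Word-to-Word co-occurrence: each pair (w_i, w_j)
    inside the window adds 1/|i - j|, so closer context words obtain a
    larger weight."""
    X = defaultdict(float)
    for i, w_i in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[(w_i, tokens[j])] += 1.0 / abs(i - j)
    return X

# toy corpus of one sentence; "the" occurs twice, at distances 1 and 2 from "model"
X = cooccurrence_counts("the model scores the test question".split())
```
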
Next, for a given word sequence W = {w_1, …, w_K}, BERT converts each word (or fragment of a word) into BPE (Byte Pair Encoding) tokens, because BERT, when processing text, breaks words down into multiple BPE units or byte pairs for encoding in order to capture finer semantic information. In order to understand the relationships between words more intuitively, the SemGloVe module converts the original BPE-to-BPE attention weights into Word-to-Word attention weights; the conversion averages and aggregates the attention weights of all BPE tokens belonging to the same word, thereby obtaining the attention weight of each word with respect to the other words. For the generated Word-to-Word attention weights AW ∈ R^{K×K} and the local window context w_j of word w_i, j ∈ [i−S, i+S], the attention weight AW_{i,j} from word w_i to word w_j can be expressed as:

AW_{i,j} = (1 / (m·n)) Σ_{k=1}^{m} Σ_{l=1}^{n} AT(k, l)    (3)
Where m and n are the numbers of subwords of w_i and w_j, respectively, and AT(k, l) represents the attention weight from BPE token t_k to t_l. The AW_{i,j} are arranged in descending order and the first s words are selected as the context words C(w_i) of w_i, so as to exclude semantically meaningless words.
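By way of illustration, the BPE-to-Word aggregation step can be sketched as follows (the attention matrix and word-to-subword spans are toy values; a real model would take them from BERT's attention heads):

```python
def word_attention(at, word_spans):
    """Convert a BPE-level attention matrix `at` into Word-to-Word weights
    AW[i][j] by averaging over all m*n subword pairs of words i and j."""
    K = len(word_spans)
    aw = [[0.0] * K for _ in range(K)]
    for i, span_i in enumerate(word_spans):
        for j, span_j in enumerate(word_spans):
            total = sum(at[k][l] for k in span_i for l in span_j)
            aw[i][j] = total / (len(span_i) * len(span_j))
    return aw

# toy example: word 0 covers BPE tokens {0, 1}, word 1 covers BPE token {2}
at = [[0.5, 0.3, 0.2],
      [0.1, 0.6, 0.3],
      [0.4, 0.4, 0.2]]
aw = word_attention(at, [[0, 1], [2]])
```
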
Finally, the distance Div (w i,wj) between the target word w i and the context word w j is determined by a Division distance function.
In a specific implementation, the SemGloVe module can acquire test question texts in the test question ecological library, acquire the vocabulary frequency and the vocabulary co-occurrence relation of each vocabulary in the test question texts through the word co-occurrence matrix, construct a global word co-occurrence count matrix, average aggregate attention weights corresponding to codes of each byte under each vocabulary to acquire the attention weights corresponding to each vocabulary, and accordingly determine semantic association among the vocabularies through a Division distance function based on the attention weights corresponding to each vocabulary to acquire a word similarity matrix.
Step S203, integrating the word similarity matrix through a preset BERT module to perform multi-level semantic characterization, and generating word embedding vectors with semantic information.
It should be understood that the preset BERT module may be a fine-tuned BERT model: by fine-tuning a baseline BERT model, the word similarity matrix S constructed from the original population S_1 and the evolutionary population S_2 generated by the LLMs is integrated into the multi-head attention mechanism of BERT. With the similarity matrix S as additional input information, the BERT model can evaluate the semantic relationships among words more accurately in its self-attention layers, thereby improving the model's understanding of complex semantic structures. The multi-head attention mechanism of BERT generates an output vector MultiHead(Θ, K, Ψ) by linearly transforming the query (Θ), key (K) and value (Ψ) vectors, applying scaled dot-product attention, and then splicing and linearly transforming the results of the multiple "heads" once more. Since the model injects the word similarity matrix S to compute a Hadamard product, the BERT attention using the scaled dot-product calculation can be denoted as Attention(Θ, K, Ψ).
The data processing procedure of the preset BERT module can be described here in conjunction with fig. 5, and the data processing procedure of the preset BERT module is as follows:
Firstly, by fine-tuning a baseline BERT model, the word similarity matrix S = {p_{1,1}, …, p_{i,j}, …, p_{l,l}} constructed from the original population S_1 = {p_1, …, p_i, …, p_l} and the evolutionary population S_2 = {p_1, …, p_i, …, p_l} generated by the LLMs is integrated into the multi-head attention mechanism of BERT. With the similarity matrix S as additional input information, the BERT model can evaluate the semantic relations among words more accurately in its self-attention layers, thereby improving the model's understanding of complex semantic structures. The multi-head attention mechanism of BERT works by linearly transforming the query (Θ), key (K) and value (Ψ) vectors and then applying scaled dot-product attention; finally, the results of the multiple "heads" are spliced and linearly transformed once more to generate the output vector, which can be expressed as:
MultiHead(Θ, K, Ψ) = Concat(head_1, …, head_h)W^O    (5)
Wherein head_i = Attention(ΘW_i^Θ, KW_i^K, ΨW_i^Ψ); W_i^Θ, W_i^K and W_i^Ψ are the parameter matrices corresponding to the query, key and value of the i-th attention head, respectively, and W^O is the weight matrix applied when the attention heads are spliced.
Finally, since the model injects the word similarity matrix S to compute a Hadamard product, so that the model focuses more on word pairs with higher similarity in the sentence pair, the BERT attention is expressed as:
Scores=ΘKT*S+MASK (6)
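By way of illustration, injecting a word similarity matrix into scaled dot-product attention via a Hadamard product can be sketched as follows (identity matrices stand in for the learned query and key projections, so this is a toy of the mechanism rather than the trained model; the explicit row-wise softmax and 1/√d scaling are standard attention steps):

```python
import numpy as np

def sim_injected_attention(Q, K, S, mask=None):
    """Scaled dot-product attention whose raw scores are modulated by a
    word similarity matrix S through a Hadamard product, before a
    row-wise softmax."""
    d = Q.shape[-1]
    scores = (Q @ K.T) * S / np.sqrt(d)   # Hadamard product with S
    if mask is not None:
        scores = scores + mask            # additive mask on padded positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

Q = np.eye(3)        # identity stands in for the learned query projection
K = np.eye(3)        # identity stands in for the learned key projection
S = np.ones((3, 3))  # uniform similarity leaves the scores unchanged
W = sim_injected_attention(Q, K, S)
```

Each row of the resulting attention matrix sums to 1, and higher entries in S amplify the attention paid to the corresponding word pairs.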
In a specific implementation, the preset BERT module can integrate the word similarity matrix into the multi-head attention mechanism of the BERT model to calculate the corresponding attention weights from the input representation vectors, integrate the attention weights corresponding to all attention heads to obtain the attention output, and perform a linear transformation on the attention output to obtain word embedding vectors carrying semantic information.
And S204, constructing an adjacency matrix according to the word embedding vector through the graph neural network module, and generating sentence vectors based on the adjacency node information aggregation in the adjacency matrix.
It should be appreciated that the word embedding h_i of each token carrying semantic information obtained by BERT is taken as input and passed to the subsequent GCN model. Unlike the standard GCN model, the score matrix Scores is used as the adjacency matrix A_{i,j}, each token serves as a node in the GCN, and relative position codes are added to the GCN so that the relative position information of the tokens can be learned. Based on the adjacency matrix A_{i,j}, for a given node i, the GCN gathers the relevant semantic information that its context words carry in A_{i,j} and represents it by computing the output of node i. After processing by the GCN module, the vector h_i' of each token is obtained, and the sentence vector h_s is obtained after average pooling.
The data processing procedure of the graph convolutional network (GCN) module can be described herein with reference to fig. 5, and the data processing procedure of the GCN module is as follows:
Firstly, the word embedding h_i of each token carrying semantic information obtained by BERT is taken as input and passed into the subsequent GCN model. Unlike the standard GCN model, the score matrix Scores is used as the adjacency matrix A_{i,j}, each token serves as a node in the GCN, and relative position codes are added to the GCN so that the relative position information of the tokens can be learned. Based on the adjacency matrix A_{i,j}, for a given node i, the GCN gathers the relevant semantic information that its context words carry in A_{i,j} and represents it by computing the output of node i:

h_i' = ReLU(Σ_j A_{i,j} W h_j + b)    (7)
Secondly, after processing by the GCN module, the vector h_i' of each token is obtained, and the vector of the sentence is obtained after average pooling:

h_s = (1/m) Σ_{i=1}^{m} h_i'    (8)
Where m is the total number of tokens, i.e. the length of the sentence sequence.
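By way of illustration, one graph-convolution aggregation step followed by average pooling can be sketched as follows (a generic single GCN layer with ReLU and no bias is assumed for simplicity; the method's layer, with its relative position codes, is more elaborate):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: every node aggregates its neighbours'
    token vectors weighted by the (score-matrix) adjacency A, then a
    linear map and ReLU are applied."""
    return np.maximum(A @ H @ W, 0.0)

def sentence_vector(H):
    """Average pooling over the m token vectors to obtain h_s."""
    return H.mean(axis=0)

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 tokens, dim 2
A = np.full((3, 3), 1.0 / 3.0)                      # toy adjacency from attention scores
W = np.eye(2)                                       # identity in place of learned weights
hs = sentence_vector(gcn_layer(H, A, W))
```
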
In a specific implementation, the GCN module can construct an adjacency matrix according to the word embedding vector and construct a graph structure representation according to the adjacency matrix, add corresponding relative position codes to nodes in the graph structure representation to obtain a graph structure with position information, perform graph convolution operation based on the graph structure with the position information to aggregate adjacent node information of the nodes to obtain updated node embeddings, aggregate the updated node embeddings and generate sentence vectors.
Step S205, in the similarity scoring module, the sentence vectors are processed through a fully connected layer and a Softmax layer in sequence, so as to obtain the semantic similarity score of each test question.
In a specific implementation, the sentence vector h_s obtained after pooling is processed through a fully connected layer and then through a Softmax layer, finally yielding the sentence semantic similarity score S_g.
In the embodiment, the semantic similarity model captures global semantic relations of words, semantic association among words and semantic information of sentences layer by layer through a SemGloVe module, a post-fine-tuning BERT module and a GCN module, so that a semantic similarity score is finally output. The method is favorable for accurately evaluating the semantic similarity of the test questions, capturing deep semantic relations, screening out high-quality and non-repeated test questions, improving the quality of a question bank and optimizing intelligent group paper.
In the third embodiment of the present application, the same or similar contents as those of the first, second or the first embodiment can be referred to the description above, and the description is omitted. On this basis, please refer to fig. 6, fig. 6 is a flowchart illustrating a third embodiment of the test question management method based on multi-mode adaptive similarity learning according to the present application.
In this embodiment, in order to illustrate a specific process of obtaining a syntax similarity score through a preset syntax similarity model, the step of inputting the test question ecological base into the preset syntax similarity model to obtain the syntax similarity score of each test question in the test question ecological base includes steps S206-S209:
Step S206, inputting the test question ecological base into a preset syntax similarity model, wherein the preset syntax similarity model comprises a Chinese word segmentation module, a part-of-speech judgment and terminology module and a PT tree kernel module.
It should be understood that, here, the structure of the preset syntax similarity model TE-PTK of the present embodiment may be described in conjunction with fig. 7, and fig. 7 is a schematic diagram of the module structure of the syntax similarity model TE-PTK.
And S207, segmenting the test question text into word sequences according to the key words and the context in the test question information through the Chinese word segmentation module.
It should be noted that the Chinese word segmentation module adopts a CNN-BiGRU-CRF network model to execute the word segmentation task on the test question information sequence, and mainly comprises a CNN layer, a BiGRU layer and a CRF layer. The module takes the multi-attribute-label test question information sequence as input, performs Chinese word segmentation with the CNN-BiGRU-CRF network model, and cuts the gapless character sequence into a word sequence with definite meaning and boundaries according to the key words and context in the test question information.
The data processing procedure of the chinese word segmentation module can be described herein with reference to fig. 7, and the data processing procedure of the chinese word segmentation module is as follows:
Firstly, the CNN layer, combined with the embedding layer, maps the text sequence into a word vector matrix C ∈ R^{k×d}, where each word vector x_i ∈ R^d, k is the sentence length and d is the word vector dimension; the adaptive learning capability of the convolutional feature extractor is used to effectively identify and capture the deep features E = [h_1, h_2, …, h_n] in the text data.
Secondly, the Bi-GRU layer adopts a bidirectional gated recurrent unit network to accurately capture the temporal dependencies of the text, including long-term dependencies and context information. Through Bi-GRU layer processing, the deep feature vector E = [h_1, h_2, …, h_n] in the text data is converted into a sequence representation M = {m_1, …, m_i, …, m_n}, where m_i is the i-th input character vector of the CRF layer.
Finally, the CRF layer optimizes the prediction result of the sequence labeling task by considering the dependency relationships among the labels in the sequence. Let the tag sequence be Y = {y_1, y_2, …, y_n}, and let Y(M) be the set of possible tag sequences for M; the CRF conditional probability P(Y|M) is expressed as:

P(Y|M) = exp(Σ_{t=1}^{n} φ(y_{t−1}, y_t, M)) / Σ_{Y'∈Y(M)} exp(Σ_{t=1}^{n} φ(y'_{t−1}, y'_t, M))    (9)

Wherein the function φ(y_{t−1}, y_t, M) calculates the score of the input tag sequence Y as φ(y_{t−1}, y_t, M) = A_{y_{t−1}, y_t} + P_{t, y_t}; A is the label transition score matrix, A_{y_i, y_{i+1}} is the score for going from label y_i to y_{i+1}, and the larger its value, the greater the transition probability; P ∈ R^{n×k} is the score matrix output by the Bi-GRU network, and P_{i, y_i} is the score of assigning label y_i to the i-th character. The corresponding loss function, obtained by maximum likelihood estimation through the CRF, can be expressed as:

Loss = −Σ_{i=1}^{N} log P(Y_i | M_i)    (10)
Wherein N is the number of the trained marked sentences, and Y i is the real tag sequence of the ith sentence.
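By way of illustration, the CRF conditional probability can be sketched with toy scores (brute-force enumeration of all tag paths replaces the forward algorithm, and the emission and transition scores are arbitrary values rather than the output of a trained CNN-BiGRU network):

```python
import math
from itertools import product

import numpy as np

def path_score(emissions, transitions, tags):
    """Score of one tag path: emission scores P[t, y_t] plus transition
    scores A[y_{t-1}, y_t]."""
    s = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        s += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return s

def crf_log_prob(emissions, transitions, tags):
    """log P(Y|M): the path score minus the log-partition over all tag
    paths (brute force; real CRFs use the forward algorithm)."""
    n, k = emissions.shape
    log_z = np.logaddexp.reduce([path_score(emissions, transitions, list(p))
                                 for p in product(range(k), repeat=n)])
    return path_score(emissions, transitions, tags) - log_z

E = np.array([[0.5, 1.0], [0.2, 0.3]])  # emission scores: n=2 characters, k=2 tags
A = np.array([[0.1, 0.4], [0.3, 0.2]])  # tag-to-tag transition scores
total = sum(math.exp(crf_log_prob(E, A, [y0, y1]))
            for y0 in range(2) for y1 in range(2))
```

Since the partition function normalizes over all paths, the probabilities of the four possible tag sequences sum to 1, and the path with the larger combined emission and transition score receives the larger probability.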
In the specific implementation, a Chinese word segmentation module maps a text sequence into a word vector matrix through a CNN layer and an embedding layer to obtain a word vector matrix corresponding to a test question text, captures context information in the test question text through a BIGRU layer, converts the word vector matrix into a sequence representation, and obtains the dependency relationship of each label in the sequence representation through a CRF layer to optimize the prediction result of a sequence labeling task to obtain a word sequence.
And step S208, marking the part of speech of each word in the word sequence through the part-of-speech judgment and terminology module, and performing terminology representation on each marked word to obtain a plurality of term triples.
It should be understood that, after an accurate word segmentation result is obtained, the part-of-speech judgment and terminology module may add linguistic annotations such as part-of-speech tags to the text through the Stanford CoreNLP toolkit, and define concept classes and instance classes to distinguish noun terms. Given that short texts mainly consist of terms, semantic vectors are constructed through a dynamic vector space model: term sets are extracted from the short texts, each term is expressed in the form of a triple (term, part-of-speech tag, concept), and the term sets are combined to form a joint term set. A semantic vector for each term is constructed based on term similarity, while the smooth inverse frequency is adopted as the attention weight to emphasize the different contributions of different terms to the meaning of the short text.
The data processing procedure of the part-of-speech judgment and terminology module can be described herein with reference to fig. 7, and the data processing procedure of the part-of-speech judgment and terminology module is as follows:
Firstly, after a relatively accurate word segmentation result is obtained from the Chinese word segmentation module, the part-of-speech judgment and terminology module can use the natural language processing toolkit Stanford CoreNLP to determine the part of speech of each word, and can generate linguistic annotations for the text by calling the corenlp library, including part-of-speech tagging, boundary recognition of sentences and tokens, named entity recognition, quotation attribution and relationship recognition, etc. To explicitly distinguish the parts of speech of noun terms, type(t) is defined to divide terms into concept classes and instance classes:
type(t) = instance, if freq(e)/|E_t| ≥ freq(c)/|C_t|; otherwise type(t) = concept    (11)

Where E_t / C_t is the set of instances/concepts of term t, freq(e) / freq(c) is the frequency of the instance/concept, and |·| is the number of terms in the set.
Next, considering that short texts mainly consist of terms, it is reasonable and effective to use terms to express the meaning of a test question. Aiming at the sparsity problem of short texts, a dynamic vector space model is introduced to construct the semantic vector of a short text. The term sets T_1 and T_2 are extracted from the two short texts S_1 and S_2, respectively, each term being represented in the term set in the form of a triple (term, part-of-speech tag, concept), and then the semantic vectors of S_1 and S_2 are constructed. The term sets T_1 and T_2 are first combined to form the joint term set T, and then a semantic vector dimension is constructed for each term in T based on term similarity. Taking S_1 as an example, the i-th dimension of the semantic vector is calculated as follows:
semantic_vector_i = sim(vector_1, vector_2) · W_1 · W_2    (14)
If the term does not belong to T_1, the semantic similarity between its vector and each item in T_1 is calculated, and the score of the item in T_1 with the highest similarity to the vector is selected; if the term belongs to T_1, the score is 1. Here sim(vector_1, vector_2) is the cosine similarity function of the two vectors.
Finally, since different terms contribute differently to the short text (for example, a stop word contributes little to its overall meaning), the attention weight W emphasizes a term's contribution with the smooth inverse frequency (SIF), which can be expressed by the following formula:

W = a / (a + L(node))    (15)

Wherein a is the smoothing parameter and L(node) is the word frequency of the node.
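By way of illustration, the per-dimension construction with cosine similarity and SIF attention weights can be sketched as follows (the smoothing parameter a = 1e-3 is an assumed default, and applying both terms' weights multiplicatively follows the semantic_vector_i formula above; taking the best cosine match also covers the in-set case, since a term matches itself with similarity 1):

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 for a zero vector)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def sif_weight(freq, a=1e-3):
    """W = a / (a + L(node)): frequent terms (e.g. stop words) get small weights."""
    return a / (a + freq)

def semantic_dim(term_vec, own_term_vecs, w1, w2):
    """One dimension of a short text's semantic vector: the best cosine
    match against the text's own terms, scaled by the two SIF weights."""
    sim = max(cosine(term_vec, v) for v in own_term_vecs)
    return sim * w1 * w2
```
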
In a specific implementation, a part-of-speech judgment and terminology module carries out part-of-speech labeling on each word in a word sequence by adopting a natural language processing tool kit to obtain a word sequence after part-of-speech labeling, confirms a plurality of key terms based on the word sequence after part-of-speech labeling, carries out conceptual mapping on each key term to obtain a term list after conceptual mapping, constructs semantic vector representation of each key term through a dynamic vector space model based on the term list after conceptual mapping, calculates semantic similarity among each key term, integrates each key term according to the semantic similarity among each key term to form a joint term set, and the joint term set consists of a plurality of term triples.
In step S209, in the PT tree kernel building module, a constituency parse tree CPT is constructed based on each term triple, node similarity is calculated through the PT kernel, and the syntactic similarity score is obtained according to the similarity calculation result.
It should be understood that the PT tree kernel building module may accurately build the CPT of a short text by using terms as the basic semantic units, calculate the similarity of corresponding nodes in the CPTs by using the PT tree kernel (PTK), and accumulate and normalize the similarity scores of all nodes to obtain a score S_s that comprehensively reflects the similarity of the syntactic structures of the two sentences.
The data processing procedure of the PT tree kernel building module can also be described herein with reference to fig. 7, and is as follows:
First, a constituency parse tree (CPT) can be constructed by the PT tree kernel building module; the CPT reveals the phrase structures in a short text and their hierarchical syntactic relationships, and the tree structure effectively exposes the structural information of the text. Since a single word may not be sufficient to convey complete semantics and a short text may not fully follow standard written grammar, traditional word segmentation methods may not guarantee the accuracy of the CPT. Therefore, the CPT of a short text can be constructed more accurately by using terms as the basic semantic units.
Next, the syntactic similarity is calculated using the PT kernel. Similarly to other tree kernel calculation methods, the PT tree kernel function PTK between the trees T_1 and T_2 is defined as:

PTK(T_1, T_2) = Σ_{node_1 ∈ N_{T_1}} Σ_{node_2 ∈ N_{T_2}} Δ(node_1, node_2)    (16)
Wherein, T_1 and T_2 are the CPTs of S_1 and S_2 respectively, N_{T_1} and N_{T_2} are the node sets of T_1 and T_2, and Δ(node_1, node_2) is the number of common fragments rooted at node_1 and node_2, which is the core of the tree kernel. Evaluating the common PTs rooted at nodes node_1 and node_2 requires selecting shared subsets of the two nodes' children; in view of the importance of their order in the syntactic structure, these are generated using a subsequence kernel method.
In the PT kernel, Δ(node_1, node_2) is given by the following formula:

Δ(node_1, node_2) = α(β² + Σ_{p=1}^{p_min} Δ_p(c_{node_1}, c_{node_2}))    (17)

Where α and β are attenuation factors, α penalizing the height of the tree and β the length of the subsequence; c_{node_1} and c_{node_2} are the ordered child subsequences of node_1 and node_2; p_min returns the minimum sequence length between c_{node_1} and c_{node_2}; and Δ_p calculates the number of common subtrees rooted in subsequences of length p. To avoid introducing too much noise, semantic information is integrated into the tree kernel: when node_1 and node_2 are both leaf nodes, the similarity is calculated using equation (14) so that semantic information is taken into account.
To better understand the above formula, $\Delta_p$ is expressed as a recursive function:

$$\Delta_p(n_1 a, n_2 b) = \Delta(a, b) \sum_{i=1}^{|n_1|} \sum_{r=1}^{|n_2|} \beta^{|n_1|-i+|n_2|-r}\, \Delta_{p-1}(n_1[1:i],\, n_2[1:r])$$

where $n_1 a$ and $n_2 b$ are the ordered child sequences $c_{node_1}$ and $c_{node_2}$, with $a$ and $b$ their last elements and $n_1$, $n_2$ the remaining prefixes; $n_1[1:i]$ and $n_2[1:r]$ denote the subsequences from 1 to $i$ in $n_1$ and from 1 to $r$ in $n_2$, respectively; and $\Delta_{p-1}$ is calculated recursively, stopping when a leaf node is reached.
It will be appreciated that to calculate the syntactic similarity of two test questions, a CPT is first constructed for each sentence, and the similarity of corresponding nodes in the trees is then calculated using equation (17). The similarity scores of all nodes are then accumulated and normalized to obtain a composite score $S_s$ reflecting the syntactic structural similarity of the two sentences.
In a specific implementation, the PT tree core building module builds a corresponding syntax parse tree CPT for each test question sentence based on its term triples, calculates the similarity of corresponding nodes across the parse trees through a preset common-fragment counting formula, and normalizes the node similarities to obtain the syntactic similarity score between test questions.
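The PT kernel computation described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the decay values for α and β, the gap penalty, and the plain label match at leaf nodes (where the patent instead plugs in the semantic similarity of its equation (14)) are all assumptions made for the example.

```python
import itertools
import math
from dataclasses import dataclass, field

# Illustrative decay factors; the patent does not fix concrete values.
ALPHA = 0.4  # attenuation for tree height
BETA = 0.4   # attenuation for subsequence length

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def delta(n1, n2):
    """Number of common fragments rooted at n1 and n2 (cf. equation (17))."""
    if n1.label != n2.label:
        return 0.0
    if not n1.children and not n2.children:
        # Leaf nodes: the patent integrates semantic similarity here;
        # a plain label match is used as a stand-in.
        return ALPHA * BETA ** 2
    p_min = min(len(n1.children), len(n2.children))
    return ALPHA * (BETA ** 2 + sum(
        delta_p(p, n1.children, n2.children) for p in range(1, p_min + 1)))

def delta_p(p, c1, c2):
    """Common subtrees rooted in child subsequences of length p."""
    total = 0.0
    for I in itertools.combinations(range(len(c1)), p):
        for J in itertools.combinations(range(len(c2)), p):
            prod = 1.0
            for i, j in zip(I, J):
                d = delta(c1[i], c2[j])
                if d == 0.0:
                    prod = 0.0
                    break
                prod *= d
            if prod:
                # Penalize gaps inside either subsequence.
                span = (I[-1] - I[0] - p + 1) + (J[-1] - J[0] - p + 1)
                total += (BETA ** span) * prod
    return total

def collect(tree):
    nodes, stack = [], [tree]
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    return nodes

def ptk(t1, t2):
    """Equation (16): sum delta over all node pairs of the two CPTs."""
    return sum(delta(a, b) for a in collect(t1) for b in collect(t2))

def syntactic_score(t1, t2):
    """Normalized kernel value in [0, 1], used as the score S_s."""
    return ptk(t1, t2) / math.sqrt(ptk(t1, t1) * ptk(t2, t2))
```

On two toy parse trees, `syntactic_score` returns 1.0 for identical trees and a value strictly between 0 and 1 for trees that share structure but differ in a leaf, matching the normalization step described above.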
In this embodiment, the syntactic similarity model processes the test question information layer by layer through the Chinese word segmentation module, the part-of-speech judgment and terminology module, and the PT tree core building module, moving from word sequences to terms and then to similarity computation over the syntactic structure, and finally outputs a syntactic similarity score. This helps focus on the syntactic structure of the test questions, accurately evaluate structural similarity, identify potential structural problems, enhance the expressive diversity of the question bank, and support personalized recommendation.
In addition, the overall flow of the present application is described below with reference to fig. 8, which is a schematic diagram of the overall flow of the test question management method based on multimodal adaptive similarity learning of the present application.
As can be seen from fig. 8, the whole process can be divided into an initial construction stage, an evolution stage, a screening stage and an elimination stage.
In the initial construction stage, artificial-intelligence test question corpus information is collected from open-source network resources, and the test question information is given clear gene codes through fine-grained classification and multi-label annotation, yielding the original test question set.
In the evolution stage, an evolved test question set is generated through a preset large language model, and a variant test question set is generated with data enhancement techniques; the original test question set, the evolved test question set and the variant test question set are then fused to form the test question ecological library.
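The variant-generation and fusion steps of the evolution stage can be sketched as below. The synonym table, the single-word mutation policy, and the order-preserving deduplication are illustrative assumptions; a real system would draw replacements from a thesaurus, a domain glossary, or the large language model itself.

```python
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "describe": ["explain", "outline"],
    "algorithm": ["method", "procedure"],
}

def mutate(question, table, rng):
    """Synonym-replacement mutation: swap one known word for a synonym."""
    words = question.split()
    hits = [i for i, w in enumerate(words) if w in table]
    if not hits:
        return question
    i = rng.choice(hits)
    words[i] = rng.choice(table[words[i]])
    return " ".join(words)

def build_ecological_library(original, evolved, mutated):
    """Fuse the three sets and deduplicate while preserving order."""
    seen, library = set(), []
    for q in original + evolved + mutated:
        if q not in seen:
            seen.add(q)
            library.append(q)
    return library
```

For example, `mutate("describe the gradient descent algorithm", SYNONYMS, random.Random(42))` yields a paraphrased variant, and `build_ecological_library` merges it with the original and evolved sets without duplicates.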
Next, the test question data in the test question ecological library undergoes text preprocessing and is fed into the preset semantic similarity model SemBert-GCN and the preset syntactic similarity model TE-PTK, respectively. In SemBert-GCN, the data passes sequentially through the SemGloVe module, the preset BERT module and the graph neural network (GCN) module, which outputs the semantic similarity score $S_g$; in TE-PTK, it passes sequentially through the Chinese word segmentation module, the part-of-speech judgment and terminology module, and the PT tree core building module, which outputs the syntactic similarity score $S_s$.
The semantic similarity score $S_g$ and the syntactic similarity score $S_s$ are input into the adaptive similarity fusion module ASIM, which applies an adaptive equalization strategy to fuse the multimodal features of the test questions and obtain the multimodal similarity score.
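The fusion step can be sketched as follows. The patent does not disclose the exact form of its adaptive equalization strategy here, so the softmax-style weighting below, including the temperature parameter `tau`, is purely an illustrative stand-in for a nonlinear, dynamically weighted fusion.

```python
import math

def asim_fuse(s_g, s_s, tau=0.5):
    """Adaptively fuse the semantic score S_g and syntactic score S_s.

    A softmax over the two scores shifts weight toward the larger
    (more confident) signal; this specific rule is an assumption,
    not the patent's disclosed formula.
    """
    wg = math.exp(s_g / tau)
    ws = math.exp(s_s / tau)
    w = wg / (wg + ws)
    return w * s_g + (1 - w) * s_s
```

With equal inputs the fused score equals them; with unequal inputs it lies between the two but above their plain average, reflecting the adaptive weighting.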
In the screening stage, the dynamic threshold allocation module screens test questions for admission to the question bank based on the multimodal similarity score obtained in the preceding stage. If the multimodal similarity score of a test question exceeds the preset similarity threshold $\mu$, the test question is routed to the test question review module for further evaluation or deletion; otherwise, it is included in the question bank.
Through this overall flow, the method deeply fuses the semantic and syntactic information of test questions, accurately measures their similarity, automatically screens out repeated and low-quality test questions, and ensures the freshness and quality of the question bank content, thereby achieving precise screening and quality improvement of test question resources, optimizing the structure of the question bank and better meeting the high standards of intelligent education development.
It should be noted that the foregoing examples are only intended to aid understanding of the present application and do not limit the test question management method based on multimodal adaptive similarity learning of the present application; simple transformations in further forms made on the basis of this technical idea remain within the protection scope of the present application.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or system that comprises that element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments. The foregoing are only some embodiments of the present application and do not limit its scope; all equivalent structural changes made using the description and drawings of the present application under its technical concept, or direct/indirect applications in other related technical fields, are likewise included within the scope of the present application.

Claims (10)

1. A test question management method based on multimodal adaptive similarity learning, characterized in that the method comprises:
collecting an original test question set, generating an evolved test question set through a preset large language model, generating a variant test question set using data enhancement techniques, and fusing the original test question set, the evolved test question set and the variant test question set to form a test question ecological library;
inputting the test question ecological library into a preset semantic similarity model and a preset syntactic similarity model respectively, to obtain a semantic similarity score and a syntactic similarity score for each test question in the test question ecological library;
obtaining a multimodal similarity score from the semantic similarity score and the syntactic similarity score, comparing the multimodal similarity score with a preset threshold, and importing each test question into a review module or a question bank storage module according to the comparison result.

2. The method according to claim 1, characterized in that the step of collecting the original test question set, generating the evolved test question set through the preset large language model, generating the variant test question set using data enhancement techniques, and fusing the original, evolved and variant test question sets to form the test question ecological library comprises:
setting a data collection scope, and collecting the original test question set from data sources based on the data collection scope;
constructing a prompt template containing semantic constraints, and using the prompt template to guide the preset large language model to generate the evolved test question set based on the original test question set;
expanding and mutating the original test question set through the data enhancement methods of synonym replacement, sentence-pattern reorganization and knowledge point expansion, to generate the variant test question set;
fusing the original, evolved and variant test question sets to obtain fused test question data, and deduplicating the fused test question data to form a standardized test question ecological library.

3. The method according to claim 1, characterized in that the step of inputting the test question ecological library into the preset semantic similarity model to obtain the semantic similarity score of each test question comprises:
inputting the test question ecological library into the preset semantic similarity model, which comprises a SemGloVe module, a preset BERT module, a graph neural network module and a similarity scoring module;
in the SemGloVe module, establishing inter-word semantic associations of the test question text through word co-occurrence matrix analysis, and extracting word-level similarity features in combination with an attention mechanism to construct a word similarity matrix;
integrating the word similarity matrix through the preset BERT module for multi-level semantic representation, generating word embedding vectors carrying semantic information;
constructing an adjacency matrix from the word embedding vectors through the graph neural network module, and generating sentence vectors by aggregating adjacent-node information in the adjacency matrix;
in the similarity scoring module, processing the sentence vectors sequentially through a fully connected layer and a Softmax layer to obtain the semantic similarity score of each test question.

4. The method according to claim 3, characterized in that the step of establishing inter-word semantic associations through word co-occurrence matrix analysis and extracting word-level similarity features with an attention mechanism to construct the word similarity matrix comprises:
obtaining the test question text from the test question ecological library, deriving the word frequency and word co-occurrence relationships of each word through the word co-occurrence matrix, and constructing a global word co-occurrence count matrix;
averaging and aggregating the attention weights of the byte-pair encodings under each word to obtain the attention weight of each word;
determining the semantic associations between words through a Division distance function based on the attention weights of the words, obtaining the word similarity matrix.

5. The method according to claim 3, characterized in that the step of integrating the word similarity matrix through the preset BERT module for multi-level semantic representation to generate word embedding vectors carrying semantic information comprises:
integrating the word similarity matrix into the multi-head attention mechanism of the BERT model, so as to compute the corresponding attention weights from the input representation vectors;
integrating the attention weights of all attention heads to obtain the attention output, and linearly transforming the attention output to obtain word embedding vectors carrying semantic information.

6. The method according to claim 3, characterized in that the step of constructing the adjacency matrix from the word embedding vectors through the graph neural network module and generating sentence vectors by aggregating adjacent-node information comprises:
constructing an adjacency matrix from the word embedding vectors, and building a graph structure representation from the adjacency matrix;
adding corresponding relative position encodings to each node in the graph structure representation, obtaining a graph structure with position information;
performing graph convolution on the graph structure with position information to aggregate the adjacent-node information of each node, obtaining updated node embeddings;
aggregating the updated node embeddings to generate the sentence vector.

7. The method according to claim 1, characterized in that the step of inputting the test question ecological library into the preset syntactic similarity model to obtain the syntactic similarity score of each test question comprises:
inputting the test question ecological library into the preset syntactic similarity model, which comprises a Chinese word segmentation module, a part-of-speech judgment and terminology module, and a PT tree core building module;
through the Chinese word segmentation module, segmenting the test question text into word sequences according to the key words and context in the test question information;
through the part-of-speech judgment and terminology module, tagging each word in the word sequence with its part of speech and representing the tagged words as terms, obtaining a number of term triples;
in the PT tree core building module, constructing a syntax parse tree CPT based on the term triples, calculating node similarity through the PT kernel, and obtaining the syntactic similarity score from the similarity calculation results.

8. The method according to claim 7, characterized in that the Chinese word segmentation module is built on a CNN-BiGRU-CRF composite neural network model, and the step of segmenting the test question text into word sequences according to the key words and context in the test question information comprises:
mapping the text sequence into a word vector matrix through a CNN layer combined with an embedding layer, obtaining the word vector matrix corresponding to the test question text;
capturing contextual information in the test question text through a BiGRU layer, converting the word vector matrix into a sequence representation;
capturing the dependencies between labels in the sequence representation through a CRF layer to optimize the predictions of the sequence labeling task, obtaining the word sequence.

9. The method according to claim 7, characterized in that the step of tagging each word in the word sequence with its part of speech and representing the tagged words as terms to obtain term triples comprises:
performing part-of-speech tagging on each word in the word sequence with a natural language processing toolkit, obtaining the part-of-speech-tagged word sequence;
identifying a number of key terms from the part-of-speech-tagged word sequence, and performing concept mapping on each key term to obtain a concept-mapped term list;
constructing semantic vector representations of the key terms through a dynamic vector space model based on the concept-mapped term list;
calculating the semantic similarity between the key terms, and integrating the key terms according to these similarities to form a joint term set composed of a number of term triples.

10. The method according to claim 1, characterized in that the step of obtaining the multimodal similarity score from the semantic similarity score and the syntactic similarity score, comparing it with the preset threshold, and importing each test question into the review module or the question bank storage module according to the comparison result comprises:
performing nonlinear fusion of the semantic similarity score and the syntactic similarity score based on a dynamic weighting mechanism, generating the multimodal similarity score;
comparing the multimodal similarity score with the preset threshold;
if the multimodal similarity score is higher than the preset threshold, importing the test question into the review module;
if the multimodal similarity score is not higher than the preset threshold, importing the test question into the question bank storage module.
CN202510434754.6A 2025-04-08 2025-04-08 Test question management method based on multimodal adaptive similarity learning Pending CN120336505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510434754.6A CN120336505A (en) 2025-04-08 2025-04-08 Test question management method based on multimodal adaptive similarity learning

Publications (1)

Publication Number Publication Date
CN120336505A true CN120336505A (en) 2025-07-18

Family

ID=96364498


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN121210953A (en) * 2025-11-27 2025-12-26 杭州元语智能科技有限公司 Evaluation data synthesis system integrating multi-model collaborative questions and multi-strategy filtering



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination