
CN118395987A - BERT-based landslide hazard assessment named entity identification method of multi-neural network - Google Patents


Info

Publication number
CN118395987A
CN118395987A (application CN202410411395.8A)
Authority
CN
China
Prior art keywords
landslide
hazard assessment
bert
landslide hazard
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410411395.8A
Other languages
Chinese (zh)
Inventor
丁雨淋
孙倩倩
胡翰
朱庆
吴玉婷
赵小霆
陈曦
卢文龙
郝蕊
程智博
李泰灃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202410411395.8A
Publication of CN118395987A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a BERT (Bidirectional Encoder Representations from Transformers)-based multi-neural-network method for recognizing named entities in landslide hazard assessment, which mainly comprises the following steps: (1) English literature on landslide hazard assessment is collected, and the abstracts are annotated with the Stanford CoreNLP natural language toolkit and the BIO tagging scheme; the annotations cover landslide occurrence locations, landslide influencing factors, and landslide hazard assessment methods. (2) The landslide hazard assessment corpus is divided into a training set, a test set, and a validation set in a fixed proportion; the BERT model and its variants (such as ALBERT and RoBERTa) are adopted as pre-trained language models to encode the dataset, the named entity recognition performance of the different models is compared, and contextual semantic dependency information is captured to generate word vectors with stronger representation and expression capability. (3) The resulting word vectors are fed into a BiLSTM layer that learns long-distance semantic information in the text, and the captured semantic vectors are then passed to a CRF decoding layer to obtain the optimal named entity recognition result in the landslide hazard assessment field. Through BERT and its variant models, the invention achieves strong natural language processing performance, and by fusing BiLSTM and CRF feature vectors for the landslide hazard assessment named entity recognition task it attains stronger feature extraction capability.

Description

A named entity recognition method for landslide hazard assessment based on a BERT-based multi-neural network

Technical Field

The present invention relates to the field of landslide risk assessment, and in particular to a named entity recognition method for landslide hazard assessment based on a BERT multi-neural network.

Background

Timely and accurate assessment of landslide hazard is of great significance for disaster prevention and mitigation. However, information related to landslide hazard assessment is usually scattered across large amounts of unstructured text, such as geological exploration reports, academic papers, and news reports, making it difficult to exploit these valuable information resources efficiently.

The traditional approach is to manually identify and extract key landslide-related entity information, such as locations and disaster-inducing factors, from these texts. However, this process suffers from low efficiency, high labor cost, and a high error rate, and cannot meet practical needs. An automated method is therefore urgently needed to identify and extract landslide-related entity information from unstructured text efficiently and accurately.

Named entity recognition is an important technology in natural language processing: by automatically identifying and classifying entity words in text (such as names of people, places, and organizations), it lays the foundation for further information processing. With the development of deep learning, neural-network-based named entity recognition models have shown excellent performance. Among them, BERT is a language model that acquires general semantic representation capability through large-scale unsupervised pre-training and has been widely applied to natural language processing tasks. Using BERT and its variants as pre-trained models, combined with neural networks such as BiLSTM and CRF, can greatly improve the accuracy and robustness of named entity recognition.

Summary of the Invention

The present invention provides a BERT-based multi-neural-network named entity recognition method for landslide hazard assessment. It aims to solve two problems: traditional landslide hazard assessment relies on manual analysis of literature and data, which is labor-intensive and time-consuming, and existing named entity recognition methods are limited by the corpus and by model expressiveness when processing complex natural language text. By combining the advantages of BERT and its variant models, key information in the landslide hazard assessment literature is extracted and annotated automatically, thereby improving the accuracy and efficiency of landslide hazard assessment, reducing the manual burden, and providing more effective technical support for landslide disaster prevention and mitigation.

To achieve the above object, the present invention adopts the following technical scheme:

A BERT-based multi-neural-network named entity recognition method for landslide hazard assessment comprises the following steps:

Step 1, corpus construction: collect English literature abstracts containing landslide hazard assessment information, preprocess them to generate normalized structured data, and build a corpus for the landslide hazard assessment domain;

Step 2, data annotation: use the Stanford CoreNLP natural language toolkit to perform word segmentation, part-of-speech tagging, and syntactic analysis on the data, then manually annotate the landslide hazard assessment named entities with the BIO (Beginning-Inside-Outside) tagging scheme, covering landslide occurrence locations, landslide influencing factors, and landslide hazard assessment methods;

Step 3, divide the annotated data into a training set, a test set, and a validation set in a fixed proportion, and encode the annotated data with pre-trained language models (BERT and its variants ALBERT and RoBERTa) to obtain word vector representations containing contextual semantic information;

Step 4, input the word vectors obtained in step 3 into a BiLSTM (Bi-directional Long Short-Term Memory) model to capture the long-range semantic dependencies of the input sequence;

Step 5, taking into account the relationships and constraints between labels within the sequence, decode the label path of the entire input sequence with a CRF decoding layer;

Step 6, fine-tune the BERT-based multi-neural network on the target-domain corpus to obtain the best named entity recognition result.

Further, in step 1, corpus construction is based on the Google Scholar search keyword "Landslide Hazard Assessment"; Python is used to crawl and collect English literature abstracts in the landslide hazard assessment field. The preprocessing includes removing repeated characters, empty characters, and special symbols from the corpus and outputting normalized text data, so that the data maintains high accuracy.

Further, in step 2, the preprocessed landslide hazard assessment corpus is annotated. The specific annotation process includes:

Step 2.1: annotate the data with the Stanford CoreNLP natural language toolkit through its Python interface: first use sentence splitting to divide the text into sentences, tokenization to divide the text into tokens such as words and punctuation marks, part-of-speech tagging to assign each token a part of speech (noun, verb, adjective, etc.), and finally parsing to build the syntactic parse tree and part-of-speech structure of the text;

Step 2.2: manually annotate the landslide hazard assessment named entities with the BIO (Beginning-Inside-Outside) scheme, where B (Beginning) marks the first token of an entity, I (Inside) marks tokens inside an entity, and O (Outside) marks non-entity tokens. The domain corpus is annotated with three entity types: landslide occurrence location, landslide influencing factor, and landslide hazard assessment method;

Further, step 3 specifically includes:

Step 3.1: divide the annotated dataset into a training set, a test set, and a validation set in the ratio 7:1:2;

Step 3.2: based on the training, test, and validation sets, train a fully supervised BERT pre-trained language model on a laboratory server, outputting word vectors x = (x_1, x_2, ..., x_N);

Step 3.3: based on the training, test, and validation sets, train a fully supervised ALBERT pre-trained language model on a laboratory server, outputting word vectors y = (y_1, y_2, ..., y_N);

Step 3.4: based on the training, test, and validation sets, train a fully supervised RoBERTa pre-trained language model on a laboratory server, outputting word vectors z = (z_1, z_2, ..., z_N);

Further, in step 4, the word vectors x, y, and z obtained in step 3 are each fed into a BiLSTM layer, which learns features from the context of the text sequence; the output contains the forward and backward information corresponding to each word in the sequence, capturing its semantic features.

Further, in step 5, the CRF layer decodes on top of the BiLSTM output, resolving the dependencies between labels to obtain the optimal label sequence, so that the output label sequence attains the globally optimal probability;

Further, in step 6, based on the named entity recognition results obtained in step 5, the training hyperparameters (such as learning rate, batch size, and number of training epochs) are adjusted to maximize the combined performance on three evaluation metrics: precision, recall, and F1 score.

The technical method provided by this application brings the following beneficial effects:

1. Unstructured data in a specific domain is collected and preprocessed to generate a structured domain corpus.

2. The invention trains BERT, ALBERT, and RoBERTa pre-trained language models on the domain-specific dataset; by using different models to fully mine the deep semantic features of the text, word vectors containing contextual semantic information are obtained.

3. A BiLSTM+CRF model is fused on top of the pre-trained models, and the model's generalization ability is enhanced through strong cross-domain transfer and multi-neural-network integration. Finally, the hyperparameters of the named entity recognition model are fine-tuned on the domain corpus to reach optimal precision, recall, and F1 performance, yielding the best named entity recognition results for landslide hazard assessment.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of the overall steps of the method of the present invention;

FIG. 2 is a flowchart of corpus construction in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the structure of the BERT-based multi-neural-network model in an embodiment of the present invention.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope of protection of the present invention.

This embodiment provides a BERT-based multi-neural-network named entity recognition method for landslide hazard assessment which, as shown in FIG. 1, mainly includes the following steps:

Step 1, corpus construction: collect English literature abstracts containing landslide hazard assessment information, preprocess them to generate normalized structured data, and build a corpus for the landslide hazard assessment domain. As shown in FIG. 2, the detailed sub-steps are as follows:

Step 1.1: set the literature search keyword "Landslide Hazard Assessment" on Google Scholar, then crawl with Python to collect English literature abstracts in the landslide hazard assessment field;

Step 1.2: process the repeated characters, empty characters, and special symbols that appear in the crawled target-domain data, and output normalized text data to build the landslide hazard assessment corpus. The corpus mainly covers three types of information: landslide hazard assessment method (Method), landslide occurrence location (Location), and landslide influencing factor (Factor).
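The cleanup in step 1.2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the patent's actual crawler pipeline; the function name and the exact character whitelist are assumptions:

```python
import re

def clean_abstract(text: str) -> str:
    """Normalize a crawled abstract: strip control/empty characters,
    drop stray special symbols, and collapse repeated whitespace."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)   # control / empty characters
    text = re.sub(r"[^\w\s.,;:()%/-]", "", text)   # stray special symbols (assumed whitelist)
    text = re.sub(r"\s{2,}", " ", text)            # collapse repeated whitespace
    return text.strip()
```

For example, `clean_abstract("Landslide\x00  hazard *** assessment!")` yields the normalized string `"Landslide hazard assessment"`.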

Step 2, data annotation: first use the Stanford CoreNLP natural language toolkit to perform word segmentation, part-of-speech tagging, and syntactic analysis on the data, then manually annotate the landslide hazard assessment named entities with the BIO (Beginning-Inside-Outside) tagging scheme.

Step 2.1: annotate the data with the Stanford CoreNLP natural language toolkit through its Python interface: first use sentence splitting to divide the text into sentences, tokenization to divide the text into tokens such as words and punctuation marks, part-of-speech tagging to assign each token a part of speech (noun, verb, adjective, etc.), and finally parsing to build the syntactic parse tree and part-of-speech structure of the text;

Step 2.2: manually annotate the landslide hazard assessment named entities with the BIO (Beginning-Inside-Outside) scheme, where B (Beginning) marks the first token of an entity, I (Inside) marks tokens inside an entity, and O (Outside) marks non-entity tokens. The three types of information contained in the domain corpus (landslide occurrence location, landslide influencing factor, and landslide hazard assessment method) are annotated as the following entity types:

① Landslide hazard assessment method: M

② Landslide occurrence location: L

③ Landslide influencing factor: F

Combined with the BIO scheme, the dataset therefore contains seven labels in total: B-M, I-M, B-F, I-F, B-L, I-L, and O.
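The mapping from entity spans to the seven labels can be illustrated with a small helper. The patent's annotation is manual; this hypothetical function and the example sentence are for exposition only:

```python
def bio_tags(tokens, entities):
    """Turn token-level entity spans into BIO labels.

    entities: list of (start, end, type) half-open token spans,
    with type one of "M" (method), "L" (location), "F" (factor).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # tokens inside the entity
    return tags

tokens = ["Logistic", "regression", "was", "applied", "in", "Sichuan", "Province"]
tags = bio_tags(tokens, [(0, 2, "M"), (5, 7, "L")])
# tags == ["B-M", "I-M", "O", "O", "O", "B-L", "I-L"]
```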

Step 3, divide the annotated data into a training set, a test set, and a validation set in a fixed proportion, and encode the annotated data with pre-trained language models (BERT and its variants ALBERT and RoBERTa) to obtain word vector representations containing contextual semantic information;

Step 3.1: divide the annotated dataset into a training set, a test set, and a validation set in the ratio 7:1:2 for model training;
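The 7:1:2 split can be sketched as follows. The shuffling, the fixed seed, and the function name are assumptions for illustration:

```python
import random

def split_corpus(sentences, seed=42):
    """Shuffle labelled sentences and split them 7:1:2 into
    training, test, and validation sets."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    data = list(sentences)
    rng.shuffle(data)
    n = len(data)
    n_train = int(n * 0.7)
    n_test = int(n * 0.1)
    return (data[:n_train],
            data[n_train:n_train + n_test],
            data[n_train + n_test:])
```

On 100 annotated sentences this yields subsets of 70, 10, and 20 sentences with no overlap.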

Step 3.2: input the data into the BERT model. During text input, BERT stores the text information of the training samples in InputFeatures objects. Based on its two pre-training tasks, the masked language model (MLM) and next sentence prediction (NSP), BERT dynamically generates word vectors x = (x_1, x_2, ..., x_N) according to the context;

Step 3.3: ALBERT is a lightweight BERT variant that is more efficient in memory and computation. In this embodiment, the ALBERT model is used for the natural language processing task: the original text sequence a = (a_1, a_2, ..., a_N), where a_N is the N-th word of the sequence, is input into the ALBERT model, which outputs word vectors y = (y_1, y_2, ..., y_N);

Step 3.4: RoBERTa is an improved version of BERT that refines the training tasks and the data generation procedure. In this embodiment, the original text sequence a = (a_1, a_2, ..., a_N) is input into the RoBERTa model for training, which outputs word vectors z = (z_1, z_2, ..., z_N);

Step 4, input the word vectors x, y, and z obtained in step 3 into a BiLSTM (Bi-directional Long Short-Term Memory) model to capture the long-range semantic dependencies of the input sequence;

The computation inside a BiLSTM neuron proceeds as follows:

Forward pass:

① Forget gate

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

② Input gate

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

③ New memory candidate

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

④ Update current memory

C_t = f_t * C_{t-1} + i_t * C̃_t

⑤ Output gate

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)

⑥ Update current hidden state

h_t = o_t * tanh(C_t)

Backward pass:

① Forget gate

f_t′ = σ(W_f′ · [h_{t-1}′, x_t′] + b_f′)

② Input gate

i_t′ = σ(W_i′ · [h_{t-1}′, x_t′] + b_i′)

③ New memory candidate

C̃_t′ = tanh(W_C′ · [h_{t-1}′, x_t′] + b_C′)

④ Update current memory

C_t′ = f_t′ * C_{t-1}′ + i_t′ * C̃_t′

⑤ Output gate

o_t′ = σ(W_o′ · [h_{t-1}′, x_t′] + b_o′)

⑥ Update current hidden state

h_t′ = o_t′ * tanh(C_t′)

Here x_t, f_t, i_t, C̃_t, C_t, and o_t denote, respectively, the input, forget gate, input (memory) gate, temporary (candidate) cell state, cell state, and output gate at time t; W are the model weight matrices and b the bias vectors; σ is the sigmoid function and tanh the hyperbolic tangent.

LSTM is a forward-propagating network, whereas BiLSTM consists of LSTM networks running in both directions. BiLSTM can make full use of the information before and after each position and better capture the semantic relations between the preceding and following text. In contrast, in a plain LSTM the state information is always passed from front to back, so backward information cannot be encoded.

In the BiLSTM layer, the word vector sequences x, y, and z are each feature-learned from the context, and the resulting forward feature vectors h_x, h_y, h_z are integrated with the backward feature vectors h_x′, h_y′, h_z′.
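The gate equations above can be checked with a toy single LSTM step in plain Python. This is an illustrative sketch with scalar input and state (real implementations use vector states and a deep learning framework); the parameter layout is an assumption:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step with scalar input/state, following the gate
    equations above. W and b hold parameters keyed by gate: f, i, c, o."""
    concat = [h_prev, x_t]                    # [h_{t-1}, x_t]
    def affine(k):
        return sum(w * v for w, v in zip(W[k], concat)) + b[k]
    f_t = sigmoid(affine("f"))                # forget gate
    i_t = sigmoid(affine("i"))                # input gate
    c_tilde = math.tanh(affine("c"))          # new memory candidate
    c_t = f_t * c_prev + i_t * c_tilde        # update current memory
    o_t = sigmoid(affine("o"))                # output gate
    h_t = o_t * math.tanh(c_t)                # update current hidden state
    return h_t, c_t
```

A backward LSTM applies the same step to the reversed sequence with its own primed parameters; BiLSTM concatenates the two hidden states per position.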

Step 5, taking into account the relationships and constraints between labels within the sequence, decode the label path of the entire input sequence with the CRF decoding layer and compute the loss function;

A CRF is a statistical model for labeling sequence data. In the NER task it models the entire label sequence, taking the interdependencies between entity labels into account, and further improves NER performance by modeling the probability of label sequences. Specifically, a CRF has two kinds of feature functions: state features and transition features. A state feature is expressed by the state score of the current node (each possible state at an output position is called a node), and a transition feature by the transition score from the previous node to the current node. The loss function is defined as follows:

LossFunction = P_RealPath / (P_1 + P_2 + P_3 + ... + P_N)

Computing the CRF loss requires the score of the real path (state scores plus transition scores) and the scores of all other possible paths (state scores plus transition scores).
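The formula can be illustrated on a toy sequence by enumerating every path, scoring each as state scores plus transition scores, and taking the real path's share of the total with P = exp(score). Exhaustive enumeration is for exposition only; a real CRF layer computes the denominator with the forward algorithm. The data layout (dicts of scores) is an assumption:

```python
import itertools
import math

def path_score(emissions, transitions, path):
    """State scores plus transition scores along one tag path.
    emissions: list of {tag: state score} per position;
    transitions: {(prev_tag, tag): transition score}."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[(path[t - 1], path[t])] + emissions[t][path[t]]
    return score

def real_path_probability(emissions, transitions, real_path, tags):
    """P_RealPath / (P_1 + ... + P_N), with P = exp(path score)
    and the sum taken over every possible tag path."""
    total = sum(math.exp(path_score(emissions, transitions, p))
                for p in itertools.product(tags, repeat=len(emissions)))
    return math.exp(path_score(emissions, transitions, real_path)) / total
```

By construction the probabilities of all possible paths sum to 1, so maximizing the real path's share drives the model toward the gold label sequence.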

Step 6, fine-tune the BERT-based multi-neural network on the target-domain corpus to obtain the best named entity recognition result.

Model integration: as shown in FIG. 3, BERT, ALBERT, and RoBERTa are combined with BiLSTM and CRF to form an end-to-end model. First, BERT and its variant models fully extract the contextual semantic features of the annotated landslide hazard assessment text to obtain dynamic word vectors. The forward LSTM layer of the BiLSTM feature extraction layer then captures the forward text information of the current word, and the backward LSTM layer captures the backward semantic information, while the CRF layer normalizes and checks the probability scores produced by the BiLSTM network against the label transition matrix learned during training. The Viterbi algorithm is then used to compute the globally most probable label sequence, which finally yields the recognition results for the entities in the input text sequence.
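The Viterbi decoding used by the CRF layer at inference time can be sketched as a small dynamic program over explicit score dictionaries. This is a minimal version for exposition (framework CRF layers implement the same recurrence over tensors); the data layout is an assumption:

```python
def viterbi(emissions, transitions, tags):
    """Return the tag sequence maximizing the sum of state scores and
    transition scores, as in CRF decoding.
    emissions: list of {tag: state score} per position;
    transitions: {(prev_tag, tag): transition score}."""
    # best[tag] = (best score of any path ending in tag, that path)
    best = {tag: (emissions[0][tag], [tag]) for tag in tags}
    for t in range(1, len(emissions)):
        new_best = {}
        for tag in tags:
            score, path = max(
                (best[prev][0] + transitions[(prev, tag)] + emissions[t][tag],
                 best[prev][1] + [tag])
                for prev in tags)
            new_best[tag] = (score, path)
        best = new_best
    return max(best.values())[1]
```

With emission scores strongly favoring "B-L" at position 0 and "O" at position 1, the decoder returns the path ["B-L", "O"].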

Model training and evaluation: the model is trained on the annotated landslide hazard assessment training data and tuned on the validation set to avoid overfitting. In the evaluation phase, the model is assessed on the test data using metrics such as precision (P), recall (R), and F1 score.

Here TP is the number of correctly recognized entities (predicted positive, actually positive); FP is the number of recognized but incorrect entities (predicted positive, actually negative); and FN is the number of entities that should have been recognized but were missed (predicted negative, actually positive).
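From these counts the metrics follow as P = TP/(TP+FP), R = TP/(TP+FN), and F1 = 2PR/(P+R); a minimal sketch (function name assumed for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from entity-level counts,
    guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, 8 correct entities with 2 spurious and 2 missed gives P = R = F1 = 0.8.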

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A BERT-based multi-neural-network named entity recognition method for landslide hazard assessment, characterized by comprising the following steps:
Step 1, corpus construction: collecting English literature abstracts containing landslide hazard assessment information, preprocessing them to generate normalized structured data, and constructing a corpus for the landslide hazard assessment field;
Step 2, data annotation: performing word segmentation, part-of-speech tagging, and syntactic analysis on the data with the Stanford CoreNLP natural language toolkit, and manually annotating landslide hazard assessment named entities with the BIO (Beginning-Inside-Outside) tagging scheme, the annotation covering landslide occurrence locations, landslide influencing factors, and landslide hazard assessment methods;
Step 3, dividing the annotated data into a training set, a test set, and a validation set in a fixed proportion, and encoding the annotated data with BERT and its variant pre-trained language models such as ALBERT and RoBERTa to obtain word vector representations containing contextual semantic information;
Step 4, inputting the word vectors obtained in step 3 into a BiLSTM (Bi-directional Long Short-Term Memory) model to capture the long-range semantic dependencies of the input sequence;
Step 5, considering the relationships and constraints among the labels in the sequence, decoding the label path of the whole input sequence based on the CRF decoding layer;
Step 6, fine-tuning the BERT-based multi-neural network on the target-domain corpus to obtain an optimal named entity recognition result.
2. The BERT-based multi-neural-network landslide hazard assessment named entity recognition method of claim 1, wherein in step 1 the corpus is constructed by setting the literature-retrieval topic "landslide hazard assessment" on Google Scholar, crawling with Python, and collecting English literature abstracts in the landslide hazard assessment field, the preprocessing comprising removing repeated characters, whitespace characters, special symbols, and the like from the corpus and outputting high-accuracy text data.
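The preprocessing described in this claim (removing repeated characters, stray whitespace, and special symbols) could look roughly like the following sketch; the specific cleaning rules are assumptions for illustration, since the claim does not define them:

```python
import re

def clean_abstract(text: str) -> str:
    """Rough abstract cleaning: collapse repeated punctuation, drop
    non-text symbols, and normalize whitespace (illustrative rules only)."""
    text = re.sub(r"([.,;:!?])\1+", r"\1", text)            # collapse repeated punctuation
    text = re.sub(r"[^A-Za-z0-9.,;:!?()%\-\s]", " ", text)  # drop special symbols
    text = re.sub(r"\s+", " ", text)                        # normalize whitespace
    return text.strip()

print(clean_abstract("Landslide   hazard §assessment!!  (LHA)…"))
# Landslide hazard assessment! (LHA)
```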
3. The BERT-based multi-neural-network landslide hazard assessment named entity recognition method of claim 1, wherein in step 2 the preprocessed landslide hazard assessment corpus is annotated, the specific annotation process comprising the following steps:
Step 2.1: annotating the data with the Stanford CoreNLP natural language toolkit through its Python interface: first splitting the text into sentences (sentence splitting), then segmenting the text into tokens such as words and punctuation marks (tokenization), assigning each token a corresponding part of speech such as noun, verb, or adjective (part-of-speech tagging), and finally constructing the syntactic parse tree and part-of-speech structure of the text (parsing);
Step 2.2: manually labeling the landslide hazard assessment named entities with the BIO (Beginning-Inside-Outside) scheme, where B (Beginning) marks the first token of an entity, I (Inside) marks a token inside an entity, and O (Outside) marks a non-entity token; the domain corpus is annotated with 3 entity types, namely landslide occurrence location information, landslide influencing factors, and landslide hazard assessment methods.
4. The BERT-based multi-neural-network landslide hazard assessment named entity recognition method of claim 1, wherein step 3 specifically comprises:
Step 3.1: dividing the annotated data set into a training set, a test set, and a validation set in a 7:1:2 ratio;
Step 3.2: training a BERT-based fully supervised pre-trained language model on a laboratory server with the training, test, and validation sets, and outputting the word vectors x = (x1, x2, …, xN);
Step 3.3: training an ALBERT-based fully supervised pre-trained language model on a laboratory server with the training, test, and validation sets, and outputting the word vectors y = (y1, y2, …, yN);
Step 3.4: training a RoBERTa-based fully supervised pre-trained language model on a laboratory server with the training, test, and validation sets, and outputting the word vectors z = (z1, z2, …, zN).
5. The BERT-based multi-neural-network landslide hazard assessment named entity recognition method of claim 1, wherein in step 4 the word vectors x, y, and z obtained in step 3 are each input into the BiLSTM layer, which learns features from the context of the text sequence, outputs the forward and backward information corresponding to each word in the text sequence, and captures the semantic features of the text sequence.
6. The BERT-based multi-neural-network landslide hazard assessment named entity recognition method of claim 1, wherein in step 5 the CRF layer decodes the output of the BiLSTM layer, resolves the dependencies among the labels in the sequence, and obtains the optimal label sequence, so that the output label sequence attains the globally optimal probability.
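CRF decoding of the kind this claim describes amounts to a Viterbi search over label sequences. A toy pure-Python sketch follows; the emission and transition scores are invented for the example, whereas a real CRF layer learns its transition matrix during training:

```python
def viterbi(emissions, transitions, labels):
    """Find the highest-scoring label path given per-token emission scores
    and label-to-label transition scores (toy CRF decoding)."""
    # emissions: list of {label: score} per token;
    # transitions: {(prev_label, label): score}, missing pairs score 0.
    best = {lab: emissions[0][lab] for lab in labels}
    back = []
    for em in emissions[1:]:
        nxt, ptr = {}, {}
        for lab in labels:
            prev, score = max(
                ((p, best[p] + transitions.get((p, lab), 0.0)) for p in labels),
                key=lambda t: t[1])
            nxt[lab] = score + em[lab]
            ptr[lab] = prev
        best, back = nxt, back + [ptr]
    # Backtrack from the best final label.
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

labels = ["B", "I", "O"]
# Transition constraint: "I" may not follow "O" (large penalty).
trans = {("O", "I"): -100.0}
ems = [{"B": 1.0, "I": 0.2, "O": 0.5},
       {"B": 0.1, "I": 0.9, "O": 0.8},
       {"B": 0.1, "I": 0.2, "O": 1.0}]
print(viterbi(ems, trans, labels))  # ['B', 'I', 'O']
```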
7. The BERT-based multi-neural-network landslide hazard assessment named entity recognition method of claim 1, wherein in step 6, based on the named entity recognition result obtained in step 5, the training hyperparameters (such as the learning rate, batch size, and number of training epochs) are tuned on the target-domain data set to maximize the combined performance of three evaluation metrics: precision, recall, and F1 score.
CN202410411395.8A 2024-04-08 2024-04-08 BERT-based landslide hazard assessment named entity identification method of multi-neural network Pending CN118395987A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410411395.8A CN118395987A (en) 2024-04-08 2024-04-08 BERT-based landslide hazard assessment named entity identification method of multi-neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410411395.8A CN118395987A (en) 2024-04-08 2024-04-08 BERT-based landslide hazard assessment named entity identification method of multi-neural network

Publications (1)

Publication Number Publication Date
CN118395987A true CN118395987A (en) 2024-07-26

Family

ID=91984116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410411395.8A Pending CN118395987A (en) 2024-04-08 2024-04-08 BERT-based landslide hazard assessment named entity identification method of multi-neural network

Country Status (1)

Country Link
CN (1) CN118395987A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118863671A (en) * 2024-09-25 2024-10-29 杭州众章数据科技有限公司 A talent evaluation system based on named entity recognition and semantic similarity calculation
CN120705589A (en) * 2025-08-14 2025-09-26 广东工业大学 Slope stability assessment method based on large language model and intelligent prediction model


Similar Documents

Publication Publication Date Title
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN113987104A (en) Ontology guidance-based generating type event extraction method
CN113919366A (en) Semantic matching method and device for power transformer knowledge question answering
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
CN114492441A (en) BilSTM-BiDAF named entity identification method based on machine reading understanding
CN113743099A (en) Self-attention mechanism-based term extraction system, method, medium and terminal
CN112215013A (en) A deep learning-based clone code semantic detection method
CN114330350B (en) Named entity recognition method and device, electronic equipment and storage medium
CN114004231B (en) A Chinese word extraction method, system, electronic device and storage medium
CN113157859A (en) Event detection method based on upper concept information
CN118395987A (en) BERT-based landslide hazard assessment named entity identification method of multi-neural network
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN114611520B (en) A text summary generation method
CN114416925B (en) Sensitive word identification method, device, equipment, storage medium and program product
CN115688784A (en) Chinese named entity recognition method fusing character and word characteristics
Liu et al. Deep bi-directional interaction network for sentence matching
CN119558319B (en) Automatic association processing method based on open source place name data
CN117573869B (en) A method for extracting key elements of network access resources
CN119047485A (en) Semantic feature recognition method, device, equipment and medium based on depth grammar tree
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
CN114881031B (en) Named entity recognition model training method
Lim et al. Real-world sentence boundary detection using multitask learning: A case study on French
Gupta et al. Identification and extraction of multiword expressions from Hindi & Urdu language in natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination