
CN116467930A - A General Modeling Method for Structured Data Based on Transformer - Google Patents

A General Modeling Method for Structured Data Based on Transformer

Info

Publication number
CN116467930A
CN116467930A
Authority
CN
China
Prior art keywords
features
neural network
mlp
layer
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310239904.9A
Other languages
Chinese (zh)
Inventor
郭颖
熊媛媛
李喜武
刁克红
孙广源
梁浩然
梁荣华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Original Assignee
Zhejiang University of Technology ZJUT
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT, Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310239904.9A priority Critical patent/CN116467930A/en
Publication of CN116467930A publication Critical patent/CN116467930A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00 Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a Transformer-based general modeling method for structured data. Irrelevant features are first removed from the raw data; categorical and numerical features are then embedded with different embedding methods; the embedded feature vectors are concatenated and fed into a Transformer+ network (an improved Transformer, formed by adding a Leaky Gate before the original Transformer and an MLP+ network after it) and an MLP+ network; finally, different weights are assigned to the output values of the two modules. The invention is applicable to both binary and multi-class classification problems.

Description

A General Modeling Method for Structured Data Based on Transformer

Technical Field

The invention belongs to the field of structured data processing, and in particular relates to a Transformer-based general modeling method for structured data.

Background Art

Tabular data is the most common form of data; it is ubiquitous in applications such as medical diagnosis from health records, predictive analytics in finance, and network security. Tree-based ensemble methods such as gradient-boosted decision trees (GBDT) are currently the usual choice and work well on tabular data, mainly because they learn continuous numerical features effectively: they automatically select and combine useful numerical features and build decision trees efficiently by computing information gain. However, because categorical features are generally converted to high-dimensional, sparse one-hot encodings, GBDT obtains very little information gain on such data and cannot learn these features effectively.

In recent years, methods built on the Transformer framework have achieved great success in computer vision and natural language processing. In computer vision, the convolution kernel limits the receptive field, so networks often need many stacked layers to attend to the whole feature map; in natural language processing, an RNN or LSTM must accumulate information over several time steps to relate two distant tokens, and the farther apart they are, the less likely the dependency is captured effectively. Self-attention in the Transformer, by contrast, captures global attention information. It also directly improves the parallelism of computation, which is the main reason the Transformer is so widely used.

The multilayer perceptron (MLP) is probably the simplest and most general neural network. MLPs usually learn parametric embeddings to encode categorical features, but because their architectures are shallow and their embeddings context-free, they are not robust to missing or noisy data; most importantly, in most cases MLPs underperform tree-based models.

In summary, learning tabular data effectively while overcoming the above problems is an urgent challenge for applying deep learning to the tabular domain.

Summary of the Invention

To address the shortcomings of existing tree-based ensemble methods in tabular prediction, the invention provides a Transformer-based general modeling method for structured data.

To solve the above technical problems, the invention adopts the following technical solution:

A Transformer-based general modeling method for structured data, comprising the following steps:

(1) Feature processing of the input public dataset: after the raw data is obtained, irrelevant features are removed, categorical features are encoded into a recognizable numeric form, and numerical features are scaled by a standardization operation;

(2) Word embedding of the processed feature vectors: before the encoder of the Transformer+ network, the high-dimensional discrete data of the numerical and categorical features are projected into a low-dimensional, dense d-dimensional space by word embedding;

(3) Feeding the embedding vectors from the previous step into the two branches of the model: the model consists of a Transformer+ branch and an MLP+ branch. The embedded feature vectors of the training data are fed into the Transformer+ network for learning, giving that branch's raw output, and likewise into the MLP+ network for modeling, giving a trained MLP+ network. The Transformer+ and MLP+ networks are fused into a single classification model: the two raw outputs are weighted and summed to form the model's overall output value, which then passes through an activation function to yield the classification model's overall prediction;

(4) Training guided by Focal Loss as the objective function: the classification model is trained on the preprocessed training data, with Focal Loss as the objective function guiding the training process; the best parameters are searched for, yielding a trained classification model;

(5) Accepting other tabular data for prediction: the tabular data to be classified undergoes the same preprocessing and is then fed into the trained classification model for classification prediction.

Further, in step (1), the input feature processing comprises the following steps:

(1-1) Removing useless features: each dataset is inspected using prior knowledge, and useless features are removed;

(1-2) Processing continuous features: continuous features are standardized with a StandardScaler, scaling the numerical features;

(1-3) Processing categorical features: categorical features are encoded into numeric form with a LabelEncoder; one-hot encoding is not performed, to avoid sparse encodings and the computational cost they bring (see the sketch below).
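As an illustration, steps (1-1) to (1-3) can be sketched with scikit-learn as follows; `drop_cols`, `num_cols`, and `cat_cols` are hypothetical placeholders chosen per dataset, not names given by the invention.

```python
# Minimal preprocessing sketch for steps (1-1) to (1-3).
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def preprocess(df: pd.DataFrame, drop_cols, num_cols, cat_cols) -> pd.DataFrame:
    df = df.drop(columns=drop_cols)                              # (1-1) remove useless features
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])  # (1-2) standardize numeric features
    for c in cat_cols:                                           # (1-3) integer-encode categories,
        df[c] = LabelEncoder().fit_transform(df[c])              #       no one-hot encoding
    return df
```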

Further, in step (2), word embedding is a technique that maps feature vectors into a low-dimensional vector space, converting discrete feature vectors into continuous vector representations. Categorical features receive ordinary word-embedding treatment; numerical features each pass through a separate fully connected layer with a ReLU nonlinearity, projecting the 1-dimensional input into the d-dimensional space. The embeddings of the categorical and numerical features are then concatenated along the first dimension, as in the sketch below.
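A minimal PyTorch sketch of this embedding layer, assuming one nn.Embedding per categorical column and one Linear(1, d) plus ReLU per numerical column; the interface and sizes are illustrative assumptions.

```python
# Sketch of the embedding layer E: embeddings for categorical columns,
# per-feature Linear + ReLU for numerical columns, stacked on the feature axis.
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    def __init__(self, cat_cardinalities, n_num, d):
        super().__init__()
        self.cat_emb = nn.ModuleList([nn.Embedding(c, d) for c in cat_cardinalities])
        self.num_emb = nn.ModuleList([nn.Linear(1, d) for _ in range(n_num)])

    def forward(self, x_cat, x_num):
        # x_cat: (B, n_cat) integer codes; x_num: (B, n_num) standardized floats
        cats = [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_emb)]
        nums = [torch.relu(emb(x_num[:, i:i + 1])) for i, emb in enumerate(self.num_emb)]
        return torch.stack(cats + nums, dim=1)   # (B, n_cat + n_num, d)
```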

Further, in step (3), the neural network model comprises the following parts:

(3-1) The Transformer+ network improves on the original Transformer as follows: a Leaky Gate is added before the original Transformer's encoder, and an MLP+ network is added after it. The Leaky Gate is the combination of two simple elements, an element-wise linear transformation followed by a LeakyReLU activation (see the sketch after this list);

(3-2) The MLP+ network improves on the multilayer perceptron (MLP) as follows: starting from the MLP sub-block, ordinary Batch Norm is replaced with Ghost Batch Norm (GBN); a linear skip layer is added beside the sub-block, the skip layer being simply a fully connected linear layer followed by a LeakyReLU activation; finally, a Leaky Gate is added before the MLP sub-block and the linear skip layer. GBN allows training with large batches of data, and a major motivation for using it in the invention is to speed up training.
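The sketch below renders (3-1)'s Leaky Gate and (3-2)'s MLP+ sub-block in PyTorch under stated assumptions: Ghost Batch Norm is implemented naively as Batch Norm over virtual mini-batches, and the main and skip paths are merged by addition, a detail the text leaves open.

```python
# Leaky Gate (element-wise linear transform + LeakyReLU) and one MLP+ sub-block
# with Ghost Batch Norm and a linear skip layer; a sketch, not the exact model.
import torch
import torch.nn as nn

class LeakyGate(nn.Module):
    def __init__(self, *shape):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(*shape))   # the w_i of Eq. (2) below
        self.bias = nn.Parameter(torch.zeros(*shape))    # the b_i of Eq. (2) below
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return self.act(x * self.weight + self.bias)     # element-wise, per column

class GhostBatchNorm(nn.Module):
    """Batch Norm applied over virtual mini-batches (naive assumption)."""
    def __init__(self, dim, virtual_batch_size=128):
        super().__init__()
        self.vbs = virtual_batch_size
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):
        chunks = x.chunk(max(1, x.size(0) // self.vbs), dim=0)
        return torch.cat([self.bn(c) for c in chunks], dim=0)

class MLPPlusBlock(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.gate = LeakyGate(d_in)                      # gate before both paths
        self.main = nn.Sequential(GhostBatchNorm(d_in), nn.Linear(d_in, d_out), nn.LeakyReLU())
        self.skip = nn.Sequential(nn.Linear(d_in, d_out), nn.LeakyReLU())  # linear skip layer

    def forward(self, x):
        x = self.gate(x)
        return self.main(x) + self.skip(x)               # additive merge (assumption)
```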

The invention provides a Transformer-based general modeling method for structured data, characterized in that the Transformer handles categorical and numerical features simultaneously and, while fully preserving the Transformer's performance, is fused with an MLP into a single model rather than having each branch give a class prediction followed by weighted voting. The model can therefore be optimized end to end through a loss function, effectively strengthening its discriminative ability. Compared with the prior art, the invention has the following positive effects:

1. The invention proposes a data processing method that feeds categorical and numerical features into the Transformer together, so no information about correlations between categorical and numerical features is lost.

2. The invention proposes a Transformer-based general modeling method for structured data that effectively fuses the simpler MLP network with the more complex attention-based Transformer network to learn categorical and numerical features.

3. The invention evaluates the proposed model on seven public datasets, including adult, blastchar, and shrutime; the experimental results show that the method outperforms other state-of-the-art methods in binary classification scenarios.

Brief Description of the Drawings

Figure 1 is the overall framework diagram of the method of the invention.

Figure 2 shows the processing flow of the MLP+ network of the invention.

Detailed Description of the Embodiments

The technical solution of the invention is described clearly and completely below with reference to the accompanying drawings of the embodiments of this application. The specific implementations described here serve only to explain the invention, not to limit it.

Embodiment 1

Figure 1 shows the overall architecture. A Transformer-based general modeling method for structured data proceeds as follows:

Step (1): input feature processing.

At the data level, public datasets such as adult, blastchar, and spambase are used. Some datasets contain only numerical features; others contain both numerical and categorical features. The data is split into a training set and a test set. For each dataset, prior knowledge is used to remove some useless features. Since most categorical features are strings, they are encoded as numbers (1, 2, 3, ...) the model can recognize; numerical features are scaled by standardization.

For the raw data (training and test sets), unnecessary features are removed, categorical features are numerically encoded, and numerical features are standardized, yielding the dataset D = {(x_i, y_i), y_i ∈ [0, classnum), i = 1, 2, 3, ..., N}, where x_i is the feature vector of each sample, y_i is the label of x_i, classnum is the number of classes, and N is the number of samples. The features are then separated by type into categorical features x_cat and numerical features x_cont.

Step (2): embedding of categorical and numerical features.

The embedding layer E embeds each feature into a d-dimensional space. To handle tabular data effectively, the invention treats discrete categorical features and continuous numerical features differently: a new embedded representation of the categorical features is obtained by word embedding, and a new embedded representation of the numerical features by a fully connected layer. A single sample with categorical or numerical features is x_i = [f_i^(1), f_i^(2), ..., f_i^(n)]; the embedding layer e uses a different embedding function for each type of feature, mapping a given f_i^(j) to e_φj(f_i^(j)). The results are then spliced along the feature dimension, and E_Φ(x) is the embedded representation of all features.

E_Φ(x) = {e_Φ1(x_1), ..., e_ΦN(x_N)}    (1)

Step (3): feed the embedded feature vectors into the model.

(3-1) The feature vector from the previous step first enters the Leaky Gate. The Leaky Gate combines two simple elements: an element-wise linear transformation followed by a LeakyReLU activation. LeakyReLU lets any positive value pass unchanged and compresses any negative value to almost zero. In other words, if w_i and b_i are the linear-layer parameters of the i-th column, the Leaky Gate of the i-th column is:

LeakyGate_i(x_i) = LeakyReLU(w_i x_i + b_i)    (2)

The Leaky Gate is intended to act as a simple filter with a different behavior for each column, where blocking or passing depends on each individual value.

The first Transformer layer takes the output of the Leaky Gate as input and passes its output to the second Transformer layer, and so on. As shown in Figure 1, the output of the last Transformer layer is fed directly into the MLP+ network (the improved MLP, shown in Figure 2), yielding the branch's output value y_Transformer+. Here θ_1, θ_2, and θ_3 are the model parameters of the Leaky Gate, the Transformer, and the MLP+ network, respectively.

y_Transformer+(x) = M(f_transformer(G_Θ(E_Φ(x); θ_1); θ_2); θ_3)    (3)
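Under the same assumptions, the Transformer+ branch of Eq. (3) can be sketched as below, reusing LeakyGate and MLPPlusBlock from the earlier sketch; the depth, width, head count, and the flattening before the MLP+ head are illustrative choices, not values given by the invention.

```python
# Sketch of Eq. (3): Leaky Gate -> stacked Transformer encoder layers -> MLP+ head.
import torch.nn as nn

class TransformerPlus(nn.Module):
    def __init__(self, n_tokens, d=32, n_heads=4, n_layers=3, n_out=1):
        super().__init__()
        self.gate = LeakyGate(n_tokens, d)   # G_Theta, element-wise per token
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_transformer
        self.head = MLPPlusBlock(n_tokens * d, n_out)                     # M, the MLP+ head

    def forward(self, tokens):               # tokens: (B, n_tokens, d) from FeatureEmbedding
        h = self.encoder(self.gate(tokens))
        return self.head(h.flatten(1))       # flatten tokens before the MLP+ head
```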

(3-2) Likewise, the feature vector from the previous step enters the MLP+ network (the right branch in Figure 1), yielding the output value y_mlp+:

y_mlp+ = M(E_Φ(X); θ_1)    (4)

Step (4): fuse the left and right branches.

Specifically, to combine the improved Transformer and the improved MLP into a prediction of the whole model and perform end-to-end training, the invention assigns different weights w_1 and w_2 to the output values of the two modules (both weights are learned through backpropagation). The predicted probability ŷ of the final model output is given by Eq. (5), where σ is the activation function (sigmoid for binary classification, softmax for multi-class):

ŷ = σ(w_1 · y_Transformer+ + w_2 · y_mlp+)    (5)
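A minimal sketch of the fusion in Eq. (5) for the binary case, with w_1 and w_2 as learnable scalars and sigmoid as the activation; the initial values of 0.5 are an assumption.

```python
# Eq. (5): weighted sum of the two branch outputs followed by the activation.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.tensor(0.5))   # weight of the Transformer+ branch
        self.w2 = nn.Parameter(torch.tensor(0.5))   # weight of the MLP+ branch

    def forward(self, y_transformer, y_mlp):
        return torch.sigmoid(self.w1 * y_transformer + self.w2 * y_mlp)
```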

Step (5): train the classification model with Focal Loss.

The preprocessed data is used for model training, with Focal Loss as the loss function guiding the training process. Focal Loss makes the model pay more attention to hard-to-classify minority-class samples, mitigating the bias caused by the majority class.

According to Eq. (5), the loss of the model can be written as Eq. (6), where ℒ denotes the loss function and y is the true label of sample x:

Loss = ℒ(ŷ, y)    (6)

To cope with class imbalance, the invention adopts a cost-sensitive approach and introduces Focal Loss as the model's loss function. Focal Loss was originally proposed to address class imbalance in object detection and is an improvement on the traditional cross-entropy loss; the invention brings it into tabular classification. For binary classification, Focal Loss takes the form of Eq. (7), where ŷ_i is the probability prediction defined in Eq. (5), y_i is the label of the input sample, α is the balance factor, and γ ≥ 0 is the focusing parameter:

FL = -α y_i (1 - ŷ_i)^γ log(ŷ_i) - (1 - α)(1 - y_i) ŷ_i^γ log(1 - ŷ_i)    (7)

For multi-class problems, a one-vs-rest view extends Eq. (7) to Eq. (8), which can be written in the standard one-hot form, where y is the one-hot encoding of the class label and ŷ is a probability output of shape (m, n), with m samples and n classes:

FL = -(1/m) Σ_{i=1..m} Σ_{c=1..n} α y_{i,c} (1 - ŷ_{i,c})^γ log(ŷ_{i,c})    (8)
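The binary Focal Loss of Eq. (7) can be sketched as below; eps is a numerical-stability guard added here, not part of the formula, and alpha = 0.25, gamma = 2.0 are the common defaults from the object-detection literature rather than values specified by the invention. The multi-class form of Eq. (8) follows by applying the same expression one-vs-rest over the one-hot labels.

```python
# Binary focal loss, Eq. (7): alpha balances classes, gamma focuses on hard samples.
import torch

def focal_loss(y_hat, y, alpha=0.25, gamma=2.0, eps=1e-8):
    # y_hat: predicted probabilities in (0, 1); y: 0/1 labels of the same shape
    pos = -alpha * y * (1 - y_hat).pow(gamma) * torch.log(y_hat + eps)
    neg = -(1 - alpha) * (1 - y) * y_hat.pow(gamma) * torch.log(1 - y_hat + eps)
    return (pos + neg).mean()
```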

Based on the loss functions defined by Eqs. (7) and (8), the model can be trained end to end: gradient descent is applied, and the model with the smallest loss is kept.
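An end-to-end training sketch under stated assumptions: Adam stands in for the unspecified gradient-descent optimizer, the fused model is assumed to expose model(x_cat, x_num), and focal_loss is the function from the previous sketch; the lowest-loss parameters are retained as the text prescribes.

```python
# End-to-end training: minimize the focal loss, keep the best checkpoint.
import copy
import torch

def train(model, loader, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        for x_cat, x_num, y in loader:
            y_hat = model(x_cat, x_num)          # fused probability of Eq. (5)
            loss = focal_loss(y_hat, y)          # Eq. (7)
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < best_loss:          # keep the smallest-loss model
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state
```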

Embodiment 2

This embodiment applies the Transformer-based general modeling method for structured data of the invention to product recommendation.

Figure 1 shows the overall architecture. The specific steps of the method are as follows:

Step (1): input feature processing.

In a recommendation-system application scenario, taking a product recommendation system as an example, the Transformer-based general modeling method for structured data classifies users according to their behavior so that the system can recommend the corresponding type of product. At the data level, the public online_shoppers dataset is used, which contains both numerical and categorical features; the data is split into a training set and a test set. For this dataset, prior knowledge is used to remove some useless features. Since most categorical features are strings, they are encoded as numbers (1, 2, 3, ...) the model can recognize; numerical features are scaled by standardization.

The remainder of step (1) and steps (2) through (5) (embedding the categorical and numerical features, feeding the embedded vectors into the two branches, fusing the left and right branches, and training the classification model with Focal Loss) proceed exactly as in Embodiment 1.

Step (6): feed user features into the model to produce product recommendations.

When the product recommendation system captures user behavior, or adds to or modifies behavior already on record, it feeds the newly constructed user behavior into the model, obtains a new classification result, and recommends the corresponding products.

The numerical and categorical feature embedding module collects the embedded feature vectors of new user behaviors for subsequent model input.

The module that feeds the embedded feature vectors into the model is used to adjust the model's parameters with the new feature vectors.

The Focal Loss classification-model training module is used to train the new model after the parameters change.

Those of ordinary skill in the art will understand that the above are only preferred examples of the invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art may still modify the technical solutions described in them or substitute equivalents for some of their technical features. Any modification, equivalent substitution, and the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (7)

1. A Transformer-based general modeling method for structured data, characterized by comprising the following steps:
(1) Feature processing of the input public dataset: after the raw data is obtained, irrelevant features are removed, categorical features in the data are encoded into a recognizable numeric form, and numerical features are scaled by a standardization operation;
(2) Word embedding of the feature vectors after feature processing: before the encoder of the Transformer+ network, the high-dimensional discrete data of the numerical and categorical features are projected into a low-dimensional, dense d-dimensional space by word embedding;
(3) Feeding the word-embedding vectors obtained in step (2) into two branches of a model: the model is divided into a Transformer+ branch and an MLP+ branch; the embedded feature vectors of the training data are fed into the Transformer+ network for learning, giving that branch's raw output, and likewise into the MLP+ network for modeling, giving a trained MLP+ network; the Transformer+ and MLP+ networks are fused into a classification model, so the two raw outputs are weighted and summed to form the model's overall output value, which then passes through an activation function to yield the classification model's overall prediction;
(4) Training guided by Focal Loss as the objective function: training the classification model with the preprocessed training data, guiding the training process with Focal Loss as the objective function, and searching for the optimal parameters to obtain a trained classification model;
(5) Receiving other tabular data for prediction: preprocessing the tabular data to be classified as above and feeding it into the trained classification model for classification prediction.
2. The method of claim 1, wherein the input feature processing of step (1) comprises the following steps:
(1-1) removing useless features: inspecting each dataset using prior knowledge and removing useless features;
(1-2) processing continuous features: standardizing the continuous features with a standard scaler, scaling the numerical features;
(1-3) processing categorical features: encoding the categorical features into numeric form with a LabelEncoder; one-hot encoding is not performed, to avoid sparse encodings and the increased computational cost they cause.
3. The method of claim 1, wherein the word embedding of step (2) is a technique that maps feature vectors to low-dimensional space vectors, wherein discrete feature vectors are converted into continuous vector representations, wherein ordinary word-embedding processing is applied to the categorical features, wherein a separate fully connected layer is used for the numerical features, each numerical feature having a ReLU nonlinearity, thereby projecting the 1-dimensional input into the d-dimensional space, and wherein the embeddings of the categorical and numerical features are concatenated in the first dimension.
4. The method of claim 3, wherein concatenating the embeddings of the categorical and numerical features in the first dimension specifically comprises: the embedding layer E embeds each feature into the d-dimensional space and, to process tabular data effectively, treats discrete categorical features and continuous numerical features differently; a new embedded representation of the categorical features is obtained by word embedding and a new embedded representation of the numerical features by a fully connected layer; x_i = [f_i^(1), f_i^(2), ..., f_i^(n)] is a single sample with categorical or numerical features; the embedding layer e uses a different embedding function for each type of feature, mapping a given f_i^(j) to e_φj(f_i^(j)); splicing then occurs in the feature dimension, and E_Φ(x) is the result of all features being represented by the embedding:
E_Φ(x) = {e_Φ1(x_1), ..., e_ΦN(x_N)}    (1)
5. The method of claim 1, wherein the model of step (3) comprises the following parts:
(3-1) the Transformer+ network improves on the original Transformer by adding a Leaky Gate before the Transformer's encoder and an MLP+ network after it, the Leaky Gate being a combination of two simple elements, namely an element-wise linear transformation and a LeakyReLU activation function;
(3-2) the MLP+ network improves on the multilayer perceptron MLP by starting from a sub-block of the MLP, replacing ordinary Batch Norm with Ghost Batch Norm (GBN), adding a linear skip layer beside the sub-block, the skip layer being just a fully connected linear layer followed by a LeakyReLU activation function, and finally adding a Leaky Gate before the MLP sub-block and the linear skip layer.
6. The method of claim 5, wherein steps (3-1) and (3-2) specifically comprise: (3-1) the feature vector output in the previous step first enters a Leaky Gate, which is a combination of two simple elements, an element-wise linear transformation followed by a LeakyReLU activation function that lets any positive value pass unchanged and compresses any negative value to almost zero; in other words, if w_i and b_i are the linear-layer parameters of the i-th column, the Leaky Gate of the i-th column is:

LeakyGate_i(x_i) = LeakyReLU(w_i x_i + b_i)    (2)
the Leaky Gate is intended to act as a simple filter with a different behavior for each column, blocking or passing depending on each individual value;
the first Transformer layer takes the output of the Leaky Gate as input and passes its output to the second Transformer layer, and so on; as shown in FIG. 1, the output of the last Transformer layer is fed directly into the MLP+ network (the improved multilayer perceptron MLP, shown in FIG. 2) to obtain the model output value y_Transformer+, where θ_1, θ_2, and θ_3 are the model parameters of the Leaky Gate, the Transformer, and the MLP+ network, respectively:

y_Transformer+(x) = M(f_transformer(G_Θ(E_Φ(x); θ_1); θ_2); θ_3)    (3)
(3-2) likewise, the feature vector output in the previous step enters the MLP+ network (the right branch of FIG. 1) to obtain the model output value y_mlp+:

y_mlp+ = M(E_Φ(X); θ_1)    (4)
7. The method of claim 1, wherein step (4) specifically comprises: to combine the improved Transformer and the improved multilayer perceptron MLP to obtain predictions of the whole model and perform end-to-end training, the output values of the two modules are assigned different weights w_1 and w_2 (the two weights can be obtained by backpropagation training), and the predicted probability ŷ of the final model output is as in Eq. (5), where σ is an activation function, sigmoid for binary classification and softmax for multi-class classification:

ŷ = σ(w_1 · y_Transformer+ + w_2 · y_mlp+)    (5)
CN202310239904.9A 2023-03-07 2023-03-07 A General Modeling Method for Structured Data Based on Transformer Pending CN116467930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310239904.9A CN116467930A (en) 2023-03-07 2023-03-07 A General Modeling Method for Structured Data Based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310239904.9A CN116467930A (en) 2023-03-07 2023-03-07 A General Modeling Method for Structured Data Based on Transformer

Publications (1)

Publication Number Publication Date
CN116467930A true CN116467930A (en) 2023-07-21

Family

ID=87183209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310239904.9A Pending CN116467930A (en) 2023-03-07 2023-03-07 A General Modeling Method for Structured Data Based on Transformer

Country Status (1)

Country Link
CN (1) CN116467930A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663516A (en) * 2023-07-28 2023-08-29 深圳须弥云图空间科技有限公司 Table machine learning model training method and device, electronic equipment and storage medium
CN116663516B (en) * 2023-07-28 2024-02-20 深圳须弥云图空间科技有限公司 Table machine learning model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112084331B (en) Text processing and model training method and device, computer equipment and storage medium
CN111581396B (en) Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
Lopes et al. An AutoML-based approach to multimodal image sentiment analysis
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
CN117151222B (en) Domain knowledge-guided emergency case entity attribute and relationship extraction method, electronic device and storage medium
CN117391051B (en) A multi-modal fake news detection method based on joint attention network integrating emotion
CN111461175B (en) Label recommendation model construction method and device based on self-attention and collaborative attention mechanism
CN114818703B (en) Multi-intention recognition method and system based on BERT language model and TextCNN model
CN112163092A (en) Entity and relation extraction method, system, device and medium
CN118964648B (en) Enhanced retrieval method and system for power vector knowledge base based on artificial intelligence
CN113920379B (en) Zero sample image classification method based on knowledge assistance
CN115810351A (en) A controller voice recognition method and device based on audio-visual fusion
CN114913546A (en) Method and system for detecting character interaction relationship
CN116467930A (en) A General Modeling Method for Structured Data Based on Transformer
CN118916714B (en) Code similarity detection method, equipment and medium based on graph neural network
CN118733777A (en) A text classification method based on event labels
CN116778233B (en) Incomplete depth multi-view semi-supervised classification method based on graph neural network
CN112925983A (en) Recommendation method and system for power grid information
CN112950414A (en) Legal text representation method based on decoupling legal elements
CN117113270A (en) Knowledge fusion multi-mode interaction method and device based on improved alignment method
CN116522942A (en) Chinese nested named entity recognition method based on character pairs
CN116303969A (en) Visual question-answering method based on knowledge graph
CN116719900A (en) Event causal relationship identification method based on hypergraph modeling document-level causal structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination