Computer Science (计算机科学) ›› 2020, Vol. 47 ›› Issue (8): 255-260. doi: 10.11896/jsjkx.191000163
CHENG Jing1, 2, LIU Na-na1, 2, MIN Ke-rui3, KANG Yu4, WANG Xin1, 2, ZHOU Yang-fan1, 2
Abstract: Many natural language processing (NLP) tasks benefit from word embeddings trained on large-scale corpora. Because pre-trained embeddings capture the general semantics of a large corpus, applying them to a specific downstream task usually requires fine-tuning so that they better fit the target task. However, low-frequency words in the target corpus lack training samples, so they cannot obtain stable gradient information during fine-tuning and their embeddings are not effectively updated. In short text classification, these low-frequency words can nevertheless be strongly indicative of the class label, so obtaining better low-frequency word representations for a given short text classification task is necessary. To address this problem, this paper proposes a low-frequency word embedding update algorithm that is independent of the downstream task model. Using a K-nearest-neighbor-based embedding offset computation, it exploits the task-specific information acquired by high-frequency words that are similar to a given low-frequency word in the general embedding space to guide the update of that low-frequency word, yielding a more accurate representation suited to the current task context. With TextCNN as the base model and two general pre-trained embeddings obtained from word2vec and GloVe, the algorithm is evaluated on three public short text datasets. Experimental results show that after updating low-frequency word representations with the proposed algorithm, classification accuracy reaches 84.3%-94%, an improvement of 0.4%-1.4% over the un-updated embeddings. This demonstrates the effectiveness of the algorithm, further confirms the influence of low-frequency words on short text classification results, and provides a reference for research on short text classification.
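The K-nearest-neighbor offset update described in the abstract can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: embeddings are plain NumPy matrices, neighbors are found by cosine similarity in the pre-trained space, and the frequency threshold `min_freq` and neighbor count `k` are hypothetical parameters.

```python
import numpy as np

def update_low_freq_embeddings(pretrained, finetuned, freqs, k=5, min_freq=5):
    """Update low-frequency word vectors with the mean fine-tuning offset
    of their K nearest high-frequency neighbors in the pre-trained space.

    pretrained, finetuned: (V, d) embedding matrices before/after fine-tuning.
    freqs: (V,) word frequencies in the target corpus.
    Returns a copy of `finetuned` with low-frequency rows replaced.
    """
    # Unit-normalize pre-trained vectors so dot products are cosine similarities.
    norms = np.linalg.norm(pretrained, axis=1, keepdims=True)
    unit = pretrained / np.maximum(norms, 1e-12)

    high = np.where(freqs >= min_freq)[0]   # words with reliable gradient updates
    low = np.where(freqs < min_freq)[0]     # words lacking training signal
    # Task-specific shift each high-frequency word received during fine-tuning.
    offsets = finetuned[high] - pretrained[high]

    out = finetuned.copy()
    for i in low:
        sims = unit[high] @ unit[i]          # cosine similarity to high-freq words
        nn = np.argsort(-sims)[:k]           # indices (within `high`) of K nearest
        # Transfer the neighbors' average offset to the low-frequency word.
        out[i] = pretrained[i] + offsets[nn].mean(axis=0)
    return out
```

Because the update only reads the embedding matrices, it is agnostic to the downstream classifier (here it would be applied after fine-tuning TextCNN, then the model re-evaluated with the corrected embeddings).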