
CN111930792B - Labeling method and device for data resources, storage medium and electronic equipment - Google Patents

Labeling method and device for data resources, storage medium and electronic equipment

Info

Publication number
CN111930792B
Authority
CN
China
Prior art keywords
target
knowledge point
word
vocabulary
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010580828.4A
Other languages
Chinese (zh)
Other versions
CN111930792A (en)
Inventor
胡科
包英泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Yudi Technology Co ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd
Priority to CN202010580828.4A
Publication of CN111930792A
Application granted
Publication of CN111930792B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a method and a device for labeling data resources, a storage medium and electronic equipment, and belongs to the technical field of computers. The method comprises the following steps: the server pre-processes the original data resources to obtain text data, calculates the similarity between the text data and a plurality of target knowledge points to obtain a similarity value, generates a basic knowledge point label set of the original data resources according to a comparison result of the similarity value and a similarity threshold value, and generates a comprehensive knowledge point label set of the original data resources according to characteristic information of the original data resources and the basic knowledge point label set, so that knowledge point labels related to the original data resources can be marked quickly and accurately, and marking efficiency and marking accuracy are improved.

Description

Data resource labeling method, device, storage medium and electronic device

Technical Field

The present application relates to the field of computer technology, and in particular to a data resource labeling method, device, storage medium and electronic device.

Background

With the development of the Internet, data plays an increasingly important role across industries: retail, transportation, social networking, search, education, medical care and others all involve large-scale data analysis. Take online education as an example. Staff usually need to analyze users' teaching data to understand their teaching and learning situations, so that better services can be provided later. Analyzing a user's learning data requires obtaining the knowledge point labels associated with the data resources the user has studied, and similar scenarios are common in other fields. In the related art, however, the knowledge point labels associated with data resources usually have to be labeled manually in advance. This is inefficient and subject to the labeler's subjective judgment, so data resources cannot be labeled with knowledge point labels accurately.

Summary of the Invention

The embodiments of the present application provide a data resource labeling method, device, storage medium and electronic device, which can solve the problem in the related art that labeling data resources with knowledge point labels is inaccurate and inefficient.

The technical solution is as follows:

In a first aspect, an embodiment of the present application provides a data resource labeling method, the method comprising:

preprocessing an original data resource to obtain text data;

calculating the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, wherein each of the plurality of target knowledge points is associated with a basic knowledge point label;

generating a basic knowledge point label set of the original data resource according to the result of comparing the similarity values with a similarity threshold, wherein the basic knowledge point labels included in the set are those associated with target knowledge points whose similarity values are greater than the similarity threshold; and

generating a comprehensive knowledge point label set of the original data resource according to feature information of the original data resource and the basic knowledge point label set.

In a second aspect, an embodiment of the present application provides a data resource labeling device, the device comprising:

a preprocessing module, configured to preprocess an original data resource to obtain text data;

a calculation module, configured to calculate the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, wherein each of the plurality of target knowledge points is associated with a basic knowledge point label;

a first processing module, configured to generate a basic knowledge point label set of the original data resource according to the result of comparing the similarity values with a similarity threshold, wherein the basic knowledge point labels included in the set are those associated with target knowledge points whose similarity values are greater than the similarity threshold; and

a second processing module, configured to generate a comprehensive knowledge point label set of the original data resource according to feature information of the original data resource and the basic knowledge point label set.

In a third aspect, an embodiment of the present application provides a computer storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the above method steps.

In a fourth aspect, an embodiment of the present application provides an electronic device, which may include a processor and a memory, wherein the memory stores a computer program suitable for being loaded by the processor to execute the above method steps.

The beneficial effects of the technical solutions provided by some embodiments of the present application include at least the following:

When the solution of the embodiments of the present application is executed, the server preprocesses the original data resource to obtain text data, calculates the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, generates a basic knowledge point label set of the original data resource according to the result of comparing the similarity values with a similarity threshold, and generates a comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set. The original data resource can thus be labeled quickly and accurately with the knowledge point labels related to it, improving both labeling efficiency and labeling accuracy.

Brief Description of the Drawings

To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a system architecture diagram provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a data resource labeling method provided by an embodiment of the present application;

FIG. 3 is another schematic flowchart of a data resource labeling method provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of the similarity calculation in a data resource labeling method provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a device provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a device provided by an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of an exemplary system architecture 100 to which the data resource labeling method or data resource labeling device of the embodiments of the present application can be applied.

As shown in FIG. 1, the system architecture 100 may include one or more of terminal devices 101, 102 and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as video recording applications, video playback applications, voice interaction applications, search applications, instant messaging tools, email clients and social platform software. The network 104 may include various connection types, such as wired links, wireless communication links or optical fiber cables.

Users can use the terminal devices 101, 102 and 103 to interact with the server 105 through the network 104 to receive or send messages. The terminal devices 101, 102 and 103 may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, portable computers and desktop computers. The network 104 may include various types of wired or wireless communication links; for example, wired communication links include optical fiber, twisted pair or coaxial cable, and wireless communication links include Bluetooth, Wireless Fidelity (Wi-Fi) or microwave communication links. The terminal devices 101, 102 and 103 may be hardware or software. When they are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here. When the terminal devices 101, 102 and 103 are hardware, a display device and a camera may also be installed on them. The display device may be any device capable of display, and the camera is used to capture video streams. For example, the display device may be a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD) or a plasma display panel (PDP). Users can use the display device on the terminal devices 101, 102 and 103 to view displayed text, pictures, videos and other information.

It should be noted that the data resource labeling method provided by the embodiments of the present application is usually executed by the server 105, and accordingly the data resource labeling device is usually arranged in the server 105. The server 105 may be a server that provides various services, and may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here.

The server 105 in the present application may be a terminal device that provides various services. For example, the server preprocesses the original data resource to obtain text data, calculates the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values, generates a basic knowledge point label set of the original data resource according to the result of comparing the similarity values with a similarity threshold, and generates a comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.

It should also be noted that the data resource labeling method provided by the embodiments of the present application may be executed by one or more of the terminal devices 101, 102 and 103 and/or by the server 105. Accordingly, the data resource labeling device provided by the embodiments of the present application is generally arranged in the corresponding terminal device and/or in the server 105, but the present application is not limited to this.

It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are only illustrative. There may be any number of terminal devices, networks and servers according to implementation requirements.

The data resource labeling method provided by the embodiments of the present application is described in detail below with reference to FIG. 2 to FIG. 4. For convenience of description, the embodiments take the online education industry as an example, but those skilled in the art will understand that the present application is not limited to online education and that the data resource labeling method described here can be applied effectively across Internet industries.

Referring to FIG. 2, a schematic flowchart of a data resource labeling method is provided for an embodiment of the present application. As shown in FIG. 2, the method of the embodiment may include the following steps:

S201: Preprocess the original data resource to obtain text data.

Here, the original data resource is a data resource in the form of text, audio, video or the like, and may include exercises, picture books, learning audio, learning video and similar resources. The original data resource carries its corresponding learning level and subject. Text data is the data obtained after original data resources of text, audio, video and other types are uniformly converted into a text type.

Generally, when the original data resource is of an audio or video type, it can be converted into text data of a preset text type by ASR (Automatic Speech Recognition) technology. ASR is a technology that converts audio into text based on a keyword list: the audio content (or the audio track of a video) is converted into speech features through its spectrum, the speech features are matched against the entries in the keyword list, and the best match is taken as the recognition result. When the original data resource is of a text type but not the preset text type, it needs to be converted into the preset text type. Common text types include .txt, .doc, .hlp, .wps, .rtf, .htm and .pdf; the preset text type can be set according to actual needs.
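The routing in S201 can be illustrated with a minimal Python sketch. It assumes the route is chosen by file extension and that the ASR engine and the text converter are supplied by the caller; the extension groups, the `preprocess` function and the stand-in backends are hypothetical names introduced here for illustration, not part of the application.

```python
from pathlib import Path
from typing import Callable

# Hypothetical extension groups; the text above only names example text types.
TEXT_EXTENSIONS = {".txt", ".htm", ".rtf"}
MEDIA_EXTENSIONS = {".mp3", ".wav", ".mp4"}


def preprocess(path: str,
               read_text: Callable[[str], str],
               transcribe: Callable[[str], str]) -> str:
    """Return text data for a raw resource, choosing the route by file type."""
    suffix = Path(path).suffix.lower()
    if suffix in MEDIA_EXTENSIONS:
        # Audio or video: delegate to an ASR backend supplied by the caller.
        return transcribe(path)
    if suffix in TEXT_EXTENSIONS:
        # Text in some other format: convert it to the preset text type.
        return read_text(path)
    raise ValueError(f"unsupported resource type: {suffix!r}")


# Stand-in backends so the sketch runs without an ASR engine or real files.
def fake_asr(path: str) -> str:
    return f"transcript of {path}"


def fake_reader(path: str) -> str:
    return f"text of {path}"


lesson_text = preprocess("lesson1.mp4", read_text=fake_reader, transcribe=fake_asr)
story_text = preprocess("story.TXT", read_text=fake_reader, transcribe=fake_asr)
```

Passing the backends in as callables keeps the sketch independent of any particular speech recognition library.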

S202: Calculate the similarity between the text data and each of a plurality of target knowledge points to obtain similarity values.

Here, a target knowledge point includes at least one of target content vocabulary, target high-frequency vocabulary, target verb vocabulary, target mathematical vocabulary, a target phonetic symbol, a target sentence pattern and target grammar, and is a knowledge point that can be obtained from a knowledge graph. Different learning levels correspond to different target knowledge points, and each of the target knowledge points is associated with a basic knowledge point label. A similarity value expresses how similar the two compared quantities are; generally, the larger the similarity value, the more similar the two quantities.

Generally, before the similarity between the text data and the target knowledge points is calculated, the course information corresponding to the original data resource is obtained, and a preset knowledge graph is queried with that course information to obtain the corresponding target knowledge points. The knowledge graph contains target knowledge points for different learning levels and different grades. Calculating the similarity between the text data and the target knowledge points means calculating similarity values between the basic knowledge points in the text data and the target knowledge points; these values are then used to determine the basic knowledge point labels corresponding to the text data, that is, to the original data resource. The basic knowledge points in the text data may include one or more of reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, reference mathematical vocabulary, reference phonetic symbols, reference sentence patterns and reference grammar.
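As an illustration of the lookup described above, the following Python sketch models the knowledge graph as an in-memory dictionary keyed by subject and learning level. The data layout, the `TargetKnowledgePoint` class and the sample labels are assumptions made for this example; the application does not specify how the knowledge graph is stored or queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetKnowledgePoint:
    kind: str   # e.g. "content_vocabulary", "verb_vocabulary", "sentence_pattern"
    value: str  # the word, phonetic symbol or pattern itself
    label: str  # the basic knowledge point label associated with this point


# Hypothetical in-memory knowledge graph keyed by (subject, learning level).
KNOWLEDGE_GRAPH = {
    ("english", 1): [
        TargetKnowledgePoint("content_vocabulary", "apple", "L1-vocab-fruit"),
        TargetKnowledgePoint("verb_vocabulary", "eat", "L1-verb-eat"),
    ],
    ("english", 2): [
        TargetKnowledgePoint("content_vocabulary", "banana", "L2-vocab-fruit"),
    ],
}


def target_points_for(subject: str, level: int) -> list[TargetKnowledgePoint]:
    """Look up the target knowledge points for a resource's course information."""
    return KNOWLEDGE_GRAPH.get((subject.lower(), level), [])


points = target_points_for("English", 1)
labels = [p.label for p in points]
```

Each target knowledge point carries exactly one label, which matches the one-to-one association stated in the text.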

S203: Generate a basic knowledge point label set of the original data resource according to the result of comparing the similarity values with a similarity threshold.

Here, the similarity threshold is the lower limit that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. The basic knowledge point label set is the set of basic knowledge point labels corresponding to the original data resource, and may include the knowledge point labels associated with target content vocabulary, target high-frequency vocabulary, target verb vocabulary, target mathematical vocabulary, target phonetic symbols, target sentence patterns and target grammar. The basic knowledge point labels in the set are those associated with target knowledge points whose similarity values are greater than the similarity threshold.
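The comparison in S203 reduces to a strict "greater than" filter. A minimal Python sketch follows, assuming the similarity values have already been computed and are keyed by target knowledge point; the function name and the sample scores are illustrative only.

```python
def basic_label_set(similarities: dict[str, float],
                    labels: dict[str, str],
                    threshold: float) -> set[str]:
    """Collect labels of target knowledge points whose similarity exceeds the threshold."""
    return {
        labels[point]
        for point, score in similarities.items()
        if score > threshold  # strictly greater, as the text specifies
    }


similarities = {"apple": 0.92, "eat": 0.70, "banana": 0.35}
labels = {"apple": "vocab-fruit", "eat": "verb-eat", "banana": "vocab-fruit-2"}
result = basic_label_set(similarities, labels, threshold=0.70)
```

Note that a score exactly equal to the threshold, such as the one for "eat" here, does not qualify.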

Generally, the similarity values, similarity thresholds and basic knowledge point labels differ according to the type of basic knowledge point and the method used to calculate the similarity. When the target knowledge point is target content vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: performing sentence segmentation on the text data to obtain a sentence set; chunking the sentence set to obtain a word chunk set; calculating an importance weight for each word chunk in the set using the TF-IDF keyword extraction algorithm; taking the word chunks whose importance weight is greater than a first preset weight as reference content vocabulary; calculating the similarity value between the reference content vocabulary and the target content vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target content vocabulary and adding it to the basic knowledge point label set.
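The selection of reference content vocabulary can be sketched in Python as follows. This is a simplified illustration under stated assumptions: single words stand in for word blocks (chunks), each sentence is treated as one document, and the weight is term frequency times an unsmoothed inverse sentence frequency. A production system would more likely compute IDF against a larger reference corpus; the function names and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive sentence segmentation on ., ! and ? (a stand-in for a real splitter)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tfidf_weights(sentences: list[str]) -> dict[str, float]:
    """Weight each word by term frequency times inverse sentence frequency."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    term_counts = Counter(w for words in tokenized for w in words)
    total_terms = sum(term_counts.values())
    sentence_counts = Counter(w for words in tokenized for w in set(words))
    n = len(tokenized)
    return {
        w: (term_counts[w] / total_terms) * math.log(n / sentence_counts[w])
        for w in term_counts
    }


def content_vocabulary(text: str, first_preset_weight: float) -> list[str]:
    """Keep words whose importance weight is greater than the first preset weight."""
    weights = tfidf_weights(split_sentences(text))
    return sorted(w for w, score in weights.items() if score > first_preset_weight)


text = "The cat likes the apple. The dog likes the banana. The cat sees the dog."
weights = tfidf_weights(split_sentences(text))
reference_content = content_vocabulary(text, first_preset_weight=0.06)
```

In this toy text, "the" occurs in every sentence and so receives a weight of zero, while words confined to one sentence score highest.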

When the target knowledge point is target high-frequency vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: performing sentence segmentation on the text data to obtain a sentence set; chunking the sentence set to obtain a word chunk set; calculating an importance weight for each word chunk in the set using the TF-IDF keyword extraction algorithm; taking the word chunks whose importance weight is less than or equal to a second preset weight as reference high-frequency vocabulary; calculating the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target high-frequency vocabulary and adding it to the basic knowledge point label set.
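The high-frequency branch can be sketched in Python as below, starting from importance weights that are assumed to be already computed. The application does not name the word-to-word similarity measure, so the character-level ratio from the standard library's `difflib.SequenceMatcher` is used purely as a stand-in; the weights, labels and thresholds are made-up sample values.

```python
from difflib import SequenceMatcher


def high_frequency_words(weights: dict[str, float],
                         second_preset_weight: float) -> list[str]:
    """Keep words whose importance weight is at most the second preset weight."""
    return sorted(w for w, s in weights.items() if s <= second_preset_weight)


def word_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the unspecified metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_labels(reference_words: list[str],
                 target_labels: dict[str, str],
                 threshold: float) -> set[str]:
    """Return labels of target words that some reference word matches above the threshold."""
    matched = set()
    for target, label in target_labels.items():
        best = max((word_similarity(ref, target) for ref in reference_words),
                   default=0.0)
        if best > threshold:
            matched.add(label)
    return matched


# Sample importance weights, assumed to come from a TF-IDF step.
weights = {"the": 0.0, "likes": 0.054, "apple": 0.073}
reference = high_frequency_words(weights, second_preset_weight=0.06)
targets = {"the": "hf-the", "like": "hf-like", "are": "hf-are"}
labels = match_labels(reference, targets, threshold=0.8)
```

Here "likes" matches the target "like" with a ratio of 8/9, so a fuzzy measure tolerates inflected forms that an exact comparison would miss.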

When the target knowledge point is target verb vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; tagging each word in the word set with its part of speech to obtain a part-of-speech tag set; taking the words tagged as verbs as reference verb vocabulary; calculating the similarity value between the reference verb vocabulary and the target verb vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target verb vocabulary and adding it to the basic knowledge point label set.

When the target knowledge point is target mathematical vocabulary, generating the basic knowledge point label set of the original data resource according to the result of comparing the similarity value with the similarity threshold includes: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; tagging each word in the word set with its part of speech to obtain a part-of-speech tag set; taking the words tagged as numerals as reference mathematical vocabulary; calculating the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target mathematical vocabulary and adding it to the basic knowledge point label set.
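The verb and numeral branches share one shape: segment into words, tag each word, and keep the words with the wanted part of speech. The Python sketch below shows that shape with a toy lexicon in place of a trained part-of-speech tagger; the lexicon, the tag names and the `words_with_pos` helper are assumptions for illustration, and a real tagger would be passed in through the `tagger` parameter.

```python
import re
from typing import Callable

# Toy lexicon standing in for a trained part-of-speech tagger (assumption).
TOY_LEXICON = {
    "eat": "VERB", "eats": "VERB", "run": "VERB",
    "three": "NUM",
    "apple": "NOUN", "apples": "NOUN", "cat": "NOUN",
    "the": "DET",
}


def toy_tagger(word: str) -> str:
    """Return a coarse tag; digit strings are numerals, unknown words are 'X'."""
    if word.isdigit():
        return "NUM"
    return TOY_LEXICON.get(word, "X")


def words_with_pos(text: str, wanted: str,
                   tagger: Callable[[str], str] = toy_tagger) -> list[str]:
    """Segment text into words, tag each one, and keep those with the wanted tag."""
    words = re.findall(r"[a-z]+|\d+", text.lower())
    return [w for w in words if tagger(w) == wanted]


sentence = "The cat eats 3 apples. Three cats run."
reference_verbs = words_with_pos(sentence, "VERB")
reference_numerals = words_with_pos(sentence, "NUM")
```

The resulting reference words would then be compared with the target verb or mathematical vocabulary and filtered by the similarity threshold, as the text describes.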

When the target knowledge point is a target phonetic symbol, generating the basic knowledge point label set of the original data resource according to the comparison result between the similarity value and the similarity threshold includes: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; analyzing each word in the word set and annotating it with its phonetic symbols to obtain a phonetic symbol set; calculating the similarity value between the word phonetic symbols in the phonetic symbol set and the target phonetic symbol; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target phonetic symbol and adding it to the basic knowledge point label set.

When the target knowledge point is a target sentence pattern, generating the basic knowledge point label set of the original data resource according to the comparison result between the similarity value and the similarity threshold includes: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set; performing dependency parsing on the sentences in the sentence set to obtain dependency syntax trees; calculating the similarity values between the words in the word set, the parts of speech in the part-of-speech tag set and the dependency syntax trees on the one hand, and the words, parts of speech and syntax tree of the target sentence pattern on the other; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target sentence pattern and adding it to the basic knowledge point label set.

When the target knowledge point is a target grammar, generating the basic knowledge point label set of the original data resource according to the comparison result between the similarity value and the similarity threshold includes: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set; calculating, based on the word set and the part-of-speech tag set, the similarity value between the grammar contained in the sentence set and the target grammar; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target grammar and adding it to the basic knowledge point label set.

S204: generating a comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.

Here, the feature information refers to the type of the original data resource. The type may be, for example, audio, video, text or picture book, and different types train different learning abilities (such as listening, speaking, reading and writing). A comprehensive knowledge point label in the comprehensive knowledge point label set is a knowledge point label, generated from the feature information of the original data resource and the basic knowledge point label set, that reflects which abilities of the user (student) the original data resource exercises. One original data resource may have several comprehensive knowledge point labels. For example, if the original data resource contains audio and the basic knowledge point label set contains a knowledge point label associated with a target phonetic symbol, the analysis yields a listening label as the comprehensive knowledge point label of that resource.
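The rule described above can be sketched as a lookup from (resource type, kind of basic label) to a comprehensive label. This is only an illustrative sketch: the table name `RULES`, the function `comprehensive_labels` and the category strings are hypothetical, and only the audio plus phonetic symbol rule is taken from the text; the other two rows are invented placeholders.

```python
# Hypothetical rule table: (resource type, basic-label category) -> comprehensive label.
# Only the first rule (audio + phonetic symbol -> listening) comes from the text.
RULES = {
    ("audio", "phonetic"): "listening",
    ("text", "sentence_pattern"): "reading",      # illustrative assumption
    ("video", "content_vocabulary"): "speaking",  # illustrative assumption
}


def comprehensive_labels(resource_types, basic_labels):
    """Return the comprehensive labels implied by the resource types and basic labels.

    resource_types: set of type names, e.g. {"audio"}
    basic_labels:   iterable of (category, label_name) pairs
    """
    categories = {category for category, _ in basic_labels}
    return {
        label
        for (rtype, category), label in RULES.items()
        if rtype in resource_types and category in categories
    }


print(comprehensive_labels({"audio"}, [("phonetic", "short a sound")]))
```

Because the result is a set, a resource that matches several rules naturally receives several comprehensive labels, which matches the statement that one resource may carry more than one.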

Generally, the comprehensive knowledge point labels of an original data resource depend on the characteristics of the resource itself. Both the comprehensive knowledge point labels in the comprehensive knowledge point label set and the basic knowledge point labels in the basic knowledge point label set are associated with the original data resource. Once the two sets have been generated, the original data resource has been labeled with its relevant knowledge point labels (both basic and comprehensive).

When the scheme of this embodiment is executed, the server preprocesses the original data resource to obtain text data, calculates the similarity between the text data and each of multiple target knowledge points to obtain similarity values, generates the basic knowledge point label set of the original data resource according to the comparison result between the similarity values and the similarity threshold, and generates the comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set. The original data resource can thus be labeled quickly and accurately with its relevant knowledge point labels, improving both the efficiency and the accuracy of labeling.

As described above, the embodiments of the present application mainly take the online education industry as an example, but those skilled in the art will understand that the method is not limited to that industry. The method described in the present application can equally be applied to user tag processing in industries such as retail, transportation, social networking, search, education and medical care.

Please refer to FIG. 3, which is a schematic flowchart of a method for labeling data resources provided in an embodiment of the present application. The method may include the following steps:

S301: preprocessing the original data resource to obtain text data.

Here, the original data resource refers to a data resource that exists as text, audio, video or a similar type. It may be a teaching resource such as exercises, picture books, learning audio or learning video, and it carries its corresponding learning level and subject. The text data is the result of uniformly converting original data resources of text, audio, video and other types into text. The basic knowledge points in the text data may include one or more of reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, reference mathematical vocabulary, reference phonetic symbols, reference sentence patterns and reference grammar.

Generally, when the original data resource is of audio or video type, it can be converted into text data of a preset text type by ASR (Automatic Speech Recognition). As used here, ASR converts audio into text based on a keyword list: the audio content (or the audio track of a video) is converted into speech features via its spectrum, the speech features are matched against the entries in the keyword list, and the best match is taken as the recognition result. When the original data resource is text but not of the preset text type, it must be converted into the preset text type. Common text types include txt, doc, hlp, wps, rtf, htm and pdf; the preset text type can be chosen according to actual needs.

S302: querying a preset knowledge graph for the attribute information corresponding to the original data resource to obtain multiple target knowledge points corresponding to the attribute information.

Here, the original data resource is a teaching resource and the attribute information is course information, that is, course-related information such as the learning level and learning subject corresponding to the original data resource. The target knowledge points include at least one of target content vocabulary, target high-frequency vocabulary, target verb vocabulary, target mathematical vocabulary, target phonetic symbols, target sentence patterns and target grammar. They are knowledge points that can be obtained from the knowledge graph; different learning levels correspond to different target knowledge points, and each target knowledge point is associated with one basic knowledge point label. The knowledge graph contains target knowledge points for different learning levels, grades and learning subjects. It can also be understood as a knowledge domain visualization or knowledge domain map: a series of graphical data that shows how knowledge develops and how it is structured, using visualization techniques to describe knowledge resources and their carriers and to mine, analyze, construct, draw and display knowledge and its interconnections.

Generally, before the similarity between the text data and each of the target knowledge points is calculated, the course information corresponding to the original data resource must be obtained, and the preset knowledge graph must be queried with that course information to obtain the corresponding target knowledge points. The knowledge graph contains target knowledge points for different learning levels, grades and learning subjects. Calculating the similarity between the text data and the target knowledge points means calculating the similarity values between the basic knowledge points in the text data and the target knowledge points; these values determine the basic knowledge point labels of the text data, that is, of the original data resource. Different original data resources yield different course information, and that course information is used to look up the target knowledge points for the resource in the preset knowledge graph. The original data resource may not contain every target knowledge point returned by the query; it may contain only one or several of them. It is therefore necessary to analyze the similarity values between the basic knowledge points in the original data resource and the target knowledge points in the preset knowledge graph. The similarity values determine which target knowledge points the original data resource contains, and hence which basic knowledge point labels can be associated with it.

S303: performing sentence segmentation on the text data to obtain a sentence set, then performing chunking on the sentence set to obtain a word block set and word segmentation on the sentence set to obtain a word set.

Here, the sentence set is the set of sentences obtained by performing sentence segmentation on the text data; each sentence in it is a complete sentence obtained by splitting the text data according to its content, line breaks, punctuation marks and the like. The word block set is the set of phrases (word blocks) obtained by dividing each sentence in the sentence set into phrases. The word set is the set of words obtained by dividing each sentence in the sentence set into words.
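The three outputs of this step (sentence set, word block set, word set) can be illustrated with a deliberately simple sketch for English text. The function names are hypothetical, and the comma-based `chunk` is only a stand-in for real phrase chunking, which the text does not specify.

```python
import re


def split_sentences(text):
    """Split on sentence-ending punctuation or line breaks; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lower-cased word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", sentence.lower())


def chunk(sentence):
    """Naive stand-in for phrase chunking: split at commas and semicolons."""
    pieces = (c.strip(" .!?") for c in re.split(r"[,;]", sentence))
    return [c for c in pieces if c]


text = "I like cats. Do you?\nYes, I do."
sentences = split_sentences(text)
word_blocks = [chunk(s) for s in sentences]   # one block list per sentence
words = [tokenize(s) for s in sentences]      # one word list per sentence
print(sentences, word_blocks, words, sep="\n")
```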

S304: calculating the importance weight of each word block in the word block set using the TF-IDF keyword extraction algorithm.

Here, the importance weight is the weight obtained for each word block after the word blocks are analyzed with the TF-IDF keyword extraction algorithm, that is, the TF-IDF value. A weight is the frequency attached to each number in a weighted average and is also called a weighting coefficient.

Generally, TF-IDF (Term Frequency–Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to how frequently it appears across the corpus. TF stands for term frequency, here the number of occurrences of a term divided by the total number of sentences in the question-and-answer library. IDF is the weighting factor, the inverse document frequency of a word block. The main idea of IDF is that the fewer documents contain a term t (that is, the smaller n is), the larger the IDF, which indicates that t discriminates well between categories. If m documents in a category C contain the term t and k documents in other categories contain t, then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF value given by the IDF formula is small, indicating that t discriminates poorly between categories.
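For reference, the quantities described above can be written in the common textbook form below. The source does not print its formula, so this is an assumption about the intended definitions: n is the number of documents containing t as in the text, while N (total number of documents) and f_{t,d} (number of occurrences of t in document d) are symbols introduced here.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

With N fixed, idf(t) grows as n shrinks, which is the behaviour the paragraph describes.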

The TF-IDF keyword extraction algorithm mainly proceeds as follows. First, calculate the term frequency: so that texts of different lengths can be compared, the term frequency of each word block is normalized. Second, calculate the inverse document frequency, using a corpus to model the environment in which the language is used: the more often a word appears, the larger the denominator and the smaller the inverse document frequency (the closer to 0); 1 is added to the denominator so that it cannot be 0 (the case in which no document contains the word). Third, calculate the TF-IDF value, which is proportional to the number of times the word appears in the text and inversely proportional to how often the word appears in the language as a whole. Once the TF-IDF value of every word in the text has been calculated, the words can be sorted in descending order of TF-IDF value.
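The three stages above (length-normalized term frequency, inverse document frequency with 1 added to the denominator, then ranking) can be sketched as follows. The function names and the toy corpus are hypothetical, and the logarithm is an assumption: the text does not state the exact IDF formula, only that the denominator is incremented by 1.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """TF-IDF of each term in doc_tokens against corpus (a list of token lists).

    TF  = occurrences in the document / document length
    IDF = log(number of documents / (documents containing the term + 1))
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in counts.items():
        containing = sum(1 for doc in corpus if term in doc)
        idf = math.log(len(corpus) / (containing + 1))
        scores[term] = (count / total) * idf
    return scores


def rank(scores):
    """Terms sorted by TF-IDF value in descending order."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "flew"],
    ["a", "fish", "swam"],
]
scores = tf_idf(corpus[0], corpus)
print(rank(scores))
```

Note that with the added 1, a term occurring in every document gets a slightly negative IDF; real implementations often use a different smoothing for that reason.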

S305: taking the word blocks whose importance weight is greater than a first preset weight as reference content vocabulary, and taking the word blocks whose importance weight is less than or equal to a second preset weight as reference high-frequency vocabulary.

Here, the first preset weight is the basis for selecting reference content vocabulary; both the first preset weight and the second preset weight can be set according to actual needs. Reference content vocabulary refers to word blocks that carry a definite meaning. The second preset weight is the basis for selecting reference high-frequency vocabulary; usually the first preset weight is less than the second preset weight.

Generally, after the TF-IDF keyword extraction algorithm has scored the importance of each word block in the word block set, the word blocks can be sorted by importance weight in descending order. The first 1/3 of the sorted word blocks are taken as reference content vocabulary, and the last 1/3 are taken as reference high-frequency vocabulary.
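The top-third and bottom-third selection can be sketched as below. The function name `split_by_rank` and the sample weights are hypothetical; rounding the third down with integer division is also an assumption, since the text does not say how to handle counts not divisible by 3.

```python
def split_by_rank(weights):
    """weights: dict mapping word block -> importance weight.

    Returns (content_candidates, high_frequency_candidates): the first third
    and the last third of the blocks after sorting by weight, descending.
    """
    ranked = sorted(weights, key=weights.get, reverse=True)
    third = len(ranked) // 3
    return ranked[:third], ranked[len(ranked) - third:]


weights = {
    "photosynthesis": 0.90,
    "chlorophyll": 0.80,
    "leaf": 0.50,
    "green": 0.40,
    "the": 0.10,
    "is": 0.05,
}
content, high_frequency = split_by_rank(weights)
print(content, high_frequency)
```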

S306: calculating the similarity value between the reference content vocabulary and the target content vocabulary, and the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary.

Here, the similarity value expresses how similar the two compared quantities are; usually, the larger the similarity value, the more similar the two quantities. In this step the similarity value may be the one between the reference content vocabulary and the target content vocabulary, or the one between the reference high-frequency vocabulary and the target high-frequency vocabulary. The target content vocabulary and the target high-frequency vocabulary are target knowledge points in the knowledge graph that correspond to the course information of the original data resource; the reference content vocabulary and the reference high-frequency vocabulary are basic knowledge points in the text data of the original data resource.

S307: when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target content vocabulary and the basic knowledge point label associated with the target high-frequency vocabulary.

Here, the similarity threshold is the lower bound that a similarity value must exceed, that is, the critical similarity value. Different kinds of basic knowledge point may have different similarity thresholds; the threshold for reference content vocabulary differs from the threshold for reference high-frequency vocabulary. Basic knowledge point labels are associated with the target knowledge points in the knowledge graph. If the basic knowledge points of the original data resource include one whose similarity value with a target knowledge point in the knowledge graph is greater than the similarity threshold, that basic knowledge point can be given the basic knowledge point label of the corresponding target knowledge point. A label is data that describes the characteristics of a data resource, and different data resources have different label data. Labels effectively express the knowledge points, learning content or learning abilities that a data resource involves, and labeling data resources with different labels makes it possible to filter and analyze them.

For example: the sentence set {S1, S2, …, Sn} obtained by segmentation is chunked sentence by sentence to obtain the word block set of each sentence, {B11, B12, …, B1m1}, {B21, B22, …}, …. Because reference content vocabulary is mostly vocabulary related to the learning topic, the TF-IDF keyword extraction algorithm is used, to speed up the calculation, to score the importance of each word block set {B11, B12, …, B1m1}, {B21, B22, …}, …, giving the importance weight of every word block in each set. After the importance weights are sorted in descending order, the first 1/3 of the word blocks are taken as reference content vocabulary, that is, the word blocks whose importance weight is greater than the first preset weight, and the similarity with each target content vocabulary is calculated in turn. Let the reference content vocabulary be Bi and the target content vocabulary be Kj. For Bi and Kj, calculate the edit-distance similarity sim_raw1, the edit-distance similarity after lemmatization sim_lemma1, and the semantic similarity sim_sem1. From these, the total similarity value is voc_score1 = vocα·sim_raw1 + vocβ·sim_lemma1 + vocγ·sim_sem1. If the total similarity value is higher than the first preset similarity threshold voc_score_threshold1, the reference content vocabulary Bi is labeled with the knowledge point label associated with the target content vocabulary Kj.
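The weighted score in this example can be sketched as follows. Several parts are assumptions rather than the patent's method: the edit distance is normalized to [0, 1] by the longer string's length, `toy_lemma` is a crude hypothetical stand-in for a real lemmatizer, the semantic similarity is passed in by the caller because the text does not say how it is computed, and the default values of alpha, beta and gamma (standing for vocα, vocβ, vocγ) are arbitrary.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def toy_lemma(word):
    """Hypothetical stand-in for a lemmatizer: strips a few English suffixes."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def voc_score(ref, target, sim_sem, alpha=0.3, beta=0.3, gamma=0.4):
    """alpha * sim_raw + beta * sim_lemma + gamma * sim_sem."""
    sim_raw = edit_similarity(ref, target)
    sim_lemma = edit_similarity(toy_lemma(ref), toy_lemma(target))
    return alpha * sim_raw + beta * sim_lemma + gamma * sim_sem


score = voc_score("jumps", "jump", sim_sem=1.0)
print(round(score, 2))
```

A reference word block would then receive the target's label when `score` exceeds the threshold for that kind of knowledge point.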

For example: the sentence set {S1, S2, …, Sn} obtained by segmentation is chunked sentence by sentence to obtain the word block set of each sentence, {B11, B12, …, B1m1}, {B21, B22, …}, …. Because reference content vocabulary is mostly vocabulary related to the learning topic, the TF-IDF keyword extraction algorithm is used, to speed up the calculation, to score the importance of each word block set {B11, B12, …, B1m1}, {B21, B22, …}, …, giving the importance weight of every word block in each set. After the importance weights are sorted in descending order, the last 1/3 of the word blocks are taken as reference high-frequency vocabulary, and the similarity with each target high-frequency vocabulary is calculated in turn. Let the reference high-frequency vocabulary be Ba and the target high-frequency vocabulary be Kb. For Ba and Kb, calculate the edit-distance similarity sim_raw2, the edit-distance similarity after lemmatization sim_lemma2, and the semantic similarity sim_sem2. From these, the total similarity value is voc_score2 = vocα·sim_raw2 + vocβ·sim_lemma2 + vocγ·sim_sem2. If the total similarity value is higher than the preset similarity threshold voc_score_threshold2, the reference high-frequency vocabulary Ba is labeled with the knowledge point label associated with the target high-frequency vocabulary Kb.

S308: adding the basic knowledge point label associated with the target content vocabulary and the basic knowledge point label associated with the target high-frequency vocabulary to the basic knowledge point label set.

Here, the basic knowledge point label set is the set of basic knowledge point labels corresponding to the original data resource. The labels it contains are those associated with target knowledge points whose similarity value is greater than the similarity threshold; different kinds of basic knowledge point have different similarity thresholds. The number of basic knowledge point labels in the set depends on the content of the original data resource. The labels in the set are associated with the original data resource; in other words, a label being in the set means that the original data resource has been labeled with that basic knowledge point label.
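The thresholding and collection steps can be sketched as below, with one threshold per kind of knowledge point. All names and sample values are hypothetical and serve only to illustrate the "greater than threshold, then add the associated label" rule.

```python
def collect_basic_labels(scores, thresholds, label_of):
    """Return labels of target knowledge points whose score exceeds the threshold.

    scores:     dict (kind, target) -> best similarity value found in the text
    thresholds: dict kind -> similarity threshold for that kind of knowledge point
    label_of:   dict (kind, target) -> basic knowledge point label
    """
    labels = set()
    for (kind, target), value in scores.items():
        if value > thresholds[kind]:
            labels.add(label_of[(kind, target)])
    return labels


scores = {("verb", "run"): 0.92, ("verb", "swim"): 0.40, ("math", "three"): 0.75}
thresholds = {"verb": 0.80, "math": 0.70}
label_of = {
    ("verb", "run"): "verb:run",
    ("verb", "swim"): "verb:swim",
    ("math", "three"): "number:three",
}
basic_labels = collect_basic_labels(scores, thresholds, label_of)
print(sorted(basic_labels))
```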

S309: performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set.

Here, the part-of-speech tag set is the set of parts of speech corresponding to the words in the word set; each word in the word set maps one-to-one to a part of speech in the part-of-speech tag set. The parts of speech may include nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections, onomatopoeia and so on; which ones actually occur depends on the learning content in the original data resource.

Generally, part-of-speech tagging must determine the most appropriate part of speech for each word from the context of the sentence. A word may have more than one possible part of speech, for example a word that can be used as both a noun and a verb; such multi-category words occur with high probability among common words. This case can be handled with probabilistic methods, for example by using an HMM (Hidden Markov Model) to tag such words. Words can also be tagged with transformation-based or classification-based methods.
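To illustrate how an HMM resolves a word that can be either a noun or a verb, the sketch below runs Viterbi decoding over a two-tag model. The probability tables are invented toy numbers, not values from the source, and the function name `viterbi` and its parameters are hypothetical.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence for words under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (all numbers are illustrative assumptions).
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.8, "VERB": 0.2}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"dogs": 0.6, "walk": 0.4}, "VERB": {"walk": 0.9, "dogs": 0.1}}

print(viterbi(["dogs", "walk"], TAGS, START, TRANS, EMIT))
print(viterbi(["walk", "dogs"], TAGS, START, TRANS, EMIT))
```

In this toy model the same word "walk" is tagged as a verb in both inputs, while "dogs" keeps its noun reading; changing the tables changes the outcome, which is the sense in which the tag depends on context and probabilities.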

S310: taking the words tagged as verbs as reference verb vocabulary, and taking the words tagged as numerals as reference mathematical vocabulary.

Generally, part-of-speech analysis of the words in the word set gives the part of speech of each word contained in the original data resource. The words tagged as verbs and the words tagged as numerals can therefore be filtered out separately: the verbs are taken as reference verb vocabulary knowledge points, and the numerals are taken as reference mathematical vocabulary knowledge points.
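The filtering described above reduces to selecting words by tag. In this sketch the tags are supplied by hand and use the Universal POS names `VERB` and `NUM`; which tagger and tag set the method actually uses is not stated in the text, so both are assumptions, as is the function name.

```python
def select_reference_vocab(words, tags):
    """Return (reference verb vocabulary, reference mathematical vocabulary).

    words and tags are parallel lists: tags[i] is the part of speech of words[i].
    """
    verbs = [w for w, t in zip(words, tags) if t == "VERB"]
    numerals = [w for w, t in zip(words, tags) if t == "NUM"]
    return verbs, numerals


words = ["she", "counts", "three", "apples"]
tags = ["PRON", "VERB", "NUM", "NOUN"]
ref_verbs, ref_math = select_reference_vocab(words, tags)
print(ref_verbs, ref_math)
```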

S311: calculating the similarity value between the reference verb vocabulary and the target verb vocabulary, and the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary.

Here, the reference verb vocabulary and the reference mathematical vocabulary are basic knowledge points contained in the original data resource, and the target verb vocabulary and the target mathematical vocabulary are target knowledge points in the knowledge graph that correspond to the course information of the original data resource. The similarity value expresses how similar the two compared quantities are; usually, the larger the similarity value, the more similar the two quantities. In this step the similarity value may be the one between the reference verb vocabulary and the target verb vocabulary, or the one between the reference mathematical vocabulary and the target mathematical vocabulary.

S312: When the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target math vocabulary.

Here, the similarity threshold is the lower bound that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different basic knowledge points may have different thresholds; the threshold for the reference verb vocabulary differs from the threshold for the reference math vocabulary.

S313: Add the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target math vocabulary to the basic knowledge point label set.

Here, a basic knowledge point label is associated with a target knowledge point in the knowledge graph. If a basic knowledge point of the original data resource has a similarity value with a target knowledge point in the knowledge graph that is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data that describe the characteristics of a data resource, and different data resources have different label data. Labels effectively represent the knowledge points, learning content, or learning abilities that a data resource involves, and annotating data resources with different labels makes it possible to filter and analyze them.

For example: the sentence set {S1, S2…Sn} obtained by sentence segmentation is tokenized sentence by sentence to obtain the word set of each sentence, {{T11, T12…T1o1}, {T21, T22…T2o2}…}. Each word in the word set is tagged with its part of speech to obtain the part-of-speech tag set; words tagged as verbs are taken as the reference verb vocabulary, and words tagged as numerals are taken as the reference math vocabulary. Each reference verb word, denoted Vi, is compared in turn with each target verb word, denoted Uj, and each numeral word, denoted Va, is compared in turn with each target numeral word, denoted Ub. For Vi and Uj, the edit-distance similarity sim_raw3, the edit-distance similarity after lemmatization sim_lemma3, and the semantic similarity sim_sem3 are calculated, and the total similarity value is computed from them as voc_score3 = vocα·sim_raw3 + vocβ·sim_lemma3 + vocγ·sim_sem3. If the total similarity value is higher than the third preset similarity threshold voc_score_threshold3, the reference verb word Vi is annotated with the knowledge point label associated with the target verb word Uj. Likewise, for Va and Ub, the edit-distance similarity sim_raw4, the edit-distance similarity after lemmatization sim_lemma4, and the semantic similarity sim_sem4 are calculated, and the total similarity value is computed as voc_score4 = vocα·sim_raw4 + vocβ·sim_lemma4 + vocγ·sim_sem4. If the total similarity value is higher than the fourth preset similarity threshold voc_score_threshold4, the numeral word Va is annotated with the knowledge point label associated with the target numeral word Ub.
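The weighted vocabulary score described above can be sketched roughly as follows. The weights, the threshold, the normalization of edit distance to a similarity in [0, 1], the toy lemmatizer, and the toy semantic similarity are all assumptions made for illustration; the application does not specify them.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def voc_score(ref: str, target: str,
              lemmatize: Callable[[str], str],
              semantic_sim: Callable[[str, str], float],
              alpha: float = 0.3, beta: float = 0.3, gamma: float = 0.4) -> float:
    """Weighted sum of raw, lemmatized, and semantic similarity (weights assumed)."""
    sim_raw = edit_similarity(ref, target)
    sim_lemma = edit_similarity(lemmatize(ref), lemmatize(target))
    sim_sem = semantic_sim(ref, target)
    return alpha * sim_raw + beta * sim_lemma + gamma * sim_sem


# Toy stand-ins for a real lemmatizer and a real semantic model (hypothetical).
LEMMAS = {"runs": "run", "running": "run"}


def toy_lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)


def toy_semantic(a: str, b: str) -> float:
    return 1.0 if toy_lemmatize(a) == toy_lemmatize(b) else 0.0


VOC_SCORE_THRESHOLD = 0.8  # assumed value
score = voc_score("runs", "run", toy_lemmatize, toy_semantic)
labels = set()
if score > VOC_SCORE_THRESHOLD:
    labels.add("verb:run")  # label associated with the target word (hypothetical name)
print(score, labels)
```

Passing the lemmatizer and the semantic model as callables keeps the sketch independent of any particular NLP library.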

S314: Analyze each word in the word set and annotate each word with its phonetic symbols to obtain a phonetic symbol set.

Here, the phonetic symbol set contains the phonetic transcription of each word in the word set, and the words in the word set map one-to-one to the entries in the phonetic symbol set.

S315: Calculate the similarity value between each word phonetic symbol in the phonetic symbol set and the target phonetic symbol.

Here, the word phonetic symbols in the phonetic symbol set are basic knowledge points contained in the original data resource, and the target phonetic symbols are target knowledge points in the knowledge graph that correspond to the course information of the original data resource. A similarity value expresses how similar the two compared quantities are; generally, the larger the value, the more similar they are. In this step, the similarity value is the one between a word phonetic symbol and a target phonetic symbol.

S316: When the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target phonetic symbol.

Here, the similarity threshold is the lower bound that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different basic knowledge points may have different thresholds; the threshold for word phonetic symbols can be set as needed and may differ from the thresholds described above.

S317: Add the basic knowledge point label associated with the target phonetic symbol to the basic knowledge point label set.

Here, a basic knowledge point label is associated with a target knowledge point in the knowledge graph. If a basic knowledge point of the original data resource has a similarity value with a target knowledge point in the knowledge graph that is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data that describe the characteristics of a data resource, and different data resources have different label data. Labels effectively represent the knowledge points, learning content, or learning abilities that a data resource involves, and annotating data resources with different labels makes it possible to filter and analyze them.

For example: the sentence set {S1, S2…Sn} obtained by sentence segmentation is tokenized sentence by sentence to obtain the word set of each sentence, {{T11, T12…T1o1}, {T21, T22…T2o2}…}. A dictionary tool converts the words in the word set into the corresponding phonetic symbol set {{P11, P12…P1o1}, {P21, P22…P2o2}…}. Each word phonetic symbol Pi in the phonetic symbol set is then compared in turn with each target phonetic symbol Kj. The comparison includes the containment similarity sim_in, which indicates whether the pronunciation combination is contained in the source word Ti corresponding to Pi; the edit-distance similarity sim_edit between Pi and Kj; and the longest-common-substring similarity sim_lcs between Pi and Kj. The total similarity value is computed from these scores as phon_score = phonα·sim_in + phonβ·sim_edit + phonγ·sim_lcs. If the total similarity value is higher than the preset similarity threshold phon_score_threshold, the word phonetic symbol Pi is annotated with the knowledge point label associated with the target phonetic symbol Kj.
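A rough sketch of the phonetic score follows. It assumes that a target phonetic knowledge point carries both a letter combination (for sim_in) and a phonetic transcription (for sim_edit and sim_lcs), and that both string similarities are normalized by the longer string; these modelling choices, the weights, and all names are illustrative assumptions rather than details from the application.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def phon_score(word: str, word_ipa: str, target_letters: str, target_ipa: str,
               alpha: float = 0.5, beta: float = 0.25, gamma: float = 0.25) -> float:
    """phon_score = alpha*sim_in + beta*sim_edit + gamma*sim_lcs (weights assumed)."""
    sim_in = 1.0 if target_letters in word else 0.0
    longest = max(len(word_ipa), len(target_ipa))
    if longest == 0:
        sim_edit = sim_lcs = 1.0
    else:
        sim_edit = 1.0 - levenshtein(word_ipa, target_ipa) / longest
        sim_lcs = longest_common_substring(word_ipa, target_ipa) / longest
    return alpha * sim_in + beta * sim_edit + gamma * sim_lcs


# Hypothetical example: the word "ship" against a target "sh" pronounced "ʃ".
score = phon_score("ship", "ʃɪp", "sh", "ʃ")
print(round(score, 4))
```

Normalizing by the longer string penalizes short targets; a real system might instead normalize by the target length so that a fully contained sound scores 1.0.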

S318: Perform dependency parsing on the sentences in the sentence set to obtain dependency syntax trees.

Here, a dependency syntax tree is a tree that describes the dependency relations between the words in the data resource; it expresses the syntactic collocation relations between words, and these relations are tied to semantics. The basic task of dependency parsing is to determine the syntactic structure of a sentence or the dependency relations between its words. It involves two aspects: defining the grammar of the language, that is, giving a formal definition of the grammatical structure of well-formed sentences; and the parsing technique, that is, automatically deriving the syntactic structure of a sentence from the given grammar and analyzing the syntactic units in the sentence and the relations between them.

S319: Calculate the similarity values between the words in the word set, the parts of speech in the part-of-speech tag set, and the dependency syntax tree on the one hand, and the words, parts of speech, and syntax tree of the target sentence pattern on the other.

Here, a similarity value expresses how similar the two compared quantities are; generally, the larger the value, the more similar they are. In this step, the similarity values are those between the words in the word set of the original data resource, the parts of speech in the part-of-speech tag set, and the dependency syntax tree, and the corresponding words, parts of speech, and syntax tree of the target sentence pattern.

S320: When the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target sentence pattern.

Here, the similarity threshold is the lower bound that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different basic knowledge points may have different thresholds; the thresholds for the words in the word set, the parts of speech in the part-of-speech tag set, and the dependency syntax tree can be set as needed and may differ from the thresholds described above.

S321: Add the basic knowledge point label associated with the target sentence pattern to the basic knowledge point label set.

Here, a basic knowledge point label is associated with a target knowledge point in the knowledge graph. If a basic knowledge point of the original data resource has a similarity value with a target knowledge point in the knowledge graph that is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data that describe the characteristics of a data resource, and different data resources have different label data. Labels effectively represent the knowledge points, learning content, or learning abilities that a data resource involves, and annotating data resources with different labels makes it possible to filter and analyze them.

For example: the sentence set {S1, S2…Sn} obtained by sentence segmentation is tokenized sentence by sentence to obtain the word set of each sentence, {{T11, T12…T1o1}, {T21, T22…T2o2}…}. Part-of-speech tagging and dependency parsing of the sentence set {S1, S2…Sn} yield the part-of-speech tag set {{Pos11, Pos12…Pos1o1}, {Pos21, Pos22…Pos2o2}…} and the dependency syntax trees {Tree1, Tree2…Treen}, respectively. The following are then calculated in turn: the Jaccard similarity sim_token_jaccard between the word set {Ti1, Ti2…Tim}, Ti1={T11, T12…T1o1}, and the example-sentence set {KTj1, KTj2…KTjn} of the target sentence-pattern knowledge point; the edit-distance similarity sim_pos_edit and the longest-common-substring similarity sim_pos_lcs between the part-of-speech tag set {Posi1, Posi2…Posio1} and the part-of-speech tag set {KPosj1, KPosj2…KPosjn} of the example sentences of the target sentence-pattern knowledge point; and the tree similarity sim_tree between Treei and the tree KTreej of the example sentence of the target sentence-pattern knowledge point. The total similarity value is computed from these values as sent_score = sentα·sim_token_jaccard + sentβ·sim_pos_edit + sentγ·sim_pos_lcs + sentθ·sim_tree. If the total similarity value is higher than the preset similarity threshold sent_score_threshold, the sentence Si is annotated with the basic knowledge point label associated with the sentence pattern Kj.
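The sentence-pattern score can be sketched as below. The application does not say how the tree similarity sim_tree is computed, so this sketch substitutes a Jaccard overlap of dependency edges written as (head tag, relation, dependent tag) triples; that choice, the equal weights, and the normalizations are assumptions for illustration only.

```python
from typing import Sequence, Set, Tuple

Edge = Tuple[str, str, str]  # (head tag, relation, dependent tag), an assumed encoding


def jaccard(a, b) -> float:
    """Jaccard overlap of two collections treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance over tag sequences."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, start=1):
        curr = [i]
        for j, xb in enumerate(b, start=1):
            cost = 0 if xa == xb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def longest_common_substring(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest contiguous run of tags shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for xa in a:
        curr = [0]
        for j, xb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if xa == xb else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def sent_score(tokens: Sequence[str], pos: Sequence[str], edges: Set[Edge],
               k_tokens: Sequence[str], k_pos: Sequence[str], k_edges: Set[Edge],
               weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of token, POS-edit, POS-LCS, and tree similarity (weights assumed)."""
    longest = max(len(pos), len(k_pos))
    sim_token_jaccard = jaccard(tokens, k_tokens)
    sim_pos_edit = 1.0 - levenshtein(pos, k_pos) / longest if longest else 1.0
    sim_pos_lcs = longest_common_substring(pos, k_pos) / longest if longest else 1.0
    sim_tree = jaccard(edges, k_edges)
    a, b, c, d = weights
    return a * sim_token_jaccard + b * sim_pos_edit + c * sim_pos_lcs + d * sim_tree


# Hypothetical sentence Si and example sentence of a target sentence pattern Kj.
pattern_edges = {("VERB", "nsubj", "PRON"), ("VERB", "obj", "NOUN")}
score = sent_score(["i", "like", "apples"], ["PRON", "VERB", "NOUN"], pattern_edges,
                   ["i", "like", "pears"], ["PRON", "VERB", "NOUN"], pattern_edges)
print(score)
```

In this toy case only one word differs, so the token overlap drops to 0.5 while the part-of-speech and tree terms stay at 1.0, which is the behaviour one would want when matching a sentence pattern rather than exact wording.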

S322: Based on the word set and the part-of-speech tag set, calculate the similarity value between the grammar contained in the sentence set and the target grammar.

Here, the grammar contained in the sentence set is a basic knowledge point of the original data resource, and the target grammar is a target knowledge point in the knowledge graph that corresponds to the course information of the original data resource. A similarity value expresses how similar the two compared quantities are; generally, the larger the value, the more similar they are. In this step, the similarity value is the one between the grammar in the word set of the original data resource and the target grammar.

S323: When the similarity value is greater than the similarity threshold, obtain the basic knowledge point label associated with the target grammar.

Here, the similarity threshold is the lower bound that a similarity value must exceed to satisfy the condition, that is, the critical similarity value. Different basic knowledge points may have different thresholds; the threshold for the grammar in the word set can be set as needed and may differ from the thresholds described above.

S324: Add the basic knowledge point label associated with the target grammar to the basic knowledge point label set.

Here, a basic knowledge point label is associated with a target knowledge point in the knowledge graph. If a basic knowledge point of the original data resource has a similarity value with a target knowledge point in the knowledge graph that is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data that describe the characteristics of a data resource, and different data resources have different label data. Labels effectively represent the knowledge points, learning content, or learning abilities that a data resource involves, and annotating data resources with different labels makes it possible to filter and analyze them.

For example: the sentence set {S1, S2…Sn} obtained by sentence segmentation is tokenized sentence by sentence to obtain the word set of each sentence, {{T11, T12…T1o1}, {T21, T22…T2o2}…}, and part-of-speech tagging of the sentence set {S1, S2…Sn} yields the part-of-speech tag set {{Pos11, Pos12…Pos1o1}, {Pos21, Pos22…Pos2o2}…}. The following are then calculated: the containment similarity sim_in, which indicates whether the grammar fragment is contained in the word set {Ti1, Ti2…Tim}, Ti1={T11, T12…T1o1}, Ti2={T21, T22…T2o2}…, or in the part-of-speech tag set {KPosj1, KPosj2…KPosjn}, KPosj1={Pos11, Pos12…Pos1o1}, KPosj2={Pos21, Pos22…Pos2o2}…; the position similarity sim_position of the grammar fragment in the words Ti1={T11, T12…T1o1} and in the target words {KTj1, KTj2…KTjn}; and the part-of-speech similarity sim_pos of the grammar fragment. The total similarity value is computed from these values as gram_score = gramα·sim_in + gramβ·sim_position + gramγ·sim_pos. If the total similarity value is higher than the preset similarity threshold gram_score_threshold, the sentence Si is annotated with the basic knowledge point label associated with the grammar of the target sentence Kj.

For example: the calculation yields the basic knowledge point labels of the original data resource and their counts, which can be written as Tag(Text) = {voc:num1, verb:num2, math:num3, hfw:num4, phon:num5, sent:num6, gram:num7 | numi >= 0}. The overall flow for calculating the similarity between the basic knowledge points of the original data resource and the target knowledge points in the knowledge graph that correspond to the course information of the original data resource is shown in FIG. 4.
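The Tag(Text) summary can be illustrated with a small counting sketch. The seven category keys come from the text; the representation of a label as a (category, name) pair and the function name `tag_summary` are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Category keys as listed in Tag(Text).
CATEGORIES = ("voc", "verb", "math", "hfw", "phon", "sent", "gram")


def tag_summary(labels: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count basic knowledge point labels per category; absent categories count as 0."""
    counts = Counter(category for category, _ in labels)
    return {category: counts.get(category, 0) for category in CATEGORIES}


# Hypothetical labels collected for one resource.
collected = [("voc", "apple"), ("voc", "pear"), ("verb", "run"), ("phon", "sh")]
summary = tag_summary(collected)
print(summary)
```

Reporting every category, including those with a count of zero, matches the constraint numi >= 0 in the notation above.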

S325: Generate the comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.

Here, the feature information refers to the type of the original data resource. Types may include audio, video, text, picture books, and so on, and different types train different learning abilities (for example listening, speaking, reading, and writing). A comprehensive knowledge point label in the comprehensive knowledge point label set is a knowledge point label, generated from the feature information of the original data resource and the basic knowledge point label set, that reflects which abilities of the user (student) the resource exercises. One original data resource may have several comprehensive knowledge point labels. For example, if the original data resource contains audio and the basic knowledge point label set contains a knowledge point label associated with a target phonetic symbol, analysis yields a listening label as the comprehensive knowledge point label of that resource.
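One way to picture S325 is as a rule table that maps a resource type and a basic label category to an ability label. Only the audio-plus-phonetic-symbol rule comes from the text; the other rules, the category keys, and the function name are hypothetical examples added for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (resource type, basic label category) -> comprehensive label
RULES: Dict[Tuple[str, str], str] = {
    ("audio", "phon"): "listening",  # rule stated in the text
    ("text", "sent"): "reading",     # hypothetical rule
    ("text", "gram"): "writing",     # hypothetical rule
}


def comprehensive_labels(resource_types: Iterable[str],
                         basic_categories: Iterable[str]) -> List[str]:
    """Return the ability labels whose rule matches the resource and its basic labels."""
    types = set(resource_types)
    categories = set(basic_categories)
    matched = {label for (rtype, category), label in RULES.items()
               if rtype in types and category in categories}
    return sorted(matched)


# A resource that contains audio and has a phonetic-symbol label.
result = comprehensive_labels(["audio"], ["phon", "voc"])
print(result)
```

A table-driven design keeps the mapping editable without code changes, which suits a setting where one resource can receive several comprehensive labels.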

In general, the comprehensive knowledge point labels of an original data resource depend on the characteristics of the resource itself. Both the comprehensive knowledge point labels in the comprehensive knowledge point label set and the basic knowledge point labels in the basic knowledge point label set are associated with the original data resource. Once the two label sets have been generated, the original data resource has been annotated with its relevant knowledge point labels (both basic and comprehensive).

When the solution of this embodiment is executed, the server preprocesses the original data resource to obtain text data, queries the preset knowledge graph for the attribute information corresponding to the original data resource, and obtains the multiple target knowledge points corresponding to that attribute information. It performs sentence segmentation on the text data to obtain a sentence set and chunks the sentence set to obtain a word block set. Using the TF-IDF keyword extraction algorithm, it calculates an importance weight for each word block in the word block set; word blocks whose importance weight is greater than the first preset weight are taken as reference content vocabulary, and word blocks whose importance weight is less than or equal to the second preset weight are taken as reference high-frequency vocabulary. It calculates the similarity value between the reference content vocabulary and the target content vocabulary and the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary; when a similarity value is greater than the similarity threshold, it obtains the basic knowledge point label associated with the target content vocabulary and the basic knowledge point label associated with the target high-frequency vocabulary and adds them to the basic knowledge point label set. Based on the word set, it tags each word in the word set with its part of speech to obtain the part-of-speech tag set, takes the words tagged as verbs as reference verb vocabulary and the words tagged as numerals as reference math vocabulary, and calculates the similarity value between the reference verb vocabulary and the target verb vocabulary and the similarity value between the reference math vocabulary and the target math vocabulary; when a similarity value is greater than the similarity threshold, it obtains the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target math vocabulary and adds them to the basic knowledge point label set. It analyzes each word in the word set and annotates each word with its phonetic symbols to obtain a phonetic symbol set, and calculates the similarity value between each word phonetic symbol in the phonetic symbol set and the target phonetic symbol; when the similarity value is greater than the similarity threshold, it obtains the basic knowledge point label associated with the target phonetic symbol and adds it to the basic knowledge point label set. It performs dependency parsing on the sentences in the sentence set to obtain dependency syntax trees, and calculates the similarity values between the words in the word set, the parts of speech in the part-of-speech tag set, and the dependency syntax tree and the corresponding words, parts of speech, and syntax tree of the target sentence pattern; when the similarity value is greater than the similarity threshold, it obtains the basic knowledge point label associated with the target sentence pattern and adds it to the basic knowledge point label set. Based on the word set and the part-of-speech tag set, it calculates the similarity value between the grammar contained in the sentence set and the target grammar; when the similarity value is greater than the similarity threshold, it obtains the basic knowledge point label associated with the target grammar and adds it to the basic knowledge point label set. Finally, it generates the comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set. In this way, the original data resource can be annotated quickly and accurately with the relevant knowledge point labels, which improves both the efficiency and the accuracy of annotation.

As described above, the embodiments take the online education industry as the main example, but those skilled in the art will understand that the method is not limited to online education. The method described in this application can also be applied to user label processing in industries such as retail, transportation, social networking, search, education, and healthcare.

The following are device embodiments of the present application, which can be used to carry out the method embodiments of the present application. For details not disclosed in the device embodiments, refer to the method embodiments of the present application.

Refer to FIG. 5, which shows a schematic structural diagram of a data resource annotation device provided by an exemplary embodiment of the present application, hereinafter referred to as device 5. Device 5 can be implemented as all or part of a terminal through software, hardware, or a combination of both. Device 5 includes a preprocessing module 501, a calculation module 502, a first processing module 503, and a second processing module 504.

The preprocessing module 501 is configured to preprocess the original data resource to obtain text data;

The calculation module 502 is configured to calculate a similarity value between the text data and each of multiple target knowledge points, wherein each of the multiple target knowledge points is associated with a basic knowledge point label;

The first processing module 503 is configured to generate a basic knowledge point label set of the original data resource according to the result of comparing the similarity values with the similarity threshold, wherein the basic knowledge point labels included in the basic knowledge point label set are the basic knowledge point labels associated with the target knowledge points whose similarity values are greater than the similarity threshold;

The second processing module 504 is configured to generate a comprehensive knowledge point label set of the original data resource according to the feature information of the original data resource and the basic knowledge point label set.
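The four modules can be pictured as stages of a small pipeline. The sketch below is only one possible arrangement under stated assumptions: the class name `AnnotationDevice`, the callable signatures, the single shared threshold, and the toy implementations plugged in at the end are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple


@dataclass
class AnnotationDevice:
    preprocess: Callable[[bytes], str]             # stands in for module 501
    compute: Callable[[str], Dict[str, float]]     # module 502: target -> similarity
    target_labels: Dict[str, str]                  # target -> basic knowledge point label
    threshold: float                               # one shared threshold (simplification)
    combine: Callable[[str, Set[str]], Set[str]]   # stands in for module 504

    def basic_labels(self, scores: Dict[str, float]) -> Set[str]:
        """Module 503: keep labels of targets whose similarity exceeds the threshold."""
        return {self.target_labels[t] for t, s in scores.items() if s > self.threshold}

    def annotate(self, raw: bytes, resource_type: str) -> Tuple[Set[str], Set[str]]:
        text = self.preprocess(raw)
        scores = self.compute(text)
        basic = self.basic_labels(scores)
        return basic, self.combine(resource_type, basic)


# Toy implementations, for illustration only.
device = AnnotationDevice(
    preprocess=lambda raw: raw.decode("utf-8").lower(),
    compute=lambda text: {"apple": 1.0 if "apple" in text else 0.0,
                          "run": 1.0 if "run" in text else 0.0},
    target_labels={"apple": "voc:apple", "run": "verb:run"},
    threshold=0.5,
    combine=lambda rtype, basic: {"reading"} if rtype == "text" and basic else set(),
)
basic, comprehensive = device.annotate(b"An Apple a day", "text")
print(basic, comprehensive)
```

The text allows a different threshold per kind of knowledge point; a fuller sketch would replace the single `threshold` field with a per-category mapping.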

Optionally, the device 5 further includes:

A query unit, configured to query the preset knowledge graph for the attribute information corresponding to the original data resource and obtain the multiple target knowledge points corresponding to the attribute information.

可选地,所述装置5中的所述原始数据资源为教学资源,所述属性信息为课程信息。Optionally, the original data resource in the device 5 is a teaching resource, and the attribute information is course information.

Optionally, the first processing module 503 includes:

a first processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, perform chunking on the sentence set to obtain a word block set, and perform word segmentation on the sentence set to obtain a word set;

a first calculation unit, configured to analyze the word block set and the word set to obtain a reference vocabulary set, and calculate a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; wherein the reference vocabulary set includes reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, and reference mathematical vocabulary; the target knowledge point corresponding to the reference content vocabulary is target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is target mathematical vocabulary;

a first adding unit, configured to, when the similarity value for a corresponding target knowledge point is greater than its corresponding similarity threshold, add the basic knowledge point tag corresponding to that target knowledge point to the basic knowledge point tag set; or

a second processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

a second calculation unit, configured to analyze the word set to obtain a phonetic symbol set, a part-of-speech tag set, and a dependency syntax tree, and calculate similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set, and the grammar in the sentence set and their respective target knowledge points; wherein the target knowledge point corresponding to the word phonetic symbols is a target phonetic symbol, the target knowledge point corresponding to the sentence patterns is a target sentence pattern, and the target knowledge point corresponding to the grammar is target grammar;

a second adding unit, configured to, when the similarity value for a corresponding target knowledge point is greater than its corresponding similarity threshold, add the basic knowledge point tag corresponding to that target knowledge point to the basic knowledge point tag set.

Optionally, the first processing module 503 includes:

a third processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform chunking on the sentence set to obtain a word block set;

a third calculation unit, configured to calculate an importance weight of each word block in the word block set based on the TF-IDF keyword extraction algorithm;

a first selection unit, configured to take the word blocks whose importance weight is greater than a first preset weight as the reference content vocabulary;

a fourth calculation unit, configured to calculate a similarity value between the reference content vocabulary and the target content vocabulary;

a first acquisition unit, configured to acquire the basic knowledge point tag associated with the target content vocabulary when the similarity value is greater than the similarity threshold;

a first adding unit, configured to add the basic knowledge point tag associated with the target content vocabulary to the basic knowledge point tag set.
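The weighting and selection steps above can be sketched as follows. This is a minimal sketch under stated assumptions: each sentence is treated as one document, the weight of a word block is its term frequency over the whole text multiplied by log(N / document frequency), and the function names are hypothetical. The present application does not fix a particular TF-IDF variant.

```python
import math
from collections import Counter


def tfidf_weights(sentence_chunks):
    """Return {word_block: weight}; each sentence counts as one document."""
    n_sentences = len(sentence_chunks)
    term_counts = Counter(c for sent in sentence_chunks for c in sent)
    doc_counts = Counter(c for sent in sentence_chunks for c in set(sent))
    total = sum(term_counts.values())
    return {
        chunk: (term_counts[chunk] / total)
        * math.log(n_sentences / doc_counts[chunk])
        for chunk in term_counts
    }


def reference_content_vocabulary(sentence_chunks, first_preset_weight):
    """Keep the word blocks whose weight is greater than the first preset weight."""
    weights = tfidf_weights(sentence_chunks)
    return {chunk for chunk, w in weights.items() if w > first_preset_weight}


# Toy input: word blocks that occur in every sentence receive weight 0.
CHUNKS = [
    ["the cat", "sits", "on", "the mat"],
    ["the cat", "eats", "on", "the table"],
    ["the cat", "sleeps", "on", "the sofa"],
]
```

With the toy input, word blocks that appear in every sentence ("the cat", "on") fall below any positive first preset weight, so only the distinctive word blocks remain as reference content vocabulary.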

Optionally, the first processing module 503 includes:

a fourth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform chunking on the sentence set to obtain a word block set;

a fourth calculation unit, configured to calculate an importance weight of each word block in the word block set based on the TF-IDF keyword extraction algorithm;

a second selection unit, configured to take the word blocks whose importance weight is less than or equal to a second preset weight as the reference high-frequency vocabulary;

a fifth calculation unit, configured to calculate a similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary;

a second acquisition unit, configured to acquire the basic knowledge point tag associated with the target high-frequency vocabulary when the similarity value is greater than the similarity threshold;

a second adding unit, configured to add the basic knowledge point tag associated with the target high-frequency vocabulary to the basic knowledge point tag set.
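The high-frequency branch differs from the content branch only in the direction of the weight comparison. The sketch below takes the importance weights as given and shows the selection, comparison, and tag collection; the string ratio from `difflib.SequenceMatcher` and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def reference_high_frequency_vocabulary(weights, second_preset_weight):
    """Keep word blocks whose weight is less than or equal to the second preset weight."""
    return {chunk for chunk, w in weights.items() if w <= second_preset_weight}


def high_frequency_tags(weights, targets, second_preset_weight, threshold):
    """Collect tags of target high-frequency words matched by a reference word.

    `targets` maps each target high-frequency word to its basic tag.
    """
    tags = set()
    for ref in reference_high_frequency_vocabulary(weights, second_preset_weight):
        for target, tag in targets.items():
            if SequenceMatcher(None, ref, target).ratio() > threshold:
                tags.add(tag)
    return tags
```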

Optionally, the first processing module 503 includes:

a fifth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

a first tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

a third selection unit, configured to take the words whose part of speech is verb as the reference verb vocabulary;

a sixth calculation unit, configured to calculate a similarity value between the reference verb vocabulary and the target verb vocabulary;

a third acquisition unit, configured to acquire the basic knowledge point tag associated with the target verb vocabulary when the similarity value is greater than the similarity threshold;

a third adding unit, configured to add the basic knowledge point tag associated with the target verb vocabulary to the basic knowledge point tag set.

Optionally, the first processing module 503 includes:

a sixth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

a second tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

a fourth selection unit, configured to take the words whose part of speech is numeral as the reference mathematical vocabulary;

a seventh calculation unit, configured to calculate a similarity value between the reference mathematical vocabulary and the target mathematical vocabulary;

a fourth acquisition unit, configured to acquire the basic knowledge point tag associated with the target mathematical vocabulary when the similarity value is greater than the similarity threshold;

a fourth adding unit, configured to add the basic knowledge point tag associated with the target mathematical vocabulary to the basic knowledge point tag set.
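The verb and numeral variants share one structure: tag each word with its part of speech, filter by the wanted part of speech, and compare the survivors with the target vocabulary. A minimal sketch follows. The lexicon `TOY_POS_LEXICON` is a hypothetical stand-in for a real part-of-speech tagger, and the string ratio is an assumed similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon; a real system would use a trained tagger.
TOY_POS_LEXICON = {
    "the": "DET",
    "dog": "NOUN",
    "runs": "VERB",
    "two": "NUM",
}


def pos_tag(words):
    """Return (word, part_of_speech) pairs; unknown words get the tag 'X'."""
    return [(w, TOY_POS_LEXICON.get(w, "X")) for w in words]


def reference_words(tagged, wanted_pos):
    """Select the words whose part of speech equals wanted_pos."""
    return [w for w, pos in tagged if pos == wanted_pos]


def matched_tags(reference, targets, threshold):
    """Collect the tags of targets whose similarity to a reference word exceeds the threshold."""
    return {
        tag
        for ref in reference
        for target, tag in targets.items()
        if SequenceMatcher(None, ref, target).ratio() > threshold
    }
```

Passing `"VERB"` to `reference_words` gives the reference verb vocabulary, and passing `"NUM"` gives the reference mathematical vocabulary; the remaining steps are identical.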

Optionally, the first processing module 503 includes:

a seventh processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

a third tagging unit, configured to analyze each word in the word set and annotate the words with phonetic symbols to obtain a phonetic symbol set;

an eighth calculation unit, configured to calculate a similarity value between the word phonetic symbols in the phonetic symbol set and the target phonetic symbol;

a fifth selection unit, configured to acquire the basic knowledge point tag associated with the target phonetic symbol when the similarity value is greater than the similarity threshold;

a fifth adding unit, configured to add the basic knowledge point tag associated with the target phonetic symbol to the basic knowledge point tag set.
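One possible reading of the phonetic-symbol variant is sketched below. It assumes that each word is looked up in a pronunciation dictionary, that the transcription is a list of phonemes, and that a target phonetic symbol matches when its string ratio with any single phoneme exceeds the threshold. The dictionary `TOY_PRONUNCIATIONS` and the function names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical pronunciation dictionary: word -> list of phonemes.
TOY_PRONUNCIATIONS = {
    "cat": ["k", "æ", "t"],
    "ship": ["ʃ", "ɪ", "p"],
}


def phonetic_symbol_set(words):
    """Annotate the known words with their phonemes; unknown words are skipped."""
    return {w: TOY_PRONUNCIATIONS[w] for w in words if w in TOY_PRONUNCIATIONS}


def phonetic_tags(words, targets, threshold):
    """Collect tags of target phonetic symbols matched by any phoneme of any word."""
    tags = set()
    for phonemes in phonetic_symbol_set(words).values():
        for target, tag in targets.items():
            best = max(SequenceMatcher(None, p, target).ratio() for p in phonemes)
            if best > threshold:
                tags.add(tag)
    return tags
```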

Optionally, the first processing module 503 includes:

an eighth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

a fourth tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

an analysis unit, configured to perform dependency parsing on the sentences in the sentence set to obtain a dependency syntax tree;

a ninth calculation unit, configured to calculate similarity values between the words in the word set, the parts of speech in the part-of-speech tag set, and the dependency syntax tree and, respectively, the words, parts of speech, and syntax tree of the target sentence pattern;

a fifth acquisition unit, configured to acquire the basic knowledge point tag associated with the target sentence pattern when the similarity value is greater than the similarity threshold;

a sixth adding unit, configured to add the basic knowledge point tag associated with the target sentence pattern to the basic knowledge point tag set.
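The three-way comparison against a target sentence pattern can be sketched as follows. The sketch assumes that a dependency tree is represented as a set of (head part of speech, relation, dependent part of speech) triples, that each of the three similarities is a Jaccard index, and that the three values are averaged into one score. These representations and the averaging rule are assumptions made for illustration, not requirements of the present application.

```python
def jaccard(a, b):
    """Jaccard index of two collections; two empty collections count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sentence_pattern_similarity(sentence, target):
    """Average the word, part-of-speech, and dependency-tree similarities."""
    parts = [
        jaccard(sentence["words"], target["words"]),
        jaccard(sentence["pos"], target["pos"]),
        jaccard(sentence["deps"], target["deps"]),
    ]
    return sum(parts) / len(parts)


# Hypothetical target pattern "this is a ...", with the noun as the root.
DEPS = {("NOUN", "nsubj", "PRON"), ("NOUN", "cop", "AUX"), ("NOUN", "det", "DET")}
TARGET_PATTERN = {
    "words": ["this", "is", "a"],
    "pos": ["PRON", "AUX", "DET", "NOUN"],
    "deps": DEPS,
}
SENTENCE = {
    "words": ["this", "is", "a", "cat"],
    "pos": ["PRON", "AUX", "DET", "NOUN"],
    "deps": DEPS,
}
```

For the toy sentence the word similarity is 3/4 and the other two are 1, so the averaged score is 2.75/3, which would exceed a threshold of, say, 0.8.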

Optionally, the first processing module 503 includes:

a ninth processing unit, configured to perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

a fifth tagging unit, configured to perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

a tenth calculation unit, configured to calculate, based on the word set and the part-of-speech tag set, a similarity value between the grammar contained in the sentence set and the target grammar;

a sixth acquisition unit, configured to acquire the basic knowledge point tag associated with the target grammar when the similarity value is greater than the similarity threshold;

a seventh adding unit, configured to add the basic knowledge point tag associated with the target grammar to the basic knowledge point tag set.
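As a hedged sketch of the grammar variant, a target grammar may be modelled as a short sequence of part-of-speech tags, and the similarity as the best match of that sequence against any window of the same length in the sentence's tag sequence. The tag names (for example `VBG` for an "-ing" verb form) and the windowing rule are assumptions of this sketch.

```python
from difflib import SequenceMatcher


def grammar_similarity(sentence_tags, target_tags):
    """Best ratio between the target tag sequence and any same-length window."""
    k = len(target_tags)
    windows = (
        sentence_tags[i:i + k] for i in range(len(sentence_tags) - k + 1)
    )
    return max(
        (SequenceMatcher(None, w, target_tags).ratio() for w in windows),
        default=0.0,
    )


# Hypothetical target grammar: present continuous as auxiliary + "-ing" form.
PRESENT_CONTINUOUS = ["AUX", "VBG"]
```

For example, the tag sequence of "she is reading a book" contains the window `["AUX", "VBG"]` and scores 1.0, whereas "she reads a book" scores 0.0 under this measure.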

It should be noted that when device 5 provided in the above embodiment executes the data resource labeling method, the division into the above functional modules is used only as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data resource labeling device provided in the above embodiment and the data resource labeling method embodiments belong to the same concept; for details of the implementation process, see the method embodiments, which are not repeated here.

As described above, the embodiments are described mainly using the online education industry as an example, but those skilled in the art will understand that the method is not limited to that industry; for example, the method described in the present application is applicable to user tag processing in industries such as retail, transportation, social networking, search, education, and healthcare.

FIG. 6 is a schematic structural diagram of a data resource labeling device provided by an embodiment of the present application, hereinafter referred to as device 6. Device 6 may be integrated into the aforementioned server or terminal device. As shown in FIG. 6, the device includes a memory 602, a processor 601, an input device 603, an output device 604, and a communication interface.

The memory 602 may be an independent physical unit connected to the processor 601, the input device 603, and the output device 604 via a bus. The memory 602, the processor 601, the input device 603, and the output device 604 may also be integrated together and implemented in hardware.

The memory 602 is configured to store a program implementing the above method embodiments or the modules of the device embodiments, and the processor 601 calls the program to execute the operations of the above method embodiments.

The input device 603 includes, but is not limited to, a keyboard, a mouse, a touch panel, a camera, and a microphone; the output device 604 includes, but is not limited to, a display screen.

The communication interface is configured to send and receive various types of messages, and includes, but is not limited to, a wireless interface or a wired interface.

Optionally, when part or all of the data resource labeling method in the above embodiments is implemented in software, the device may include only a processor. In that case the memory storing the program is located outside the device, and the processor is connected to the memory through a circuit or wire in order to read and execute the program stored in the memory.

The processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.

The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above types of memory.

The processor 601 calls the program code in the memory 602 to perform the following steps:

preprocessing an original data resource to obtain text data;

calculating a similarity value between the text data and each of a plurality of target knowledge points; wherein each of the plurality of target knowledge points is associated with a basic knowledge point tag;

generating a basic knowledge point tag set of the original data resource according to the result of comparing each similarity value with a similarity threshold; wherein the basic knowledge point tags included in the basic knowledge point tag set are the basic knowledge point tags associated with the target knowledge points whose similarity values are greater than the similarity threshold;

generating a comprehensive knowledge point tag set of the original data resource according to feature information of the original data resource and the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

query a preset knowledge graph for attribute information corresponding to the original data resource, to obtain the plurality of target knowledge points corresponding to the attribute information.
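The lookup described above can be pictured with a small in-memory structure. In this sketch the knowledge graph is reduced to a dictionary keyed by course information, and the field names (`course`, `kind`, `target`, `tag`) are hypothetical; the present application does not prescribe a storage format.

```python
# Hypothetical knowledge graph: course information -> target knowledge points.
KNOWLEDGE_GRAPH = {
    "Level 1 / Unit 2": [
        {"kind": "content_vocabulary", "target": "apple", "tag": "vocab:apple"},
        {"kind": "phonetic_symbol", "target": "æ", "tag": "phonics:short-a"},
    ],
}


def query_target_knowledge_points(resource, graph=KNOWLEDGE_GRAPH):
    """Return the target knowledge points registered for the resource's course."""
    return graph.get(resource["course"], [])
```

Restricting the comparison to the knowledge points of the resource's own course keeps the number of similarity calculations small.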

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, perform chunking on the sentence set to obtain a word block set, and perform word segmentation on the sentence set to obtain a word set;

analyze the word block set and the word set to obtain a reference vocabulary set, and calculate a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; wherein the reference vocabulary set includes reference content vocabulary, reference high-frequency vocabulary, reference verb vocabulary, and reference mathematical vocabulary; the target knowledge point corresponding to the reference content vocabulary is target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is target mathematical vocabulary;

when the similarity value for a corresponding target knowledge point is greater than its corresponding similarity threshold, add the basic knowledge point tag corresponding to that target knowledge point to the basic knowledge point tag set; or

perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

analyze the word set to obtain a phonetic symbol set, a part-of-speech tag set, and a dependency syntax tree, and calculate similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set, and the grammar in the sentence set and their respective target knowledge points; wherein the target knowledge point corresponding to the word phonetic symbols is a target phonetic symbol, the target knowledge point corresponding to the sentence patterns is a target sentence pattern, and the target knowledge point corresponding to the grammar is target grammar;

when the similarity value for a corresponding target knowledge point is greater than its corresponding similarity threshold, add the basic knowledge point tag corresponding to that target knowledge point to the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, and perform chunking on the sentence set to obtain a word block set;

calculate an importance weight of each word block in the word block set based on the TF-IDF keyword extraction algorithm;

take the word blocks whose importance weight is greater than a first preset weight as the reference content vocabulary;

calculate a similarity value between the reference content vocabulary and the target content vocabulary;

when the similarity value is greater than the similarity threshold, acquire the basic knowledge point tag associated with the target content vocabulary;

add the basic knowledge point tag associated with the target content vocabulary to the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, and perform chunking on the sentence set to obtain a word block set;

calculate an importance weight of each word block in the word block set based on the TF-IDF keyword extraction algorithm;

take the word blocks whose importance weight is less than or equal to a second preset weight as the reference high-frequency vocabulary;

calculate a similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary;

when the similarity value is greater than the similarity threshold, acquire the basic knowledge point tag associated with the target high-frequency vocabulary;

add the basic knowledge point tag associated with the target high-frequency vocabulary to the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

take the words whose part of speech is verb as the reference verb vocabulary;

calculate a similarity value between the reference verb vocabulary and the target verb vocabulary;

when the similarity value is greater than the similarity threshold, acquire the basic knowledge point tag associated with the target verb vocabulary;

add the basic knowledge point tag associated with the target verb vocabulary to the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

take the words whose part of speech is numeral as the reference mathematical vocabulary;

calculate a similarity value between the reference mathematical vocabulary and the target mathematical vocabulary;

when the similarity value is greater than the similarity threshold, acquire the basic knowledge point tag associated with the target mathematical vocabulary;

add the basic knowledge point tag associated with the target mathematical vocabulary to the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

analyze each word in the word set and annotate the words with phonetic symbols to obtain a phonetic symbol set;

calculate a similarity value between the word phonetic symbols in the phonetic symbol set and the target phonetic symbol;

when the similarity value is greater than the similarity threshold, acquire the basic knowledge point tag associated with the target phonetic symbol;

add the basic knowledge point tag associated with the target phonetic symbol to the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

perform dependency parsing on the sentences in the sentence set to obtain a dependency syntax tree;

calculate similarity values between the words in the word set, the parts of speech in the part-of-speech tag set, and the dependency syntax tree and, respectively, the words, parts of speech, and syntax tree of the target sentence pattern;

when the similarity value is greater than the similarity threshold, acquire the basic knowledge point tag associated with the target sentence pattern;

add the basic knowledge point tag associated with the target sentence pattern to the basic knowledge point tag set.

In one or more embodiments, the processor 601 is further configured to:

perform sentence segmentation on the text data to obtain a sentence set, and perform word segmentation on the sentence set to obtain a word set;

perform part-of-speech tagging on each word in the word set to obtain a part-of-speech tag set;

calculate, based on the word set and the part-of-speech tag set, a similarity value between the grammar contained in the sentence set and the target grammar;

when the similarity value is greater than the similarity threshold, acquire the basic knowledge point tag associated with the target grammar;

add the basic knowledge point tag associated with the target grammar to the basic knowledge point tag set.

It should be noted that when device 6 provided in the above embodiment executes the data resource labeling method, the division into the above functional modules is used only as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data resource labeling device provided in the above embodiment and the data resource labeling method embodiments belong to the same concept; for details of the implementation process, see the method embodiments, which are not repeated here.

As described above, the embodiments are described mainly using the online education industry as an example, but those skilled in the art will understand that the method is not limited to that industry; for example, the method described in the present application is applicable to user tag processing in industries such as retail, transportation, social networking, search, education, and healthcare.

The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.

An embodiment of the present application further provides a computer storage medium. The computer storage medium may store a plurality of instructions suitable for being loaded by a processor to execute the method steps of the embodiments shown in FIGS. 2 to 3. For the specific execution process, refer to the description of the embodiments shown in FIGS. 2 to 3, which is not repeated here.

本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that the embodiments of the present application may be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment in combination with software and hardware. Moreover, the present application may adopt the form of a computer program product implemented in one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Claims (13)

1. A method for labeling data resources, the method comprising:
preprocessing an original data resource to obtain text data;
carrying out similarity calculation on the text data and a plurality of target knowledge points respectively to obtain a similarity value; wherein, each of the plurality of target knowledge points is associated with a basic knowledge point tag;
generating a basic knowledge point tag set of the original data resource according to a comparison result of the similarity value and the similarity threshold value; the basic knowledge point labels included in the basic knowledge point label set are as follows: basic knowledge point labels associated with target knowledge points with similarity values greater than a similarity threshold;
generating a comprehensive knowledge point tag set of the original data resource according to the characteristic information of the original data resource and the basic knowledge point tag set; the characteristic information refers to the type of the original data resource, and the comprehensive knowledge point labels in the comprehensive knowledge point label set are knowledge point labels that are generated based on the characteristic information of the original data resource and the basic knowledge point label set and that reflect the user abilities exercised by the original data resource;
wherein the generating of the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold value comprises:
performing sentence segmentation on the text data to obtain a sentence set, and performing block segmentation and word segmentation on the sentence set to obtain a word block set and a word set, respectively;
analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary;
when the similarity value for a corresponding target knowledge point is larger than the similarity threshold value, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set; or
performing sentence segmentation on the text data to obtain a sentence set, and performing word segmentation on the sentence set to obtain a word set;
analyzing the word set to obtain a phonetic symbol set, a part-of-speech tagging set and a dependency syntax tree, and calculating similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set and the grammar in the sentence set, respectively, and their corresponding target knowledge points; the target knowledge point corresponding to the word phonetic symbols is a target phonetic symbol, the target knowledge point corresponding to the sentence patterns is a target sentence pattern, and the target knowledge point corresponding to the grammar is a target grammar;
and when the similarity value of each corresponding target knowledge point is larger than the similarity threshold value, adding the basic knowledge point label corresponding to each corresponding target knowledge point into the basic knowledge point label set.
2. The method of claim 1, wherein the method of determining the plurality of target knowledge points comprises:
inquiring attribute information corresponding to the original data resources from a preset knowledge graph to obtain a plurality of target knowledge points corresponding to the attribute information.
3. The method of claim 2, wherein the raw data resource is a teaching resource and the attribute information is course information.
4. The method of claim 1, wherein the target knowledge point comprises: a target content vocabulary;
the steps of performing sentence segmentation on the text data to obtain a sentence set, and performing block segmentation and word segmentation on the sentence set to obtain a word block set and a word set, respectively;
analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary; and
when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set, comprise:
sentence segmentation processing is carried out on the text data to obtain a sentence set, and block segmentation processing is carried out on the sentence set to obtain a word block set;
calculating importance degree weights of each word block in the word block set based on a keyword extraction TF-IDF algorithm;
taking the word blocks with the importance degree weight larger than a first preset weight as the reference content vocabulary;
calculating a similarity value of the reference content vocabulary and the target content vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point tag associated with the target content vocabulary;
and adding the basic knowledge point labels associated with the target content vocabulary into the basic knowledge point label set.
5. The method of claim 1, wherein the target knowledge point comprises: a target high-frequency vocabulary;
the steps of performing sentence segmentation on the text data to obtain a sentence set, and performing block segmentation and word segmentation on the sentence set to obtain a word block set and a word set, respectively;
analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary; and
when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set, comprise:
sentence segmentation processing is carried out on the text data to obtain a sentence set, and block segmentation processing is carried out on the sentence set to obtain a word block set;
calculating importance degree weights of each word block in the word block set based on a keyword extraction TF-IDF algorithm;
taking the word blocks with the importance degree weight less than or equal to a second preset weight as the reference high-frequency vocabulary;
calculating the similarity value of the reference high-frequency vocabulary and the target high-frequency vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point tag associated with the target high-frequency vocabulary;
and adding the basic knowledge point labels associated with the target high-frequency words into the basic knowledge point label set.
6. The method of claim 1, wherein the target knowledge point comprises: target verb vocabulary;
the steps of performing sentence segmentation on the text data to obtain a sentence set, and performing block segmentation and word segmentation on the sentence set to obtain a word block set and a word set, respectively;
analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary; and
when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set, comprise:
sentence segmentation processing is carried out on the text data to obtain a sentence set, and word segmentation processing is carried out on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
taking the words with the parts of speech being verbs as reference verb words;
calculating a similarity value of the reference verb vocabulary and the target verb vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target verb vocabulary;
and adding the basic knowledge point labels associated with the target verb words into the basic knowledge point label set.
7. The method of claim 1, wherein the target knowledge point comprises: a target mathematical vocabulary;
the steps of performing sentence segmentation on the text data to obtain a sentence set, and performing block segmentation and word segmentation on the sentence set to obtain a word block set and a word set, respectively;
analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary; and
when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set, comprise:
sentence segmentation processing is carried out on the text data to obtain a sentence set, and word segmentation processing is carried out on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
Taking the words with the parts of speech being words with the parts of speech as reference mathematical words;
calculating a similarity value of the reference mathematical vocabulary and the target mathematical vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target mathematical vocabulary;
and adding the basic knowledge point labels associated with the target mathematical vocabulary into the basic knowledge point label set.
8. The method of claim 1, wherein the target knowledge point comprises: target phonetic symbols;
the steps of performing sentence segmentation on the text data to obtain a sentence set, and performing word segmentation on the sentence set to obtain a word set;
analyzing the word set to obtain a phonetic symbol set, a part-of-speech tagging set and a dependency syntax tree, and calculating similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set and the grammar in the sentence set, respectively, and their corresponding target knowledge points; the target knowledge point corresponding to the word phonetic symbols is a target phonetic symbol, the target knowledge point corresponding to the sentence patterns is a target sentence pattern, and the target knowledge point corresponding to the grammar is a target grammar; and
when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set, comprise:
sentence segmentation processing is carried out on the text data to obtain a sentence set, and word segmentation processing is carried out on the sentence set to obtain a word set;
analyzing each word in the word set, and marking phonetic symbols on each word to obtain a phonetic symbol set;
calculating similarity values of word phonetic symbols in the phonetic symbol set and the target phonetic symbol;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target phonetic symbol;
and adding the basic knowledge point labels associated with the target phonetic symbols into the basic knowledge point label set.
9. The method of claim 1, wherein the target knowledge point comprises: a target sentence pattern;
the steps of performing sentence segmentation on the text data to obtain a sentence set, and performing word segmentation on the sentence set to obtain a word set;
analyzing the word set to obtain a phonetic symbol set, a part-of-speech tagging set and a dependency syntax tree, and calculating similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set and the grammar in the sentence set, respectively, and their corresponding target knowledge points; the target knowledge point corresponding to the word phonetic symbols is a target phonetic symbol, the target knowledge point corresponding to the sentence patterns is a target sentence pattern, and the target knowledge point corresponding to the grammar is a target grammar; and
when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set, comprise:
sentence segmentation processing is carried out on the text data to obtain a sentence set, and word segmentation processing is carried out on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
performing dependency syntax analysis on sentences in the sentence set to obtain a dependency syntax tree;
calculating similarity values between the words in the word set, the parts of speech in the part-of-speech tagging set and the dependency syntax tree, respectively, and the corresponding words, parts of speech and syntax tree of the target sentence pattern;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target sentence pattern;
and adding the basic knowledge point labels associated with the target sentence patterns into the basic knowledge point label set.
10. The method of claim 1, wherein the target knowledge point comprises: target grammar;
the steps of performing sentence segmentation on the text data to obtain a sentence set, and performing word segmentation on the sentence set to obtain a word set;
analyzing the word set to obtain a phonetic symbol set, a part-of-speech tagging set and a dependency syntax tree, and calculating similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set and the grammar in the sentence set, respectively, and their corresponding target knowledge points; the target knowledge point corresponding to the word phonetic symbols is a target phonetic symbol, the target knowledge point corresponding to the sentence patterns is a target sentence pattern, and the target knowledge point corresponding to the grammar is a target grammar; and
when the similarity value for a corresponding target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point into the basic knowledge point label set, comprise:
sentence segmentation processing is carried out on the text data to obtain a sentence set, and word segmentation processing is carried out on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
calculating the similarity value of grammar contained in the sentence set and the target grammar based on the word set and the part-of-speech tagging set;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point tag associated with the target grammar;
and adding the basic knowledge point labels associated with the target grammar into the basic knowledge point label set.
11. A device for labeling data resources, the device comprising:
the preprocessing module is used for preprocessing the original data resources to obtain text data;
the calculation module is used for calculating the similarity between the text data and a plurality of target knowledge points to obtain a similarity value; wherein, each of the plurality of target knowledge points is associated with a basic knowledge point tag;
the first processing module is used for generating a basic knowledge point tag set of the original data resource according to a comparison result of the similarity value and the similarity threshold value; the basic knowledge point labels included in the basic knowledge point label set are as follows: basic knowledge point labels associated with target knowledge points with similarity values greater than a similarity threshold;
the second processing module is used for generating a comprehensive knowledge point tag set of the original data resource according to the characteristic information of the original data resource and the basic knowledge point tag set; the characteristic information refers to the type of the original data resource, and the comprehensive knowledge point labels in the comprehensive knowledge point label set refer to knowledge point labels which are generated based on the characteristic information of the original data resource and the basic knowledge point label set and can reflect the exercise user capacity of the original data resource;
wherein the first processing module includes:
the first processing unit is used for performing sentence segmentation on the text data to obtain a sentence set, and performing block segmentation and word segmentation on the sentence set to obtain a word block set and a word set, respectively;
the first calculation unit is used for analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary item in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary;
the first adding unit is used for adding the basic knowledge point label associated with a corresponding target knowledge point into the basic knowledge point label set when the similarity value for that target knowledge point is larger than the corresponding similarity threshold value; or
the second processing unit is used for performing sentence segmentation on the text data to obtain a sentence set, and performing word segmentation on the sentence set to obtain a word set;
the second calculation unit is used for analyzing the word set to respectively obtain a phonetic symbol set, a part-of-speech labeling set and a dependency syntax tree, and calculating similarity values of word phonetic symbols in the phonetic symbol set, sentence patterns in the sentence set and grammar in the sentence set and corresponding target knowledge points; the target knowledge points corresponding to the word phonetic symbols are target phonetic symbols, the target knowledge points corresponding to the sentence patterns are target sentence patterns, and the target knowledge points corresponding to the grammar are target grammar;
and the second adding unit is used for adding the basic knowledge point labels corresponding to the respective corresponding target knowledge points into the basic knowledge point label set when the similarity value of the respective corresponding target knowledge points is larger than the respective corresponding similarity threshold value.
12. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of claims 1 to 10.
13. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-10.
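
By way of non-limiting illustration only, the label-generation steps recited in claims 1, 4 and 5 might be sketched as follows. The sketch is not the implementation of the application, and everything in it that the claims do not recite is an assumption: the function names (`split_sentences`, `tokenize`, `tfidf_weights`, `similarity`, `basic_tags`, `label_text`), the use of single words in place of word blocks, the treatment of each sentence as one document when computing TF-IDF, the use of `difflib.SequenceMatcher` as a stand-in similarity measure, and the default weights and threshold.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive sentence segmentation on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Naive word segmentation; single words stand in for word blocks."""
    return re.findall(r"[a-z']+", sentence.lower())


def tfidf_weights(sentences):
    """Highest TF-IDF weight per word, treating each sentence as a document
    (an assumption; a real system would likely use a larger corpus)."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = {}
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            weights[word] = max(weights.get(word, 0.0), tf * idf)
    return weights


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the claims do not fix a measure."""
    return SequenceMatcher(None, a, b).ratio()


def basic_tags(reference_words, target_points, threshold):
    """Tags of target knowledge points whose similarity exceeds the threshold.

    target_points maps a target knowledge point to its basic tag."""
    tags = set()
    for ref in reference_words:
        for target, tag in target_points.items():
            if similarity(ref, target) > threshold:
                tags.add(tag)
    return tags


def label_text(text, content_targets, high_freq_targets,
               first_weight=0.1, second_weight=0.0, threshold=0.8):
    """Build a basic knowledge point tag set from preprocessed text."""
    weights = tfidf_weights(split_sentences(text))
    # Weight above the first preset weight: reference content vocabulary.
    content_words = [w for w, v in weights.items() if v > first_weight]
    # Weight at or below the second preset weight: reference
    # high-frequency vocabulary.
    high_freq_words = [w for w, v in weights.items() if v <= second_weight]
    return (basic_tags(content_words, content_targets, threshold)
            | basic_tags(high_freq_words, high_freq_targets, threshold))


tags = label_text(
    "The cat eats fish. The dog eats meat. The bird eats seeds.",
    content_targets={"cat": "vocab:animals", "fish": "vocab:food"},
    high_freq_targets={"the": "sight-word:the"},
)
print(sorted(tags))  # ['sight-word:the', 'vocab:animals', 'vocab:food']
```

In this toy input, "the" and "eats" occur in every sentence, so their inverse document frequency is zero and they fall into the high-frequency vocabulary, while the remaining words exceed the first preset weight and are treated as content vocabulary. The target dictionaries stand in for the target knowledge points that claim 2 obtains from a preset knowledge graph.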
CN202010580828.4A 2020-06-23 2020-06-23 Labeling method and device for data resources, storage medium and electronic equipment Active CN111930792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010580828.4A CN111930792B (en) 2020-06-23 2020-06-23 Labeling method and device for data resources, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111930792A CN111930792A (en) 2020-11-13
CN111930792B true CN111930792B (en) 2024-04-12

Family

ID=73316724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010580828.4A Active CN111930792B (en) 2020-06-23 2020-06-23 Labeling method and device for data resources, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111930792B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836013B (en) * 2021-01-29 2024-08-02 北京大米科技有限公司 Data labeling method and device, readable storage medium and electronic equipment
CN113569007B (en) * 2021-06-18 2024-06-21 武汉理工数字传播工程有限公司 Method, device and storage medium for processing knowledge service resources
CN113536777A (en) * 2021-07-30 2021-10-22 深圳豹耳科技有限公司 Extraction method, device and equipment of news keywords and storage medium
CN114036907B (en) * 2021-11-18 2024-06-25 国网江苏省电力有限公司电力科学研究院 Text data amplification method based on field characteristics
CN114443903A (en) * 2021-12-28 2022-05-06 新瑞鹏宠物医疗集团有限公司 Method for extracting video label and related product
CN114676774B (en) * 2022-03-25 2025-01-07 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN114492419B (en) * 2022-04-01 2022-08-23 杭州费尔斯通科技有限公司 Text labeling method, system and device based on newly added key words in labeling
CN116029284B (en) * 2023-03-27 2023-07-21 上海蜜度信息技术有限公司 Chinese substring extraction method, chinese substring extraction system, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090955A (en) * 2014-07-07 2014-10-08 科大讯飞股份有限公司 Automatic audio/video label labeling method and system
CN105956144A (en) * 2016-05-13 2016-09-21 安徽教育网络出版有限公司 Method for quantitatively calculating association degree among multi-tab learning resources
CN110162591A (en) * 2019-05-22 2019-08-23 南京邮电大学 A kind of entity alignment schemes and system towards digital education resource

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437837B2 (en) * 2015-10-09 2019-10-08 Fujitsu Limited Generating descriptive topic labels

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
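A multi-knowledge-point labeling method for test questions based on ensemble learning (一种基于集成学习的试题多知识点标注方法); Guo Chonghui (郭崇慧); Lü Zhengda (吕征达); Operations Research and Management Science (运筹与管理); 2020-02-25 (No. 02); full text *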

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250711

Address after: No. 902, 9th Floor, Unit 2, Building 1, No. 333 Jiqing 3rd Road, Chengdu High tech Zone, Chengdu Free Trade Zone, Sichuan Province 610000

Patentee after: Chengdu Yudi Technology Co.,Ltd.

Country or region after: China

Address before: Building 6, Huitong Times Square, 1 yaojiayuan South Road, Chaoyang District, Beijing 100025

Patentee before: BEIJING DA MI TECHNOLOGY Co.,Ltd.

Country or region before: China