CN109829375A

CN109829375A - A machine learning method, device, equipment and system

Info

Publication number: CN109829375A
Application number: CN201811617557.4A
Authority: CN
Inventors: 郑海刚; 吕旭涛; 王孝宇
Original assignee: Shenzhen Intellifusion Technologies Co Ltd
Current assignee: Shenzhen Intellifusion Technologies Co Ltd
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2019-05-31

Abstract

The present application discloses a machine learning method, apparatus, device and system. The system includes: a processing module configured to cyclically execute a target operation until a test result associated with data in a test data set satisfies a preset condition, the processing module including a labeling module, a training module and a computing module; and a storage module configured to store an acquired data set and the test data set. The labeling module is configured to label the data in the data set if that data has not been labeled. The training module is configured to screen the labeled data that is greater than or equal to a first preset threshold, perform feature extraction on the screened data, and input the extracted features into a pre-trained model to obtain a trained model. The computing module is configured to input the data in the test data set into the trained model, obtain the test result associated with that data, and evaluate whether the test result satisfies the preset condition. The present application can increase model training speed.

Description

A machine learning method, device, equipment and system

Technical Field

The present application relates to the technical field of machine learning, and in particular to a machine learning method, apparatus, device and system.

Background

With the growth of data scale, computing power and storage capacity, artificial intelligence has developed rapidly and is widely applied in many fields to mine the value of data. Behind this rapid development lies a large investment of human labor. The industry saying "artificial intelligence has as much intelligence as it has manual labor" refers to the large amount of manpower that must be invested before artificial intelligence technology can be combined with a specific business: first a large-scale data set is needed, then a great deal of manpower is needed to label the data, and after labeling, algorithm engineers must spend a great deal of time writing code for training and parameter tuning before a model is finally output. A complete machine learning process includes data set labeling, data set processing, feature processing, model training, model evaluation and model deployment. To lower the threshold of machine learning and improve the efficiency of model training, a machine learning platform can be built on top of a machine learning framework such as TensorFlow to provide a friendly mode of interaction and support process-oriented operation; an example is the 先知 (Prophet) platform of 第四范式 (4Paradigm), which generates a directed acyclic graph for training by dragging and dropping components.

However, in an actual production environment, a model that can be productized (mainly unsupervised learning) often requires multiple rounds of training iterations: data must be collected and labeled, and the labeled data set used for retraining to improve model accuracy, which is a closed-loop process. If every iteration starts by importing unlabeled raw data into a labeling system, exporting it from the labeling system and then importing it into a machine learning system, the operation is complicated. Moreover, large-scale data (TB level) is transferred back and forth between different systems, so transmission is slow and prone to network interruption; multiple copies of the same data are stored in different systems, which is redundant and wastes storage space; and labeling is currently manual, which is time-consuming, laborious and inefficient.

Summary of the Invention

The present application provides a machine learning method, apparatus, device and system in which the labeling module and the training module share storage. On the one hand, sharing the data reduces data redundancy and greatly improves the utilization of storage space. On the other hand, it reduces the amount of data-set data that has to be transmitted between different systems (in the prior art, a labeling system and a training system that are independent of each other and far apart in space), which reduces data transmission delay and increases the speed of model training. In addition, unlabeled data in the data set is labeled automatically by machine, which increases the speed at which the data set is labeled.

In a first aspect, the present application provides a machine learning system, the system comprising:

a processing module, configured to cyclically execute a target operation until the test result associated with the data in a test data set satisfies a preset condition, the processing module including a labeling module, a training module and a computing module;

a storage module, configured to store the acquired data set and the test data set;

the labeling module, configured to label the data in the acquired data set if that data has not been labeled;

the training module, configured to screen the labeled data that is greater than or equal to a first preset threshold, perform feature extraction on the screened data, and then input the extracted features into a pre-trained model to obtain a trained model, the features being used to train the pre-trained model;

the computing module, configured to input the data in the acquired test data set into the trained model, obtain the test result associated with the data in the test data set, and then evaluate that test result to determine whether it satisfies the preset condition.

In combination with the first aspect, in some optional embodiments:

After it is judged whether the data in the data set has been labeled,

the training module is further configured to, if all the data in the data set has been labeled, perform feature extraction on the data in the data set and input the extracted features into the pre-trained model to obtain a trained model, the features being used to train the pre-trained model;

the computing module is further configured to input the data in the acquired test data set into the trained model, obtain the test result associated with the data in the test data set, and then evaluate that test result to determine whether it satisfies the preset condition.

The labeling module is configured to label the data in the data set by machine labeling if that data has not been labeled.

The labeling module is further configured to visualize the data after labeling the data in the data set by machine labeling.

The training module is configured to screen the labeled data whose confidence is greater than or equal to the first preset threshold.

The computing module is configured to evaluate the test result associated with the data in the test data set, so as to determine whether the difference between that test result and a preset reference result associated with the test data is less than a second preset threshold.

In a second aspect, the present application provides a machine learning method, the method comprising:

cyclically executing a target operation until the test result associated with the data in a test data set satisfies a preset condition, the target operation including:

acquiring a data set;

judging whether the data in the data set has been labeled, and if the data in the data set has not been labeled, labeling the data in the data set;

screening the labeled data that meets a first preset threshold, and performing feature extraction on the screened data; inputting the extracted features into a pre-trained model to obtain a trained model, the features being used to train the pre-trained model;

inputting the data in the acquired test data set into the trained model to obtain the test result associated with the data in the test data set, the data in the test data set being data that has already been labeled;

evaluating the test result associated with the data in the test data set to determine whether it satisfies the preset condition.
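
The cyclic target operation of the second aspect can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Sample` class, the callables `acquire`, `machine_label`, `train` and `passes`, and the `max_rounds` safety limit are hypothetical names introduced here, and feature extraction is assumed to take place inside `train`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Model = Callable[[float], int]


@dataclass
class Sample:
    data: float
    label: Optional[int] = None  # None means the sample has not been labeled
    confidence: float = 1.0      # confidence attached to the label


def target_operation_loop(
    acquire: Callable[[], List[Sample]],
    machine_label: Callable[[Sample], Sample],
    train: Callable[[List[Sample]], Model],
    passes: Callable[[Model], bool],
    first_threshold: float,
    max_rounds: int = 5,  # assumption: a safety limit that the source does not mention
) -> Tuple[Optional[Model], int, bool]:
    """Repeat acquire, label, screen, train, evaluate until `passes` holds."""
    model: Optional[Model] = None
    for round_no in range(1, max_rounds + 1):
        dataset = acquire()
        # Label only the samples that are still unlabeled.
        labeled = [s if s.label is not None else machine_label(s) for s in dataset]
        # Screen: keep samples whose confidence reaches the first preset threshold.
        screened = [s for s in labeled if s.confidence >= first_threshold]
        # Feature extraction is assumed to happen inside `train`.
        model = train(screened)
        if passes(model):
            return model, round_no, True
    return model, max_rounds, False
```

The function returns the last model, the number of rounds used, and a flag saying whether the preset condition was met, so a caller can tell a successful exit from hitting the assumed round limit.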

In a third aspect, the present application provides a machine learning apparatus, the apparatus comprising:

a processing unit, configured to cyclically execute a target operation until the test result associated with the data in a test data set satisfies a preset condition;

the processing unit includes:

a first acquisition unit, a judgment unit, a labeling unit, a screening unit, an extraction unit, a training unit, a second acquisition unit and an evaluation unit;

the first acquisition unit, configured to acquire a data set;

the judgment unit, configured to judge whether the data in the data set has been labeled;

the labeling unit, configured to label the data in the data set if that data has not been labeled;

the screening unit, configured to screen the labeled data that meets a first preset threshold;

the extraction unit, configured to perform feature extraction on the screened data;

the training unit, configured to input the extracted features into a pre-trained model to obtain a trained model, the features being used to train the pre-trained model;

the second acquisition unit, configured to input the data in the acquired test data set into the trained model and obtain the test result associated with the data in the test data set, the data in the test data set being data that has already been labeled;

the evaluation unit, configured to evaluate the test result associated with the data in the test data set to determine whether it satisfies the preset condition.

In a fourth aspect, the present application provides a device including an input device, an output device, a processor and a memory that are connected to one another, wherein the memory is configured to store application program code that supports the device in executing the above method, and the processor is configured to execute the machine learning method provided in the second aspect.

In a fifth aspect, the present application provides a computer-readable storage medium for storing one or more computer programs, the one or more computer programs including instructions which, when the computer programs are run on a computer, are used to execute the machine learning method provided in the second aspect.

In a sixth aspect, the present application provides a computer program including machine learning instructions which, when the computer program is executed on a computer, are used to execute the machine learning method provided in the second aspect.

The present application provides a machine learning method, apparatus, device and system. The system includes: a processing module configured to cyclically execute a target operation until the test result associated with the data in a test data set satisfies a preset condition, the processing module including a labeling module, a training module and a computing module; a storage module configured to store the acquired data set and the test data set; the labeling module, configured to label the data in the acquired data set if that data has not been labeled; the training module, configured to screen the labeled data that meets a first preset threshold, perform feature extraction on the screened data, and then input the extracted features into a pre-trained model to obtain a trained model, the features being used to train the pre-trained model; and the computing module, configured to input the data in the acquired test data set into the trained model, obtain the test result associated with that data, and evaluate whether the test result satisfies the preset condition. With the present application, on the one hand, the labeling module and the training module share the stored data (for example, pictures and other data-set data that occupy a large amount of storage space); sharing the data reduces data redundancy and greatly improves the utilization of storage space. On the other hand, less data-set data has to be transmitted between different systems (in the prior art, a labeling system and a training system that are independent of each other and far apart in space), which reduces data transmission delay and increases the speed of model training. In addition, unlabeled data in the data set is labeled automatically by machine, which increases the speed at which the data set is labeled.

Description of the Drawings

To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic architecture diagram of a machine learning system provided by the present application;

FIG. 2 is a schematic flowchart of a machine learning method provided by the present application;

FIG. 3 is a schematic flowchart of another machine learning method provided by the present application;

FIG. 4 is a schematic block diagram of an apparatus provided by an embodiment of the present application;

FIG. 5 is a schematic block diagram of a device provided by the present application.

Detailed Description

The technical solutions in the present application are described clearly and completely below with reference to the drawings in the present application. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.

It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.

It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".

In specific implementations, the device described in the present application includes, but is not limited to, portable devices such as mobile phones, laptop computers or tablet computers having touch-sensitive surfaces (for example, touch-screen displays and/or touchpads). It should also be understood that in some embodiments the device is not a portable communication device but a desktop computer having a touch-sensitive surface (for example, a touch-screen display and/or a touchpad).

In the discussion that follows, a device including a display and a touch-sensitive surface is described. It should be understood, however, that the device may include one or more other physical user interface devices such as a physical keyboard, a mouse and/or a joystick.

The device supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a website creation application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a digital video camera application, a web browsing application, a digital music player application and/or a digital video player application.

The various applications that can be executed on the device may use at least one common physical user interface device, such as the touch-sensitive surface. One or more functions of the touch-sensitive surface, and the corresponding information displayed on the device, may be adjusted and/or changed between applications and/or within a given application. In this way, the common physical architecture of the device (for example, the touch-sensitive surface) can support various applications with user interfaces that are intuitive and transparent to the user.

For a better understanding of the present application, the architecture of a machine learning system to which the present application applies is described below. Referring to FIG. 1, FIG. 1 is an architecture diagram of a machine learning system provided by the present application. As shown in FIG. 1, the system may include, but is not limited to, a web user interface (Website User Interface, WebUi), a workflow module, a training module, a labeling module, a storage module and a computing module. The workflow module may include a state transfer interface (Restful Api), an operator engine (Operator Engine), a directed acyclic graph engine (DAG Engine), a database, an executor and the like.

It should be noted that, in the machine learning system, acquiring the data set that the training module uses to train the pre-trained model may include, but is not limited to, the following steps:

Step 1: Through the state transfer interface of the workflow module, which is based on http requests, a workflow is formed by dragging and dropping components at a Web (World Wide Web) end such as a browser. It should be noted that the workflow includes acquisition of the data set; acquiring the data in the data set is one task in the workflow.

Step 2: The workflow is parsed by the directed acyclic graph engine in the workflow module, and the parsed workflow is stored in the database.

Step 3: When the operator engine in the workflow module receives the scheduling request sent by the scheduling engine of the directed acyclic graph engine, the operator engine reads from the database the tasks in the workflow, including acquisition of the data set.

Step 4: The executor in the workflow module executes the tasks read by the operator engine, such as acquisition of the data set.

Step 5: The acquired data set may be stored in the storage module.
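
Steps 2 to 5 above can be illustrated with a small Python sketch. It is a simplified stand-in rather than the system's actual engines: the functions `parse_workflow` and `run_workflow`, the task names, the file names and the use of plain dictionaries for the database and the storage module are all assumptions made for illustration. The standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) plays the role of the directed acyclic graph engine.

```python
from graphlib import TopologicalSorter


def parse_workflow(edges):
    """Stand-in for the DAG engine: reject cycles and return an execution order."""
    return list(TopologicalSorter(edges).static_order())


def run_workflow(order, tasks, storage):
    """Stand-in for the operator engine plus executor: run each task in order."""
    for name in order:
        tasks[name](storage)
    return storage


# Each key maps a task to the set of tasks it depends on (hypothetical names).
edges = {"acquire_dataset": set(), "store_dataset": {"acquire_dataset"}}
tasks = {
    "acquire_dataset": lambda s: s.update(pending=["img_001", "img_002"]),
    "store_dataset": lambda s: s.update(dataset=s.pop("pending")),
}

database = {"workflow": parse_workflow(edges)}           # step 2: persist the parsed workflow
storage = run_workflow(database["workflow"], tasks, {})  # steps 3 to 5
```

Parsing the workflow into a topological order before storing it means a cyclic workflow is rejected (with `graphlib.CycleError`) before any task runs.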

After the data set is acquired, the target operation may be executed cyclically by modules such as the labeling module, the training module and the computing module. The labeling module is configured to label the data in the acquired data set if that data has not been labeled.

The training module is configured to screen the labeled data that is greater than or equal to the first preset threshold, perform feature extraction on the screened data, and then input the extracted features into the pre-trained model to obtain a trained model; the features are used to train the pre-trained model.

The computing module is configured to input the data in the acquired test data set into the trained model, obtain the test result associated with the data in the test data set, and then evaluate that test result to determine whether it satisfies the preset condition.

It should be noted that, after it is judged whether the data in the data set has been labeled,

the training module is further configured to, if all the data in the data set has been labeled, perform feature extraction on the data in the data set and input the extracted features into the pre-trained model to obtain a trained model; the features are used to train the pre-trained model.

The labeling module is specifically configured to label the data in the data set by machine labeling if that data has not been labeled.

The labeling module is further configured to visualize the data after labeling the data in the data set by machine labeling.

The training module may specifically be configured to screen the labeled data whose confidence is greater than or equal to the first preset threshold.
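
The confidence-based screening can be written as a short Python sketch. The record layout (`id`, `label`, `confidence`), the function name `screen_by_confidence` and the sample values are hypothetical; only the comparison "greater than or equal to the first preset threshold" comes from the description.

```python
def screen_by_confidence(labeled, first_threshold):
    """Keep machine-labeled items whose confidence is >= the first preset threshold."""
    return [item for item in labeled if item["confidence"] >= first_threshold]


# Hypothetical machine-labeling output.
labeled = [
    {"id": "a", "label": 1, "confidence": 0.97},
    {"id": "b", "label": 0, "confidence": 0.55},
    {"id": "c", "label": 1, "confidence": 0.80},
]

# Item "c" sits exactly on the threshold and is kept because the test is inclusive.
kept = screen_by_confidence(labeled, first_threshold=0.80)
```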

In the embodiments of the present application, the system includes: a processing module configured to cyclically execute a target operation until the test result associated with the data in a test data set satisfies a preset condition, the processing module including a labeling module, a training module and a computing module; a storage module configured to store the acquired data set and the test data set; the labeling module, configured to label the data in the acquired data set if that data has not been labeled; the training module, configured to screen the labeled data that meets the first preset threshold, perform feature extraction on the screened data, and then input the extracted features into the pre-trained model to obtain a trained model, the features being used to train the pre-trained model; and the computing module, configured to input the data in the acquired test data set into the trained model, obtain the test result associated with that data, and evaluate whether the test result satisfies the preset condition. With the present application, on the one hand, the labeling module and the training module share the stored data (for example, pictures and other data-set data that occupy a large amount of storage space); sharing the data reduces data redundancy and greatly improves the utilization of storage space. On the other hand, less data-set data has to be transmitted between different systems (in the prior art, a labeling system and a training system that are independent of each other and far apart in space), which reduces data transmission delay and increases the speed of model training. In addition, unlabeled data in the data set is labeled automatically by machine, which increases the speed at which the data set is labeled.
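
The idea of a labeling module and a training module sharing one storage module can be sketched in Python as follows. The class and method names are hypothetical and the in-memory dictionary merely stands in for real storage; the point illustrated is that labels are written in place and read by the trainer without exporting or copying the data set.

```python
class StorageModule:
    """Single store for the data set; both modules below hold a reference to it."""

    def __init__(self):
        self.records = {}  # item id -> {"data": ..., "label": ...}

    def put(self, item_id, data, label=None):
        self.records[item_id] = {"data": data, "label": label}


class LabelingModule:
    def __init__(self, storage):
        self.storage = storage

    def label_unlabeled(self, labeler):
        # Write labels in place; nothing is exported to another system.
        for record in self.storage.records.values():
            if record["label"] is None:
                record["label"] = labeler(record["data"])


class TrainingModule:
    def __init__(self, storage):
        self.storage = storage

    def training_pairs(self):
        # Read the same records the labeling module just updated.
        return [(r["data"], r["label"])
                for r in self.storage.records.values() if r["label"] is not None]


storage = StorageModule()
storage.put("x1", 3)            # unlabeled item
storage.put("x2", -1, label=0)  # item that already carries a label
LabelingModule(storage).label_unlabeled(lambda value: int(value > 0))
pairs = TrainingModule(storage).training_pairs()
```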

Referring to FIG. 2, which is a schematic flowchart of a machine learning method provided by the present application, the method may include at least the following steps:

S201. Acquire a data set containing data.

The data in the data set in the embodiments of the present application may include, but is not limited to, types such as pictures, video, speech or text.

In addition, the data may be divided into raw data and labeled data. A label is an attribute of the data, such as male or female in gender classification; labels are human empirical knowledge and must be provided by people. It should be noted that a trained model can be learned from data, and the trained model can be expressed as some function mapping; that is, there is a certain correspondence between the input and the output of the model. This correspondence can be learned by feeding empirical data (labeled data) to the model, and when presented with new test data, the trained model outputs the test result associated with that data.

With reference to FIG. 1, in the machine learning system, acquiring the data set may include, but is not limited to, the following steps:

Step 3: When the operator engine in the workflow module receives the scheduling request sent by the scheduling engine of the directed acyclic graph engine, the operator engine reads from the database the tasks in the workflow (such as acquisition of the data in the data set).

S202. Determine whether the data in the data set is labeled.

In the embodiments of the present application, annotating the data in the data set means attaching labels to it. For example, in image processing tasks such as age recognition, labeling means drawing a box around a person's face in a picture and recording that person's age; in mask recognition, it means drawing a box around the face in the picture and recording whether the person is wearing a mask. All of these labels are ultimately stored as numerical codes.
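The storage of labels as numerical codes can be illustrated with a minimal sketch. The record layout, the field names, and the code values in `MASK_CODES` are assumptions made for illustration only; the application does not specify them.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical numeric codes for the mask attribute (not given in the application).
MASK_CODES = {"no_mask": 0, "mask": 1}


@dataclass
class FaceLabel:
    box: Tuple[int, int, int, int]  # face bounding box as (x, y, width, height) in pixels
    age: int                        # annotated age in years
    mask_code: int                  # numeric code taken from MASK_CODES


def make_label(box, age, mask):
    """Build a label record, storing the mask attribute as its numeric code."""
    return FaceLabel(box=box, age=age, mask_code=MASK_CODES[mask])


# One annotated face: a box around the face, an age, and a mask flag.
label = make_label((40, 32, 128, 128), 30, "mask")
```

Keeping the mapping from attribute names to codes in one table means the stored records stay purely numeric while remaining readable to annotators.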

It should be noted that, after determining whether the data in the data set is labeled, the method further includes:

if all the data in the data set is labeled, performing feature extraction on the data in the data set;

inputting the extracted features into the pre-training model to obtain a trained model, the features being used to train the pre-training model;

inputting the data in the test data set into the trained model to obtain test results associated with the data in the test data set; and

evaluating the test results to determine whether the test results associated with the data in the test data set satisfy the preset condition.

Specifically, evaluating the test results associated with the data in the test data set to determine whether they satisfy the preset condition may include the following process:

Evaluate whether the difference between the test result associated with the data in the test data set and the preset reference result associated with that test data is less than a second preset threshold. The preset reference result is the reference result specified in advance for that test data.

For example, take a second preset threshold of 5%, and suppose the data in the test data set is a person's face labeled with an age of 30. If the output test result is that, based on the face, the person is 30 years old with a confidence of 50%, while the preset reference result is that the person is 30 years old with a confidence of 95%, then the difference between the test result and the preset reference result (45 percentage points) is not less than the second preset threshold, so the preset condition is not satisfied.
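The comparison in this example can be written as a short sketch. The application does not define how the difference is computed; the code assumes it is the absolute difference between the two confidence values, expressed in percentage points. The function name `within_threshold` is hypothetical.

```python
def within_threshold(test_confidence, reference_confidence, second_threshold):
    """Return True if the gap between test and reference confidence is below the threshold.

    All three arguments are percentages (for example 50 for 50%).
    Assumption: the 'difference' is the absolute difference of the confidences.
    """
    return abs(test_confidence - reference_confidence) < second_threshold


# Values from the example above: 50% test confidence, 95% reference, 5% threshold.
condition_met = within_threshold(50, 95, 5)
```

With these values the gap is 45 percentage points, so `condition_met` is `False`.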

S203. If the data in the data set is not labeled, label the data in the data set.

In the embodiments of the present application, labeling the data in the data set may include, but is not limited to, the following two modes.

Mode 1: If the data in the data set is not labeled, label it by machine.

Mode 2: If the data in the data set is not labeled, label it manually.
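Steps S202 and S203 can be sketched as follows. This is only an illustration under stated assumptions: each item is a dictionary whose `label` is `None` when unlabeled, and `labeler` is a hypothetical callable (a machine model for Mode 1, or a hook into a manual annotation tool for Mode 2) that returns a label together with a confidence.

```python
def label_if_needed(dataset, labeler):
    """Label only the items that are not yet labeled (S202 check, S203 labeling)."""
    for item in dataset:
        if item.get("label") is None:                  # S202: is this item labeled?
            label, confidence = labeler(item["data"])  # S203: label it
            item["label"] = label
            item["confidence"] = confidence
    return dataset


def toy_machine_labeler(data):
    """Stand-in for a real model: always predicts age 20 with 71% confidence."""
    return 20, 0.71


dataset = [
    {"data": "face_001.jpg", "label": None},
    {"data": "face_002.jpg", "label": 35, "confidence": 1.0},
]
label_if_needed(dataset, toy_machine_labeler)
```

Passing the labeler as a parameter lets the same loop serve both modes; already labeled items are left untouched.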

S204. Screen the labeled data for data that meets the first preset threshold.

Specifically, this screening may include the following process:

Select the labeled data whose confidence meets the first preset threshold.

It should be noted that screening on the first preset threshold in this way selects the data with high confidence.

It should be noted that the first preset threshold may be 90%.

Take age recognition as an example. A batch of face data is input and labeled by machine, after which every item in the batch carries an age label. For example, if a person's face data is labeled with an age of 20 and a confidence of 71%, the machine considers that there is a 71% probability that the person is 20 years old. To select high-confidence data, when the first preset threshold is 90%, face data whose confidence exceeds 90% is selected, and face data whose confidence is below 90% is treated as invalid, because training the pre-training model on face data below that confidence level would yield a poorly trained model.
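The screening in S204 amounts to a filter on the labeling confidence. The sketch below assumes each labeled item carries a `confidence` field in the range 0 to 1, and it keeps items at or above the threshold, following the abstract's wording "greater than or equal to the first preset threshold". The function name and the item layout are hypothetical.

```python
def screen_by_confidence(labeled_items, first_threshold=0.90):
    """Keep only the labeled items whose confidence meets the first preset threshold."""
    return [item for item in labeled_items if item["confidence"] >= first_threshold]


labeled_faces = [
    {"face": "face_001.jpg", "age": 20, "confidence": 0.71},  # below 90%: discarded
    {"face": "face_002.jpg", "age": 34, "confidence": 0.95},  # kept
    {"face": "face_003.jpg", "age": 58, "confidence": 0.92},  # kept
]
selected = screen_by_confidence(labeled_faces)
```

Only the second and third faces remain in `selected` and would be passed on to feature extraction.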

S205. Perform feature extraction on the selected data.

In the embodiments of the present application, feature extraction may be performed on the selected data by a neural network. The neural network may include, but is not limited to, a long short-term memory (LSTM) network, a recurrent neural network (RNN), a deep convolutional network, or a deep residual network.

For example, a recurrent neural network may be used to extract features from the selected data whose confidence exceeds 90%.

S206. Input the extracted features into the pre-training model to obtain a trained model; input the data in the pre-stored test data set into the trained model to obtain test results associated with the data in the test data set.

In the embodiments of the present application, the features are used to train the pre-training model, and the data in the test data set is data that has already been labeled.

It should be noted that, taking face data as an example, a test result associated with data in the test data set may be: based on A's face data, A is determined to be 30 years old, with a confidence of 90%.

S207. Evaluate the test results associated with the data in the test data set to determine whether they satisfy the preset condition.

In the embodiments of the present application, the test results associated with the data in the test data set are evaluated to determine whether they satisfy the preset condition. If they do not, steps S201 to S207 are executed again in a loop until the test results associated with the data in the test data set satisfy the preset condition.
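The loop over steps S201 to S207 can be sketched as a driver that takes each step as a callable. Everything here is hypothetical scaffolding: the parameter names, the toy stand-ins in the usage part, and in particular the `max_rounds` safety limit, which the application does not mention (it simply loops until the preset condition is satisfied).

```python
def run_until_satisfied(acquire, label, screen, extract, train, test, evaluate,
                        max_rounds=10):
    """Repeat S201-S207 until evaluate() reports that the preset condition is met."""
    for round_index in range(1, max_rounds + 1):
        dataset = acquire()           # S201: acquire the data set
        labeled = label(dataset)      # S202-S203: label unlabeled data
        selected = screen(labeled)    # S204: keep high-confidence data
        features = extract(selected)  # S205: feature extraction
        model = train(features)       # S206: train the pre-training model
        results = test(model)         # S206: run the test data set
        if evaluate(results):         # S207: preset condition satisfied?
            return model, round_index
    raise RuntimeError("preset condition not met within max_rounds")


# Toy usage: the evaluation only passes on the third round.
rounds_seen = []


def toy_evaluate(results):
    rounds_seen.append(results)
    return len(rounds_seen) >= 3


model, rounds = run_until_satisfied(
    acquire=lambda: [1, 2, 3],
    label=lambda data: [(x, 0.95) for x in data],
    screen=lambda labeled: [x for x, c in labeled if c >= 0.9],
    extract=lambda selected: [float(x) for x in selected],
    train=lambda features: sum(features),
    test=lambda trained: trained,
    evaluate=toy_evaluate,
)
```

Bounding the number of rounds is a design choice of this sketch; without it, a model that never reaches the preset condition would loop forever.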

Specifically, this evaluation may include, but is not limited to, the following two cases.

Case 1: Evaluate whether the difference between the test result associated with the data in the test data set and the preset reference result associated with the test data is less than the second preset threshold.

When the test data set contains a single item of test data, the evaluation determines whether the difference between the test result associated with that item and its preset reference result is less than the second preset threshold.

In age recognition, for example, take face data as the test data and a second preset threshold of 4%. Suppose the test result is that, based on the face, the person is 50 years old with a confidence of 70%, while the preset reference result is that the person is 50 years old with a confidence of 96%. The difference between the test result and the preset reference result (26 percentage points) is then not less than the second preset threshold.

Case 2: Evaluate each test result associated with the data in the test data set separately, to determine whether the differences between the test results and the corresponding preset reference results are all less than the second preset threshold.

In age recognition, for example, take the face data of several people as the test data and a second preset threshold of 7%.

Suppose one item in the test data set is the face of person A, labeled with an age of 30. If the output test result is that, based on the face, A is 30 years old with a confidence of 90%, and A's preset reference result is that A is 30 years old with a confidence of 94%, then the difference between A's test result and A's preset reference result is less than the second preset threshold.

Suppose another item in the test data set is the face of person B, labeled with an age of 60. If the output test result is that, based on the face, B is 60 years old with a confidence of 89%, and B's preset reference result is that B is 60 years old with a confidence of 94%, then the difference between B's test result and B's preset reference result is less than the second preset threshold.

Taking A and B together, both the difference between the test result and the ideal result (the preset reference result) for A and the difference between the test result and the ideal result for B are less than the second preset threshold.
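For a test data set with several items, the check is that every individual difference stays below the second preset threshold. The sketch below reports which items fail, assuming (as the examples suggest, though the application does not say so explicitly) that the difference is the absolute gap between the test and reference confidences in percentage points. The names are hypothetical.

```python
def failing_items(results, second_threshold):
    """Return the names whose |test - reference| confidence gap is not below the threshold.

    `results` maps an item name to a (test_confidence, reference_confidence) pair,
    both given as percentages.
    """
    return [
        name
        for name, (test_conf, reference_conf) in results.items()
        if abs(test_conf - reference_conf) >= second_threshold
    ]


# Values from the example above: A is 90% vs 94%, B is 89% vs 94%, threshold 7%.
results = {"A": (90, 94), "B": (89, 94)}
failures = failing_items(results, 7)
condition_met = not failures
```

Here the gaps are 4 and 5 percentage points, so `failures` is empty and `condition_met` is `True`. Returning the failing names rather than a bare boolean makes it easier to see which test items keep the loop running.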

Another machine learning method is described below.

FIG. 3 shows an exemplary schematic flowchart of a machine learning method.

As shown in FIG. 3, the following target operation is executed in a loop until the derived inference function meets the requirements (that is, until the accuracy of the model's output is high enough). The target operation includes: labeling the data in the data set (Dataset-1) with a data labeling function (Data Labeling Function) to obtain labeled data; performing feature extraction on the labeled data with a preprocessing function (Preprocessing Function) to obtain the corresponding features; inputting these features into the pre-training model (Pretraining Model) to train it and obtain a trained model (Training Function); and inputting the test data set (Test Dataset) into the trained model to obtain an evaluation function (Evaluation Function). Finally, on the one hand, the evaluation function is deployed to obtain a deploy function (Deploy Function); on the other hand, the evaluation function is combined with Dataset-1 to obtain an inference function (Inference Function).

In summary, in the embodiments of the present application, a target operation is executed in a loop until the test results associated with the data in the test data set satisfy the preset condition. The target operation includes: acquiring a data set containing data; determining whether the data in the data set is labeled and, if not, labeling it; screening the labeled data for data that meets the first preset threshold and performing feature extraction on the selected data; inputting the extracted features, which are used to train the pre-training model, into the pre-training model to obtain a trained model; inputting the data in the pre-stored test data set, which is already labeled, into the trained model to obtain the associated test results; and evaluating those test results to determine whether they satisfy the preset condition. With the embodiments of the present application, on the one hand, sharing the data reduces data redundancy and greatly improves the utilization of storage space; on the other hand, large amounts of dataset data no longer need to be transmitted between different systems (in the prior art, a labeling system and a training system that are independent of each other and spatially far apart), which reduces the data transmission delay and increases the speed of model training. In addition, unlabeled data in the data set is labeled automatically by machine, which increases the labeling speed.

It should be noted that FIG. 3 is only used to explain the embodiments of the present application and should not be construed as limiting the present application.

It can be understood that, for related definitions and descriptions not provided in the method embodiment of FIG. 2, reference may be made to the embodiment of FIG. 1; details are not repeated here.

Referring to FIG. 4, a machine learning apparatus provided by the present application is shown. As shown in FIG. 4, the machine learning apparatus 40 includes: a first acquisition unit 401, a judging unit 402, a labeling unit 403, a screening unit 404, an extraction unit 405, a training unit 406, a second acquisition unit 407, and an evaluation unit 408, wherein:

The first acquisition unit 401 is configured to acquire a data set.

The judging unit 402 is configured to determine whether the data in the data set is labeled.

The labeling unit 403 is configured to label the data in the data set if the data in the data set is not labeled.

The screening unit 404 is configured to screen the labeled data for data that meets the first preset threshold.

The extraction unit 405 is configured to perform feature extraction on the selected data.

The training unit 406 is configured to input the extracted features into the pre-training model to obtain a trained model; the features are used to train the pre-training model.

The second acquisition unit 407 is configured to input the acquired data in the test data set into the trained model to obtain test results associated with the data in the test data set; the data in the test data set is data that has already been labeled.

The evaluation unit 408 is configured to evaluate the test results associated with the data in the test data set to determine whether they satisfy the preset condition.

The labeling unit 403 may be specifically configured to:

label the data in the data set by machine if the data in the data set is not labeled.

The screening unit 404 is specifically configured to:

select the labeled data whose confidence meets the first preset threshold.

The evaluation unit 408 is specifically configured to:

evaluate whether the difference between the test result associated with the data in the test data set and the preset reference result associated with the test data is less than the second preset threshold;

or,

evaluate each test result associated with the data in the test data set separately, to determine whether the differences between the test results and the corresponding preset reference results are all less than the second preset threshold.

In summary, in the embodiments of the present application, the apparatus 40 executes the target operation in a loop through a processing unit until the test results associated with the data in the test data set satisfy the preset condition. The processing unit includes:

the first acquisition unit 401, the judging unit 402, the labeling unit 403, the screening unit 404, the extraction unit 405, the training unit 406, the second acquisition unit 407, and the evaluation unit 408. The first acquisition unit 401 is configured to acquire a data set; the judging unit 402 is configured to determine whether the data in the data set is labeled; the labeling unit 403 is configured to label the data in the data set if it is not labeled; the screening unit 404 is configured to screen the labeled data for data that meets the first preset threshold; the extraction unit 405 is configured to perform feature extraction on the selected data; the training unit 406 is configured to input the extracted features, which are used to train the pre-training model, into the pre-training model to obtain a trained model; the second acquisition unit 407 is configured to input the acquired data in the test data set, which is already labeled, into the trained model to obtain the associated test results; and the evaluation unit 408 is configured to evaluate those test results to determine whether they satisfy the preset condition. With the embodiments of the present application, the labeling unit and the training unit share the stored data. On the one hand, sharing the data reduces data redundancy and greatly improves the utilization of storage space; on the other hand, large amounts of dataset data no longer need to be transmitted between different systems (in the prior art, a labeling system and a training system that are independent of each other and spatially far apart), which reduces the data transmission delay and increases the speed of model training. In addition, unlabeled data in the data set is labeled automatically by machine, which increases the labeling speed.

It should be understood that the apparatus 40 is merely one example provided by the embodiments of the present application, and that the apparatus 40 may have more or fewer components than those shown, may combine two or more components, or may be implemented with a different configuration of components.

It can be understood that, for the specific implementation of the functional blocks included in the apparatus 40 of FIG. 4, reference may be made to the embodiments described above for FIG. 1 and FIG. 2; details are not repeated here.

FIG. 5 is a schematic structural diagram of a machine learning device provided by the present application. In the embodiments of the present application, the device may be a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a mobile Internet device (Mobile Internet Device, MID), a smart wearable device (such as a smart watch or smart wristband), or a similar device, which is not limited in the embodiments of the present application. As shown in FIG. 5, the device 50 may include: a baseband chip 501, a memory 502 (one or more computer-readable storage media), and a peripheral system 503. These components may communicate over one or more communication buses 504.

The baseband chip 501 may include one or more processors (CPUs) 505.

The processor 505 may be specifically configured to:

acquire a data set;

determine whether the data in the data set is labeled and, if the data in the data set is not labeled, label the data in the data set;

screen the labeled data for data that meets the first preset threshold, perform feature extraction on the selected data, and input the extracted features into the pre-training model to obtain a trained model, the features being used to train the pre-training model;

input the acquired data in the test data set into the trained model to obtain test results associated with the data in the test data set, the data in the test data set being data that has already been labeled; and

evaluate the test results associated with the data in the test data set to determine whether they satisfy the preset condition.

The memory 502 is coupled to the processor 505 and may be used to store various software programs and/or multiple sets of instructions. In a specific implementation, the memory 502 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The memory 502 may store an operating system (hereinafter referred to as the system), for example an embedded operating system such as ANDROID, IOS, WINDOWS, or LINUX. The memory 502 may also store a network communication program, which may be used to communicate with one or more additional devices, one or more devices, and one or more network devices. The memory 502 may also store a user interface program, which can display the content of an application vividly through a graphical operation interface and receive the user's control operations on the application through input controls such as menus, dialog boxes, and buttons.

It can be understood that the memory 502 may be used to store the code that implements the machine learning method.

The memory 502 may also store one or more application programs. As shown in FIG. 5, these application programs may include: social applications (such as Facebook), image management applications (such as a photo album), map applications (such as Google Maps), browsers (such as Safari and Google Chrome), and so on.

The peripheral system 503 is mainly used to implement interaction between the device 50 and the user or external environment, and mainly includes the input and output devices of the device 50. In a specific implementation, the peripheral system 503 may include: a display screen controller 507, a camera controller 508, and an audio controller 509. Each controller may be coupled to its corresponding peripheral device (such as the display screen 510, the camera 511, and the audio circuit 52). In some embodiments, the display screen may be a display screen configured with a self-capacitive floating touch panel, or a display screen configured with an infrared floating touch panel. In some embodiments, the camera 511 may be a 3D camera. It should be noted that the peripheral system 503 may also include other I/O peripherals.

In summary, in the embodiments of the present application, the device 50 executes the target operation in a loop until the test results associated with the data in the test data set satisfy the preset condition. The target operation includes the following, each performed by the device 50 through the processor 505: acquiring a data set; determining whether the data in the data set is labeled and, if not, labeling it; screening the labeled data for data that meets the first preset threshold; performing feature extraction on the selected data; inputting the extracted features, which are used to train the pre-training model, into the pre-training model to obtain a trained model; inputting the acquired data in the test data set, which is already labeled, into the trained model to obtain the associated test results; and evaluating those test results to determine whether they satisfy the preset condition. With the embodiments of the present application, the data in the data set can be labeled and used for training quickly. On the one hand, sharing the data reduces data redundancy and greatly improves the utilization of storage space; on the other hand, large amounts of dataset data no longer need to be transmitted between different systems (in the prior art, a labeling system and a training system that are independent of each other and spatially far apart), which reduces the data transmission delay and increases the speed of model training. In addition, unlabeled data in the data set is labeled automatically by machine, which increases the labeling speed.

It should be understood that the device 50 is merely one example provided by the embodiments of the present application, and that the device 50 may have more or fewer components than those shown, may combine two or more components, or may be implemented with a different configuration of components.

It can be understood that, for the specific implementation of the functional modules included in the device 50 of FIG. 5, reference may be made to the embodiments of FIG. 1 and FIG. 2; details are not repeated here.

本申请提供一种计算机可读存储介质，该计算机可读存储介质存储有计算机程序，该计算机程序被处理器执行时实现。The present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is implemented when executed by a processor.

该计算机可读存储介质可以是前述任一实施例所述的设备的内部存储单元，例如设备的硬盘或内存。该计算机可读存储介质也可以是设备的外部存储设备，例如设备上配备的插接式硬盘，智能存储卡(Smart Media Card，SMC)，安全数字(Secure Digital，SD)卡，闪存卡(Flash Card)等。进一步的，该计算机可读存储介质还可以既包括设备的内部存储单元也包括外部存储设备。该计算机可读存储介质用于存储计算机程序以及设备所需的其他程序和数据。该计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。The computer-readable storage medium may be an internal storage unit of the device described in any of the foregoing embodiments, such as a hard disk or a memory of the device. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card (Flash card) equipped on the device. Card), etc. Further, the computer-readable storage medium may also include both an internal storage unit of the device and an external storage device. The computer-readable storage medium is used to store computer programs and other programs and data required by the device. The computer-readable storage medium can also be used to temporarily store data that has been or will be output.

The present application also provides a computer program product comprising a non-transitory computer-readable storage medium that stores a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any method described in the above method embodiments. The computer program product may be a software installation package, and the computer includes an electronic device.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the composition and steps of each example have been described according to their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in an actual implementation there may be other ways of dividing them. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited to them. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A machine learning system, characterized by comprising:

a processing module, configured to cyclically execute a target operation until the test result associated with the data in a test data set satisfies a preset condition; the processing module includes a labeling module, a training module, and a computing module;

a storage module, configured to store the acquired data set and test data set;

the labeling module is configured to label the data in the acquired data set if that data is not labeled;

the training module is configured to screen the labeled data that is greater than or equal to a first preset threshold, perform feature extraction on the screened data, and input the extracted features into a pre-training model to obtain a trained model; the features are used to train the pre-training model;

the computing module is configured to input the data in the acquired test data set into the trained model to obtain the test result associated with the data in the test data set, and then to evaluate that test result to determine whether it satisfies the preset condition.

2. The system of claim 1, wherein, after it is determined whether the data in the data set is labeled:

the training module is further configured to, if all the data in the data set is labeled, perform feature extraction on the data in the data set and input the extracted features into the pre-training model to obtain a trained model; the features are used to train the pre-training model;

the computing module is further configured to input the data in the acquired test data set into the trained model to obtain the test result associated with the data in the test data set, and then to evaluate that test result to determine whether it satisfies the preset condition.

3. The system of claim 1, wherein:

The labeling module is configured to label the data in the data set by means of machine labeling if the data in the data set is not labelled.

4. The system of claim 3, wherein:

The labeling module is further configured to visualize the data after labeling the data in the data set by means of machine labeling.

5. The system of claim 1, wherein:

the training module is configured to screen the labeled data whose confidence is greater than or equal to the first preset threshold.
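As an illustration only, and not as part of the claims, the confidence screening described in claim 5 could be sketched as follows. The `LabeledSample` record, its field names, and the function name `screen_by_confidence` are hypothetical; the claims do not specify how confidence is represented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledSample:
    """Hypothetical record for one labeled item (names are assumptions)."""
    data: str          # the raw item, e.g. a file path or text
    label: str         # the label assigned by manual or machine labeling
    confidence: float  # confidence of that label, assumed to lie in [0, 1]


def screen_by_confidence(samples: List[LabeledSample],
                         first_threshold: float) -> List[LabeledSample]:
    """Keep only samples whose confidence is >= the first preset threshold."""
    return [s for s in samples if s.confidence >= first_threshold]
```

Under these assumptions, a sample whose confidence equals the threshold is kept, matching the "greater than or equal to" wording of the claim.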

6. The system of claim 1, wherein:

the computing module is configured to evaluate the test results associated with the data in the test data set, so as to determine whether the difference between those test results and the preset reference results associated with the test data is less than a second preset threshold.
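As an illustration only, the evaluation in claim 6 might look like the sketch below. The claims do not define how the "difference" is measured; this sketch assumes it is the fraction of test items whose result differs from the reference result. The function name `meets_preset_condition` is hypothetical.

```python
from typing import Sequence


def meets_preset_condition(test_results: Sequence[object],
                           reference_results: Sequence[object],
                           second_threshold: float) -> bool:
    """Return True if the mismatch rate is below the second preset threshold.

    Assumption: "difference" is taken to be the fraction of positions where
    the test result and the reference result disagree.
    """
    if not test_results or len(test_results) != len(reference_results):
        raise ValueError("results must be non-empty and of equal length")
    mismatches = sum(1 for t, r in zip(test_results, reference_results) if t != r)
    return mismatches / len(test_results) < second_threshold
```

Other difference measures (for example a mean absolute error for numeric outputs) would fit the same interface; only the body of the function would change.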

7. A machine learning method, comprising:

cyclically executing a target operation until the test result associated with the data in a test data set satisfies a preset condition; the target operation includes:

acquiring a data set;

determining whether the data in the data set is labeled, and if the data in the data set is not labeled, labeling the data in the data set;

screening the labeled data that meets a first preset threshold, and performing feature extraction on the screened data; inputting the extracted features into a pre-training model to obtain a trained model; the features are used to train the pre-training model;

inputting the data in the acquired test data set into the trained model, and obtaining the test result associated with the data in the test data set; the data in the test data set is labeled data;

evaluating the test result associated with the data in the test data set to determine whether it satisfies the preset condition.
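As an illustration only, the cyclic target operation of claim 7 can be sketched as a loop. Everything named here is an assumption rather than something the claims define: the dictionary layout of a sample, the callables passed in, the use of an error rate as the preset condition, and the `max_rounds` safety limit (the claim itself loops until the condition holds).

```python
from typing import Any, Callable, List, Optional, Tuple


def train_until_condition(
    get_dataset: Callable[[int], List[dict]],
    machine_label: Callable[[dict], Tuple[Any, float]],
    extract_features: Callable[[dict], Any],
    fit: Callable[[Any, List[Tuple[Any, Any]]], Any],
    predict: Callable[[Any, Any], Any],
    model: Any,
    test_set: List[dict],
    first_threshold: float,
    second_threshold: float,
    max_rounds: int = 10,
) -> Tuple[Any, Optional[int]]:
    """Repeat the target operation; return (model, rounds used or None)."""
    for round_index in range(max_rounds):
        # Acquire the data set for this round.
        dataset = get_dataset(round_index)

        # Label any sample that is not yet labeled.
        for sample in dataset:
            if sample.get("label") is None:
                sample["label"], sample["confidence"] = machine_label(sample)

        # Screen by the first preset threshold. Assumption: pre-labeled
        # samples without a confidence value count as fully confident.
        screened = [s for s in dataset
                    if s.get("confidence", 1.0) >= first_threshold]

        # Extract features and train the pre-training model on them.
        pairs = [(extract_features(s), s["label"]) for s in screened]
        model = fit(model, pairs)

        # Run the labeled test set through the trained model.
        results = [predict(model, extract_features(s)) for s in test_set]

        # Evaluate: here the preset condition is an error rate below the
        # second threshold (an assumed metric).
        errors = sum(r != s["label"] for r, s in zip(results, test_set))
        if errors / len(test_set) < second_threshold:
            return model, round_index + 1

    # The condition was never met within max_rounds.
    return model, None
```

Passing the labeling, feature extraction, training, and prediction steps as callables keeps the loop independent of any particular framework, which mirrors the modular split into labeling, training, and computing modules in the system claims.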

8. A machine learning apparatus, comprising:

a processing unit, configured to cyclically execute a target operation until the test result associated with the data in a test data set satisfies a preset condition;

the processing unit includes:

a first acquisition unit, a judgment unit, a labeling unit, a screening unit, an extraction unit, a training unit, a second acquisition unit, and an evaluation unit;

the first acquisition unit is configured to acquire a data set;

the judgment unit is configured to determine whether the data in the data set is labeled;

the labeling unit is configured to label the data in the data set if the data in the data set is not labeled;

the screening unit is configured to screen the labeled data that meets a first preset threshold;

the extraction unit is configured to perform feature extraction on the screened data;

the training unit is configured to input the extracted features into a pre-training model to obtain a trained model; the features are used to train the pre-training model;

the second acquisition unit is configured to input the data in the acquired test data set into the trained model and obtain the test result associated with the data in the test data set; the data in the test data set is labeled data;

the evaluation unit is configured to evaluate the test result associated with the data in the test data set to determine whether it satisfies the preset condition.

9. A machine learning device, comprising: an input device, an output device, a memory, and a processor coupled to the memory, wherein the input device, the output device, the processor, and the memory are connected to each other, the memory is used to store application program code, and the processor is configured to invoke the program code to execute the machine learning method of claim 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the machine learning method of claim 7.