
CN118568256B - Method and device for evaluating text classification performance of large language model - Google Patents

Method and device for evaluating text classification performance of large language model

Info

Publication number
CN118568256B
Authority
CN
China
Prior art keywords
evaluation
text
task
language model
large language
Prior art date
Legal status
Active
Application number
CN202410578939.XA
Other languages
Chinese (zh)
Other versions
CN118568256A
Inventor
张艺琼
姜涛
石东升
Current Assignee
Beijing Duyou Information Technology Co., Ltd.
Original Assignee
Beijing Duyou Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Duyou Information Technology Co., Ltd.
Priority to CN202410578939.XA
Publication of CN118568256A
Application granted
Publication of CN118568256B

Links

Classifications

    • G06F16/35 Information retrieval of unstructured textual data: clustering; classification
    • G06F16/3329 Information retrieval of unstructured textual data: natural language query formulation
    • G06F16/345 Information retrieval of unstructured textual data: summarisation for human users
    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F40/295 Handling natural language data: named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present disclosure provides a method and device for evaluating the text classification performance of a large language model, and relates to the fields of artificial intelligence technology such as large language models, natural language processing, and deep learning. The method for evaluating the text classification performance of a large language model includes: obtaining an evaluation data set corresponding to a source evaluation task and an evaluation data set corresponding to at least one sub-evaluation task; obtaining evaluation input data according to the evaluation text and the evaluation task type corresponding to the evaluation text, inputting the evaluation input data into the large language model to be evaluated, and using the output result of the large language model to be evaluated as the predicted answer corresponding to the evaluation text for different evaluation task types; obtaining a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to at least one sub-evaluation task according to the labeled answer and the predicted answer corresponding to the same evaluation task type of the evaluation text; obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result.

Description

Method and device for evaluating text classification performance of large language model
Technical Field
The disclosure relates to the technical field of internet, in particular to the technical field of artificial intelligence such as large language model, natural language processing, deep learning and the like. Provided are a method and a device for evaluating text classification performance of a large language model, an electronic device and a readable storage medium.
Background
A large language model (Large Language Model, LLM) is a generative model that generates corresponding reply content based on input data and can handle a variety of natural language tasks, including text classification.
For large language models, how to evaluate the performance of a model in text classification is a fundamental issue. The prior art usually evaluates only by setting a text classification task for the large language model to execute; however, this is a coarse-grained evaluation, and the accuracy of the resulting assessment of text classification performance is low.
Disclosure of Invention
According to a first aspect of the disclosure, a method for evaluating the text classification performance of a large language model is provided. The method includes: obtaining an evaluation data set corresponding to a source evaluation task and an evaluation data set corresponding to at least one sub-evaluation task, wherein the source evaluation task is a text classification task, the sub-evaluation task comprises an entity identification task and a summary generation task, and different evaluation data sets comprise evaluation texts and labeled answers of the evaluation texts corresponding to different evaluation task types; obtaining evaluation input data according to the evaluation texts and the evaluation task types corresponding to the evaluation texts, inputting the evaluation input data into the large language model to be evaluated, and taking the output results of the large language model to be evaluated as the predicted answers of the evaluation texts corresponding to different evaluation task types; obtaining a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to the at least one sub-evaluation task according to the labeled answers and the predicted answers of the evaluation texts corresponding to the same evaluation task type; and obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result.
According to a second aspect of the present disclosure, a device for evaluating the text classification performance of a large language model is provided, which includes: an acquisition unit for acquiring an evaluation data set corresponding to a source evaluation task and an evaluation data set corresponding to at least one sub-evaluation task, wherein the source evaluation task is a text classification task, the sub-evaluation task comprises an entity identification task and a summary generation task, and different evaluation data sets comprise evaluation texts and labeled answers of the evaluation texts corresponding to different evaluation task types; a prediction unit for obtaining evaluation input data according to the evaluation texts and the evaluation task types corresponding to the evaluation texts, inputting the evaluation input data into the large language model to be evaluated, and taking the output results of the large language model to be evaluated as the predicted answers of the evaluation texts corresponding to different evaluation task types; a processing unit for obtaining a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to the at least one sub-evaluation task according to the labeled answers and the predicted answers of the evaluation texts corresponding to the same evaluation task type; and an evaluation unit for obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result.
According to a third aspect of the present disclosure, there is provided an electronic device comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device for implementing a method of evaluating text classification performance of a large language model in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the method for evaluating text classification performance of the large language model of the present embodiment specifically includes the following steps:
s101, acquiring an evaluation data set corresponding to a source evaluation task and an evaluation data set corresponding to at least one sub-evaluation task, wherein the source evaluation task is a text classification task, the sub-evaluation task comprises an entity identification task and a summary generation task, and different evaluation data sets comprise evaluation texts and labeling answers of the evaluation texts corresponding to different evaluation task types;
S102, obtaining evaluation input data according to evaluation texts and evaluation task types corresponding to the evaluation texts, inputting the evaluation input data into a large language model to be evaluated, and taking an output result of the large language model to be evaluated as a prediction answer of the evaluation texts corresponding to different evaluation task types;
S103, obtaining a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to the at least one sub-evaluation task according to the marked answers and the predicted answers of the evaluation text corresponding to the same evaluation task type;
s104, obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result.
According to the method of this embodiment for evaluating the text classification performance of a large language model, an entity identification task and/or a summary generation task is introduced in the evaluation process in addition to the text classification task. When a large language model classifies text, the classification accuracy depends heavily on how well the model understands the text: the better the understanding, the more accurate the classification. By additionally setting diversified sub-evaluation tasks related to text understanding alongside the text classification task serving as the source evaluation task, the large language model is evaluated at fine granularity from different angles, so the evaluation accuracy of the text classification performance can be improved.
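To make this four-step flow concrete, the following is a minimal Python sketch of S101-S104; all names here (llm, build_input, score, weights) are hypothetical placeholders for illustration and are not defined by the disclosure.

```python
from typing import Callable

def evaluate_text_classification(
    llm: Callable[[str], str],              # large language model to be evaluated
    datasets: dict,                         # task type -> [{"text": ..., "answer": ...}]
    build_input: Callable[[str, str], str], # (evaluation text, task type) -> input data
    score: Callable,                        # per-task evaluation metric
    weights: dict,                          # task type -> weight value
) -> float:
    results = {}
    for task, samples in datasets.items():  # S101: one evaluation data set per task
        preds = [llm(build_input(s["text"], task)) for s in samples]  # S102: predicted answers
        golds = [s["answer"] for s in samples]
        results[task] = score(task, golds, preds)                     # S103: per-task result
    # S104: combine the source result and sub-results into one performance score
    return sum(weights[task] * r for task, r in results.items())
```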
In this embodiment, the large language model to be evaluated is a generative large language model (Large Language Model, LLM). The large language model is a deep learning model trained by using a large amount of text data, and can process various natural language tasks such as text classification, text generation, entity recognition and the like.
In this embodiment, the different evaluation data sets obtained by executing S101 correspond to different evaluation task types, where the evaluation task types include a text classification task and at least one of an entity identification task and a summary generation task; the text classification task may be further divided into a binary classification task and a multi-class classification task, and each evaluation data set includes at least one evaluation text and at least one labeled answer of the evaluation text corresponding to the current evaluation task type.
The evaluation data set corresponding to the source evaluation task acquired in S101 in this embodiment may be referred to as a source evaluation data set, and the evaluation data set corresponding to a sub-evaluation task may be referred to as a sub-evaluation data set.
For example, if the evaluation data set corresponding to the source evaluation task is dataset1, the labeled answers included in dataset1 are the classification labels corresponding to each evaluation text. If the sub-evaluation task is the entity identification task and its evaluation data set is dataset2, the labeled answers included in dataset2 are the entities contained in each evaluation text. If the sub-evaluation task is the summary generation task and its evaluation data set is dataset3, the labeled answers included in dataset3 are the summary texts generated from each evaluation text.
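As an illustration of how these three evaluation data sets might be laid out, the following sketch uses assumed field names ("text", "answer") that are not mandated by the disclosure:

```python
dataset1 = [  # source evaluation task: text classification
    {"text": "The phone is cost-effective and runs smoothly.",
     "answer": "product advantage evaluation"},      # labeled classification label
]
dataset2 = [  # sub-evaluation task: entity identification
    {"text": "The phone is cost-effective and runs smoothly.",
     "answer": ["phone"]},                           # labeled entities
]
dataset3 = [  # sub-evaluation task: summary generation
    {"text": "The phone is cost-effective and runs smoothly.",
     "answer": "A positive review of a phone."},     # labeled summary text
]
```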
The source evaluation tasks in the embodiment may also correspond to different classification scenes, for example, in a text classification scene of product evaluation, classification tags included in an evaluation data set of the corresponding source evaluation task may be "product advantage evaluation", "product defect feedback", "product improvement suggestion", "shopping experience sharing", and "shopping query help" and the like, and in a text classification scene of content security, classification tags included in an evaluation data set of the corresponding source evaluation task may be "include sensitive words" and "do not include sensitive words".
It can be understood that, in this embodiment, the labeling answer included in the evaluation data set obtained in S101 may be obtained by processing the corresponding evaluation text by using a large language model, or may be obtained by using a manual labeling method, or may be obtained by combining the large language model with the manual labeling method.
That is, when evaluating the text classification performance of the large language model, the embodiment sets the text classification task as the source evaluation task, and additionally sets the entity recognition task and/or the abstract generation task related to text understanding as the sub-evaluation task, so that the text classification performance of the large language model is comprehensively evaluated in combination with the understanding degree of the large language model on the text in terms of entity recognition and/or abstract generation, and compared with the way of evaluating the large language model by only setting the text classification task, the embodiment can effectively improve the evaluation accuracy of the text classification performance.
In the embodiment, when executing S101, only the entity identification task may be used as a sub-evaluation task, only the summary generation task may be used as a sub-evaluation task, or both the entity identification task and the summary generation task may be used as sub-evaluation tasks.
In the embodiment, after evaluation data sets corresponding to different evaluation task types are obtained by executing S101, evaluation input data is obtained by executing S102 according to an evaluation text and the evaluation task type corresponding to the evaluation text, the evaluation input data is input into the large language model to be evaluated, and the output result of the large language model to be evaluated is used as the predicted answer of the evaluation text corresponding to different evaluation task types, wherein the evaluation task type corresponding to the evaluation text is the evaluation task type corresponding to the evaluation data set to which the evaluation text belongs.
It can be understood that when executing S102, the embodiment may select one large language model as the large language model to be evaluated, or may select a plurality of large language models as the large language model to be evaluated, that is, the embodiment may evaluate the text classification performance of only one large language model, or may evaluate the text classification performance of a plurality of large language models at the same time, and the embodiment may display a plurality of large language models, and further use the large language model selected by the input end as the large language model to be evaluated.
In the embodiment, when S102 is executed to obtain the evaluation input data according to the evaluation text and the evaluation task type corresponding to the evaluation text, the following implementation may be adopted: determining the evaluation task type corresponding to the evaluation text, the evaluation task type in this embodiment being one of a text classification task, an entity identification task and a summary generation task; obtaining a prompt text (i.e., a prompt) according to the determined evaluation task type; and obtaining the evaluation input data according to the evaluation text and the obtained prompt text.
For example, if the evaluation task type is a text classification task, the prompt text obtained in the embodiment of S102 may be "please classify the following text data and output a classification label", if the evaluation task type is an entity identification task, the prompt text obtained in the embodiment of S102 may be "please find out an entity from the following text data and output", and if the evaluation task type is a summary generation task, the prompt text obtained in the embodiment of S102 may be "please generate a summary from the following text data and output".
For example, if the evaluation text is "the mobile phone has high cost performance, smooth operation and excellent photographing effect", the evaluation input data obtained by executing S102 in this embodiment may be "please classify the following text data and output a classification label: the mobile phone has high cost performance, smooth operation and excellent photographing effect".
In the embodiment, when S102 is executed to obtain the evaluation input data according to the evaluation text and the evaluation task type corresponding to the evaluation text, the evaluation task type corresponding to the evaluation text may also be used as a suffix and spliced with the evaluation text, and the splicing result may be used as the evaluation input data; for example, "evaluation text+text classification", "evaluation text+entity identification" or "evaluation text+abstract generation" may be used as the evaluation input data.
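The two ways of building evaluation input data described above (prompt text versus suffix splicing) can be sketched as follows; the exact prompt wordings are assumptions modeled on the examples, not fixed by the disclosure:

```python
PROMPTS = {
    "text classification": "Please classify the following text data and output a classification label: ",
    "entity identification": "Please find the entities in the following text data and output them: ",
    "summary generation": "Please generate a summary of the following text data and output it: ",
}

def build_input_with_prompt(text: str, task_type: str) -> str:
    # prompt-text variant: prepend the task-specific prompt to the evaluation text
    return PROMPTS[task_type] + text

def build_input_with_suffix(text: str, task_type: str) -> str:
    # suffix variant: splice the task type after the evaluation text,
    # e.g. "evaluation text+text classification"
    return f"{text}+{task_type}"
```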
In this embodiment, after executing S102 to input the obtained evaluation input data into the large language model to be evaluated, the large language model to be evaluated may perform corresponding processing according to the input evaluation input data, so as to output the predicted answers corresponding to different evaluation task types by the evaluation text.
In this embodiment, after executing S102 to obtain predicted answers corresponding to different evaluation task types by the evaluation text, executing S103 to obtain a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to at least one sub-evaluation task according to the labeled answer and the predicted answer corresponding to the same evaluation task type by the evaluation text.
When executing S103, the embodiment may first obtain a labeling answer corresponding to the evaluation text from the evaluation data set to which the evaluation text belongs, and then obtain an evaluation result corresponding to different evaluation task types according to the obtained labeling answer and a predicted answer output by the to-be-evaluated large language model.
For example, when S103 is performed to obtain the source evaluation result corresponding to the source evaluation task, evaluation indexes such as accuracy, precision, recall and F1-score may be used to perform matching calculation on the labeled answers and the predicted answers corresponding to the evaluation texts, and the calculation result may be used as the source evaluation result corresponding to the source evaluation task.
For example, if the sub-evaluation task is the entity identification task, when executing S103 to obtain the sub-evaluation result of the corresponding sub-evaluation task, evaluation indexes such as accuracy and recall may be used to perform matching calculation on the labeled answers and the predicted answers corresponding to the evaluation texts, and the calculation result is used as the sub-evaluation result corresponding to the entity identification task.
For example, if the sub-evaluation task is the summary generation task, when executing S103 to obtain the sub-evaluation result of the corresponding sub-evaluation task, evaluation indexes such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and bert_score_f1 may be used to perform matching calculation on the labeled answers and the predicted answers corresponding to the evaluation texts, and the calculation result is used as the sub-evaluation result corresponding to the summary generation task.
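As a hedged sketch of these per-task metrics, the functions below implement exact-match accuracy for the source task, set-based matching for entity identification, and a minimal unigram ROUGE-1 F1 that stands in for a full ROUGE/bert_score_f1 toolkit:

```python
def classification_accuracy(golds: list, preds: list) -> float:
    # source evaluation task: fraction of predicted labels matching labeled answers
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

def entity_match(gold: set, pred: set) -> tuple:
    # entity identification task: the accuracy index is implemented here as the
    # matched fraction of predicted entities and recall as the matched fraction
    # of labeled entities (an assumption about the intended definitions)
    tp = len(gold & pred)
    accuracy = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return accuracy, recall

def rouge1_f1(gold: str, pred: str) -> float:
    # summary generation task: simple unigram-overlap ROUGE-1 F1
    g, p = gold.split(), pred.split()
    overlap = len(set(g) & set(p))
    if not overlap:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)
```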
In this embodiment, after executing S103 to obtain a source evaluation result corresponding to a source evaluation task and at least one sub-evaluation result corresponding to at least one sub-evaluation task, executing S104 to obtain text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result.
In the embodiment, when S104 is executed to obtain the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result, the sum of the source evaluation result and the at least one sub-evaluation result may be used as the text classification performance of the large language model to be evaluated.
When executing S104, this embodiment may further obtain weight values corresponding to different evaluation task types, and obtain the text classification performance of the large language model to be evaluated according to the weight value and the source evaluation result corresponding to the source evaluation task, and the weight values and the at least one sub-evaluation result corresponding to the at least one sub-evaluation task.
That is, in this embodiment, by setting weight values corresponding to different evaluation tasks, text classification performance of the large language model is obtained more accurately according to evaluation results corresponding to different evaluation tasks.
In order to further highlight the significance of the text classification task in evaluating the text classification performance of the large language model, the weight value corresponding to the source evaluation task in the embodiment is larger than the weight value corresponding to the sub-evaluation task, and the weight values corresponding to different sub-evaluation tasks can be the same or different.
For example, if the weight value of the corresponding source evaluation task is the weight value 1, the weight value of the corresponding entity identification task is the weight value 2, the weight value of the corresponding abstract generation task is the weight value 3, and if the source evaluation result of the corresponding source evaluation task is the result 1, the sub-evaluation result of the corresponding entity identification task is the result 2, and the sub-evaluation result of the corresponding abstract generation task is the result 3, the calculation result obtained by (result 1×weight value 1+result 2×weight value 2+result 3×weight value 3) may be used as the text classification performance of the large language model to be evaluated when executing S104.
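Plugging illustrative numbers into this weighted combination (the concrete weights and results below are made up for the example, with the source-task weight kept largest):

```python
weights = {"text classification": 0.6, "entity identification": 0.2, "summary generation": 0.2}
results = {"text classification": 0.90, "entity identification": 0.85, "summary generation": 0.80}

# result 1 x weight value 1 + result 2 x weight value 2 + result 3 x weight value 3
performance = sum(results[task] * weights[task] for task in results)
print(performance)  # 0.87
```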
The number of the large language models to be evaluated in the text classification performance evaluation of the embodiment may be multiple, so that after executing S104 to obtain the text classification performance of each large language model to be evaluated, the embodiment may further select, according to the obtained text classification performance, one of the large language models to be evaluated with the optimal text classification performance (for example, the highest score) as the target large language model, and in the execution of S104, the embodiment may further display the selected target large language model at the input end.
In addition, in this embodiment, after the step S104 is performed to obtain the text classification performance of the large language model to be evaluated, the source evaluation result and at least one sub-evaluation result may be output together, in addition to outputting the obtained text classification performance.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in fig. 2, the present embodiment may further include the following:
S201, acquiring an evaluation text corresponding to a target classification scene;
S202, obtaining initial input data according to the evaluation text and different evaluation task types, and inputting the initial input data into a candidate large language model to obtain labeled answers, which are output by the candidate large language model aiming at the evaluation text, of the corresponding different evaluation task types;
S203, obtaining evaluation data sets corresponding to different evaluation task types according to the evaluation text and the labeling answers of the evaluation text corresponding to the different evaluation task types.
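A hedged sketch of this S201-S203 construction flow; candidate_llm and prompts are assumed helpers, and the manual audit of the model outputs (described below) is omitted for brevity:

```python
def build_datasets(texts: list, candidate_llm, prompts: dict) -> dict:
    datasets = {task: [] for task in prompts}
    for text in texts:                             # S201: evaluation texts
        for task, prompt in prompts.items():
            initial_input = prompt + text          # S202: initial input data
            answer = candidate_llm(initial_input)  # labeled answer from the candidate model
            datasets[task].append({"text": text, "answer": answer})
    return datasets                                # S203: one evaluation data set per task
```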
In this embodiment, the target classification scene may be a text classification scene for product evaluation, or may be a text classification scene for content security, that is, classifying whether the text contains a sensitive word.
If the target classification scene is a text classification scene for product evaluation, when executing S201 this embodiment can acquire, as evaluation texts, the evaluation texts written by users after completing the purchase of goods or services; if the target classification scene is a content-security text classification scene, when executing S201 this embodiment can acquire users' posts or comments on the network as evaluation texts.
In the embodiment, when executing S201, text preprocessing such as traditional-to-simplified Chinese conversion, full-width to half-width conversion, case conversion, removal of URLs, and replacement of consecutive whitespace may also be performed on the obtained evaluation texts, so as to obtain high-quality evaluation texts.
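A plausible implementation of these preprocessing steps is sketched below; the traditional-to-simplified Chinese step is noted but left out, since it would need an external library such as OpenCC:

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    # traditional-to-simplified Chinese conversion would go here (e.g. via OpenCC)
    text = unicodedata.normalize("NFKC", text)   # full-width to half-width conversion
    text = text.lower()                          # case conversion
    text = re.sub(r"https?://\S+", "", text)     # remove URLs
    text = re.sub(r"\s+", " ", text).strip()     # replace consecutive whitespace
    return text
```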
In this embodiment, the different evaluation task types include at least one of an entity recognition task and a digest generation task, and a text classification task.
In the embodiment, when the step S202 is executed to obtain initial input data according to the evaluation text and different evaluation task types, the initial input data may be obtained according to the evaluation text and the prompt text corresponding to the different evaluation task types, or the initial input data may be obtained directly according to the evaluation text and the different evaluation task types.
Because the output data may not be in a standard format when the large language model outputs corresponding output data according to the initial input data, when executing S202 to obtain the labeled answers corresponding to different evaluation task types output by the candidate large language model for the evaluation text, this embodiment may also manually audit the output results of the candidate large language model, and then use the output results that pass the audit as the labeled answers of the evaluation text corresponding to different evaluation task types.
In this embodiment, the number of the evaluation texts may be multiple, so in the execution of S203, in this embodiment, for each evaluation task type, according to different evaluation texts and the labeling answers of the different evaluation texts corresponding to the current evaluation task type, an evaluation dataset corresponding to the current evaluation task type may be obtained.
The evaluation data set corresponding to the text classification task contains evaluation texts corresponding to different classification labels; if the numbers of evaluation texts corresponding to different classification labels differ greatly, the evaluation accuracy of the large language model to be evaluated may be affected.
Therefore, in this embodiment, after executing S203 to obtain the evaluation data set corresponding to the text classification task, clustering-based uniform sampling is performed on the evaluation texts of all the classification labels, so as to ensure that the number of evaluation texts corresponding to each classification label in the evaluation data set is balanced.
In the embodiment, when executing S203, a cluster center of the evaluation texts corresponding to each class label may be determined first, and then a preset number of evaluation texts near the cluster center may be selected as the evaluation texts corresponding to each class label, for example, 5000 evaluation texts near the cluster center may be selected.
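A minimal sketch of this per-label sampling, assuming each evaluation text has already been embedded as a vector (the embedding model itself is outside the disclosure):

```python
import numpy as np

def sample_per_label(texts: list, vectors: np.ndarray, k: int = 5000) -> list:
    center = vectors.mean(axis=0)                     # cluster center for this label
    dists = np.linalg.norm(vectors - center, axis=1)  # distance of each text to the center
    nearest = np.argsort(dists)[:k]                   # k evaluation texts near the center
    return [texts[i] for i in nearest]
```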
Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. As shown in fig. 3, when executing S102 "take the output result of the large language model to be evaluated as the predicted answer of the evaluation text corresponding to different evaluation task types", the embodiment may further include the following:
S301, constructing a label syntax tree according to the classification labels corresponding to the target classification scenes;
s302, under the condition that the evaluation task type is determined to be a text classification task, carrying out text analysis on the output result of the large language model to be evaluated to obtain an analysis result;
S303, extracting a part of the analysis result, which is matched with the tag syntax tree, from the output result to serve as the prediction answer.
When the to-be-evaluated large language model in the embodiment executes the text classification task, the output result obtained according to the input data may be only the text corresponding to the classification label, or may include other text besides the text corresponding to the classification label, so that when the evaluation task type is determined to be the text classification task, the embodiment accurately obtains the text corresponding to the classification label from the output result by matching the analysis result of the output result with the label syntax tree.
In this embodiment, when executing S301, the embodiment may first determine the target classification scene, then acquire the classification label corresponding to the determined target classification scene, and finally construct a label syntax tree according to the acquired classification label.
For example, if the target classification scene is a text classification scene of product evaluation, class labels such as "product advantage evaluation", "product defect feedback", "product improvement suggestion", "shopping experience sharing" and "shopping query help" may be obtained, and the label syntax tree constructed according to the classification label in the embodiment of S301 may be:
S -> Product | Shopping
Product -> "Product" Attribute1
Attribute1 -> "advantage evaluation" | "defect feedback" | "improvement suggestion"
Shopping -> "Shopping" Attribute2
Attribute2 -> "experience sharing" | "query help"
In the embodiment, when executing S302, the parsing result corresponding to the output result may be obtained by using the syntax analyzer to perform text parsing on the output result.
For example, if the output result of the large language model to be evaluated is "according to your input, the classification label of the text is product advantage evaluation", after the parse result of the output result is matched with the label syntax tree, only the part "product advantage evaluation" that matches both is extracted from the output result as the predicted answer.
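A simplified stand-in for this matching step is sketched below: instead of a full syntax parser, it scans the output for any substring that matches a known classification label; a real implementation would parse against the label syntax tree shown above.

```python
from typing import Optional

LABELS = [
    "product advantage evaluation", "product defect feedback",
    "product improvement suggestion", "shopping experience sharing",
    "shopping query help",
]

def extract_label(output: str) -> Optional[str]:
    for label in LABELS:
        if label in output:
            return label   # the part of the output matching the label syntax tree
    return None            # no match: the output result is treated as invalid

print(extract_label("According to your input, the classification label "
                    "of the text is product advantage evaluation"))
# -> product advantage evaluation
```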
Fig. 4 is a schematic diagram according to a fourth embodiment of the present disclosure. As shown in fig. 4, after executing S104 "to obtain the text classification performance of the large language model to be evaluated", the present embodiment may further include the following:
S401, under the condition that the large language model to be evaluated is determined to be a target large language model according to the text classification performance, acquiring a text to be classified corresponding to a target classification scene;
S402, inputting the text to be classified into the target large language model to perform text classification, and taking the output result of the target large language model as the classification result of the text to be classified.
That is, in this embodiment, after the evaluation of the text classification performance of the large language model to be evaluated is completed, whether the large language model to be evaluated is the target large language model can be determined according to the obtained text classification performance, and then after the determination, the obtained text to be classified corresponding to the target classification scene is input into the target large language model for text classification, so that the accuracy of the obtained text classification result can be improved.
When executing S401, if there is only one large language model to be evaluated, the large language model to be evaluated may be determined to be the target large language model under the condition that its text classification performance is determined to exceed a preset threshold; if there are multiple large language models to be evaluated, when executing S401 this embodiment may use the large language model to be evaluated with the best text classification performance as the target large language model.
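Both selection rules can be sketched together; the 0.8 threshold is an arbitrary illustrative value, not one given by the disclosure:

```python
from typing import Optional

def select_target_model(performances: dict, threshold: float = 0.8) -> Optional[str]:
    if len(performances) == 1:
        # single candidate: accept it only if it exceeds the preset threshold
        name, perf = next(iter(performances.items()))
        return name if perf > threshold else None
    # multiple candidates: take the one with the best text classification performance
    return max(performances, key=performances.get)
```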
It can be understood that, when executing S402, this embodiment may obtain classification input data according to the text to be classified and the text classification task, and then input the classification input data into the target large language model; in the case that the target large language model has been set to execute only the text classification task, only the text to be classified may be input into the target large language model.
Fig. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. FIG. 5 shows a flow chart of this embodiment when constructing evaluation data sets corresponding to different evaluation task types:
S501, acquiring evaluation texts;
S502, preprocessing the acquired evaluation texts;
S503, performing uniform clustering sampling on the evaluation texts belonging to different classification labels when the acquired evaluation texts have corresponding classification labels;
S504, setting evaluation tasks of different types;
S505, setting prompt texts corresponding to the different evaluation task types;
S506, using a candidate large language model to perform fine-grained labeling on the evaluation texts for the different types of evaluation tasks;
S507, obtaining labeled answers of the evaluation texts corresponding to the different evaluation task types according to the output results of the candidate large language model;
S508, constructing evaluation data sets corresponding to the different evaluation task types according to the evaluation texts and their labeled answers.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. FIG. 6 shows a flow chart of this embodiment when evaluating the text classification performance of a large language model:
S601, selecting evaluation tasks of different types;
S602, selecting the large language model to be evaluated;
S603, reading the evaluation data set corresponding to the selected evaluation tasks;
S604, calling the selected large language model to be evaluated to generate predicted answers for the evaluation texts in the evaluation data set;
S605, constructing a label syntax tree according to the classification labels when the evaluation task type is determined to be the text classification task;
S606, taking the output result of the large language model to be evaluated as the predicted answer when the evaluation task type is determined not to be the text classification task;
S607, when the output result of the large language model to be evaluated is determined to be valid according to the label syntax tree (namely, the parse result of the output result exists in the label syntax tree), selecting the matched part from the output result as the predicted answer; otherwise, treating the output result as invalid (which can be recorded as a labeling error);
S608, obtaining evaluation results corresponding to the different evaluation task types according to the labeled answers and the predicted answers of the evaluation texts, and obtaining the text classification performance of the large language model to be evaluated.
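Condensing S601-S608 for the text classification case into code, with invalid outputs recorded as described in S607 (the helper names echo the hypothetical sketches above):

```python
def run_classification_eval(llm, dataset, build_input, extract_label, score):
    golds, preds, invalid = [], [], 0
    for sample in dataset:                                   # S603: read the evaluation data set
        output = llm(build_input(sample["text"], "text classification"))  # S604: call the model
        label = extract_label(output)                        # S605/S607: label syntax tree match
        if label is None:
            invalid += 1                                     # S607: record the invalid output
            continue
        golds.append(sample["answer"])
        preds.append(label)
    return score(golds, preds), invalid                      # S608: evaluation result
```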
Fig. 7 is a schematic diagram according to a seventh embodiment of the present disclosure. As shown in fig. 7, an evaluation apparatus 700 for text classification performance of a large language model of the present embodiment includes:
the acquisition unit 701 is configured to acquire an evaluation data set corresponding to a source evaluation task and an evaluation data set corresponding to at least one sub-evaluation task, wherein the source evaluation task is a text classification task, the sub-evaluation task comprises an entity identification task and a summary generation task, and different evaluation data sets comprise evaluation texts and labeled answers of the evaluation texts corresponding to different evaluation task types;
the prediction unit 702 is configured to obtain evaluation input data according to an evaluation text and the evaluation task type corresponding to the evaluation text, input the evaluation input data into the large language model to be evaluated, and use the output result of the large language model to be evaluated as the predicted answer of the evaluation text corresponding to different evaluation task types;
The processing unit 703 is configured to obtain a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to the at least one sub-evaluation task according to the labeled answer and the predicted answer of the evaluation text corresponding to the same evaluation task type;
And the evaluation unit 704 is configured to obtain a text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result.
The different evaluation data sets acquired by the acquisition unit 701 correspond to different evaluation task types; the evaluation task types include a text classification task and at least one of an entity identification task and a summary generation task, the text classification task may be further divided into a binary classification task and a multi-class classification task, and each evaluation data set includes at least one evaluation text and at least one labeled answer of the evaluation text corresponding to the current evaluation task type.
The evaluation data set corresponding to the source evaluation task acquired by the acquiring unit 701 may be referred to as a source evaluation data set, and the evaluation data set corresponding to a sub-evaluation task may be referred to as a sub-evaluation data set.
The source evaluation tasks in the embodiment may also correspond to different classification scenes, for example, in a text classification scene of product evaluation, classification tags included in an evaluation data set of the corresponding source evaluation task may be "product advantage evaluation", "product defect feedback", "product improvement suggestion", "shopping experience sharing", and "shopping query help" and the like, and in a text classification scene of content security, classification tags included in an evaluation data set of the corresponding source evaluation task may be "include sensitive words" and "do not include sensitive words".
It may be understood that the labeling answer included in the evaluation data set acquired by the acquiring unit 701 may be obtained by processing the corresponding evaluation text through a large language model, or may be obtained through a manual labeling manner, or may be obtained through a combination of a large language model and a manual labeling manner.
The acquiring unit 701 may use only the entity identification task as a sub-evaluation task, may use only the summary generation task as a sub-evaluation task, or may use both the entity identification task and the summary generation task as sub-evaluation tasks.
In this embodiment, after the acquisition unit 701 acquires evaluation data sets corresponding to different evaluation task types, the prediction unit 702 obtains evaluation input data according to an evaluation text and the evaluation task type corresponding to the evaluation text, inputs the evaluation input data into the large language model to be evaluated, and uses the output result of the large language model to be evaluated as the predicted answer of the evaluation text corresponding to different evaluation task types, wherein the evaluation task type corresponding to the evaluation text is the evaluation task type corresponding to the evaluation data set to which the evaluation text belongs.
It may be understood that the prediction unit 702 may select one large language model as the large language model to be evaluated, or may select a plurality of large language models as the large language model to be evaluated, that is, the embodiment may evaluate the text classification performance of only one large language model, or may evaluate the text classification performance of a plurality of large language models at the same time, and the embodiment may display a plurality of large language models, and further use the large language model selected by the input end as the large language model to be evaluated.
When the prediction unit 702 obtains the evaluation input data according to the evaluation text and the evaluation task type corresponding to the evaluation text, the following implementation may be adopted: determining the evaluation task type corresponding to the evaluation text, the evaluation task type in this embodiment being one of a text classification task, an entity identification task and a summary generation task; obtaining a prompt text (i.e., a prompt) according to the determined evaluation task type; and obtaining the evaluation input data according to the evaluation text and the obtained prompt text.
When the prediction unit 702 obtains the evaluation input data according to the evaluation text and the evaluation task type corresponding to the evaluation text, the evaluation task type corresponding to the evaluation text may also be used as a suffix and spliced with the evaluation text, and the splicing result may be used as the evaluation input data; for example, "evaluation text+text classification", "evaluation text+entity identification" or "evaluation text+abstract generation" may be used as the evaluation input data.
After the obtained evaluation input data is input into the large language model to be evaluated, the large language model to be evaluated can be correspondingly processed according to the input evaluation input data by the prediction unit 702, so that prediction answers of evaluation texts corresponding to different evaluation task types are output.
When using the output result of the large language model to be evaluated as the predicted answer of the evaluation text corresponding to different evaluation task types, the prediction unit 702 may further: construct a tag syntax tree according to the classification tags corresponding to the target classification scene; perform text parsing on the output result of the large language model to be evaluated to obtain a parsing result when the evaluation task type is determined to be a text classification task; and extract the portion of the parsing result matched with the tag syntax tree from the output result as the predicted answer.
In this embodiment, the prediction unit 702 may first determine the target classification scene, then acquire the classification label corresponding to the determined target classification scene, and finally construct the label syntax tree according to the acquired classification label.
The prediction unit 702 may use a syntax analyzer to perform text parsing on the output result, thereby obtaining a parsing result corresponding to the output result.
In this embodiment, after the prediction unit 702 obtains the prediction answers corresponding to different evaluation task types by the evaluation text, the processing unit 703 obtains the source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to at least one sub-evaluation task according to the labeled answer and the prediction answer corresponding to the same evaluation task type by the evaluation text.
The processing unit 703 may first obtain a labeling answer corresponding to the evaluation text from the evaluation data set to which the evaluation text belongs, and then obtain an evaluation result corresponding to different evaluation task types according to the obtained labeling answer and a predicted answer output by the large language model to be evaluated.
In this embodiment, after the processing unit 703 obtains a source evaluation result corresponding to a source evaluation task and at least one sub-evaluation result corresponding to at least one sub-evaluation task, the evaluation unit 704 obtains the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result.
When obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result, the evaluation unit 704 may use the sum of the source evaluation result and the at least one sub-evaluation result as the text classification performance of the large language model to be evaluated.
The evaluation unit 704 may further include obtaining weight values corresponding to different evaluation task types when obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result, and obtaining the text classification performance of the large language model to be evaluated according to the weight values corresponding to the source evaluation task and the source evaluation result and the weight values corresponding to the at least one sub-evaluation task and the at least one sub-evaluation result.
In order to further highlight the significance of the text classification task in evaluating the text classification performance of the large language model, the weight value corresponding to the source evaluation task in the embodiment is larger than the weight value corresponding to the sub-evaluation task, and the weight values corresponding to different sub-evaluation tasks can be the same or different.
In this embodiment, the number of the large language models to be evaluated may be multiple when performing the text classification performance evaluation, so after obtaining the text classification performance of each large language model to be evaluated, the evaluation unit 704 may further select, according to the obtained text classification performance, one of the large language models to be evaluated with the optimal text classification performance (for example, the highest score) as the target large language model, and the evaluation unit 704 may further display the selected target large language model at the input end.
In addition, after obtaining the text classification performance of the large language model to be evaluated, the evaluation unit 704 may output the source evaluation result and at least one sub-evaluation result together, in addition to outputting the obtained text classification performance.
The evaluation device 700 for text classification performance of the large language model of the embodiment may further include a construction unit 705, configured to obtain an evaluation text corresponding to the target classification scene, obtain initial input data according to the evaluation text and different evaluation task types, input the initial input data into the candidate large language model to obtain labeled answers of the candidate large language model corresponding to different evaluation task types output by the evaluation text, and obtain an evaluation data set corresponding to different evaluation task types according to the evaluation text and the labeled answers of the evaluation text corresponding to different evaluation task types.
In this embodiment, the target classification scene may be a text classification scene for product evaluation, or may be a text classification scene for content security, that is, classifying whether the text contains a sensitive word.
If the target classification scene is a text classification scene for product evaluation, the construction unit 705 can acquire a text for evaluation by the user after the purchase of goods or services is completed as an evaluation text, and if the target classification scene is a text classification scene for content safety, the construction unit 705 can acquire posting content or comments of the user on a network as an evaluation text.
The construction unit 705 may also perform text preprocessing on the obtained evaluation texts, such as traditional-to-simplified Chinese conversion, full-width to half-width conversion, case conversion, removal of URLs, and replacement of consecutive whitespace, thereby obtaining high-quality evaluation texts.
In this embodiment, the different evaluation task types include at least one of an entity recognition task and a digest generation task, and a text classification task.
When the construction unit 705 obtains initial input data according to the evaluation text and different evaluation task types, the initial input data can be obtained according to the evaluation text and prompt text corresponding to different evaluation task types, or the initial input data can be obtained directly according to the evaluation text and different evaluation task types.
Because the output data may not be standard when the large language model outputs the corresponding output data according to the initial input data, when the construction unit 705 obtains the labeling answers of the candidate large language model corresponding to different evaluation task types output by the evaluation text, the construction unit may also manually audit the output result of the candidate large language model, and further use the output result passing the audit as the labeling answer of the evaluation text corresponding to different evaluation task types.
In this embodiment, there may be multiple evaluation texts, so the construction unit 705 may obtain, for each evaluation task type, the evaluation data set corresponding to the current evaluation task type from the different evaluation texts and their annotated answers for that task type.
The evaluation data set corresponding to the text classification task contains evaluation texts for different classification labels; if the numbers of evaluation texts corresponding to different classification labels differ greatly, the evaluation accuracy of the large language model to be evaluated can be affected.
Therefore, after obtaining the evaluation data set corresponding to the text classification task, the construction unit 705 performs clustered uniform sampling on the evaluation texts of all classification labels, ensuring that the number of evaluation texts for each classification label in the evaluation data set is balanced.
The construction unit 705 may first determine the cluster center of the evaluation texts corresponding to each classification label, and then select a preset number of evaluation texts closest to the cluster center as the evaluation texts for that label, for example, the 5000 evaluation texts nearest the cluster center.
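The sampling step can be sketched as follows, assuming the evaluation texts have already been embedded as vectors; with a single cluster per label the cluster center reduces to the mean vector, and the per-label count is a placeholder for the preset number mentioned above:

import numpy as np

def sample_near_center(embeddings_by_label, per_label=5000):
    """For each classification label, keep the indices of the evaluation
    texts whose embeddings lie closest to the label's cluster center."""
    selected = {}
    for label, embs in embeddings_by_label.items():
        center = embs.mean(axis=0)  # single-cluster k-means center
        dists = np.linalg.norm(embs - center, axis=1)
        # Indices of the texts nearest the center, capped at per_label.
        selected[label] = np.argsort(dists)[:per_label]
    return selected

rng = np.random.default_rng(0)
demo = {"positive": rng.normal(size=(30, 8)),
        "negative": rng.normal(size=(30, 8))}
balanced = sample_near_center(demo, per_label=10)  # 10 texts per label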
The evaluation device 700 for text classification performance of a large language model of this embodiment may further include a classification unit 706, configured to: acquire a text to be classified corresponding to the target classification scenario when the large language model to be evaluated is determined, according to its text classification performance, to be the target large language model; input the text to be classified into the target large language model for text classification; and use the output result of the target large language model as the classification result of the text to be classified.
That is, after the evaluation of the text classification performance of the large language model to be evaluated is complete, the classification unit 706 can determine, according to the obtained text classification performance, whether that model is the target large language model, and, if so, input the acquired text to be classified corresponding to the target classification scenario into the target large language model for text classification, which improves the accuracy of the resulting classification.
If there is only one large language model to be evaluated, the classification unit 706 may determine that it is the target large language model when its text classification performance exceeds a preset threshold; if there are multiple large language models to be evaluated, the classification unit 706 may take the one with the best text classification performance as the target large language model.
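Both selection rules can be sketched together; the threshold value of 0.8 is a placeholder for the preset threshold, which the embodiment does not specify:

def select_target_model(performances, threshold=0.8):
    """Return the name of the target large language model, or None if
    the single candidate does not exceed the preset threshold."""
    if len(performances) == 1:
        name, perf = next(iter(performances.items()))
        return name if perf > threshold else None
    # Multiple candidates: take the one with the best performance.
    return max(performances, key=performances.get)

print(select_target_model({"model_a": 0.91, "model_b": 0.87}))  # model_a
print(select_target_model({"model_a": 0.75}))                   # None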
It may be appreciated that the classification unit 706 may obtain classification input data according to the text to be classified and the text classification task, and then input the classification input data into the target large language model; in the case where the target large language model has been configured to perform only the text classification task, the text to be classified may be input directly.
In the technical scheme of the present disclosure, the acquisition, storage, application, and the like of the relevant user personal information all comply with relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 shows a block diagram of an electronic device for the method of evaluating text classification performance of a large language model according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the device 800 are connected to the I/O interface 805, including an input unit 806, such as a keyboard, mouse, etc., an output unit 807, such as various types of displays, speakers, etc., a storage unit 808, such as a magnetic disk, optical disk, etc., and a communication unit 809, such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, the evaluation method of text classification performance of a large language model. For example, in some embodiments, the method of evaluating text classification performance of a large language model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the above-described evaluation method of text classification performance of a large language model may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method of evaluating text classification performance of a large language model in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a presentation device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for presenting information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (13)

1. A method for evaluating the text classification performance of a large language model, comprising:
acquiring an evaluation data set corresponding to a source evaluation task and an evaluation data set corresponding to at least one sub-evaluation task, wherein the source evaluation task is a text classification task, the sub-evaluation tasks comprise an entity recognition task and a summary generation task, and the different evaluation data sets contain evaluation texts and annotated answers of the evaluation texts corresponding to different evaluation task types;
obtaining evaluation input data according to an evaluation text and the evaluation task type corresponding to the evaluation text, inputting the evaluation input data into a large language model to be evaluated, and taking the output result of the large language model to be evaluated as the predicted answers of the evaluation text corresponding to the different evaluation task types;
obtaining, according to the annotated answers and the predicted answers of the evaluation text corresponding to the same evaluation task type, a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to the at least one sub-evaluation task; and
obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result;
wherein obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result comprises:
acquiring weight values corresponding to the different evaluation task types, wherein the weight value of the source evaluation task is greater than the weight values of the sub-evaluation tasks; and
obtaining the text classification performance of the large language model to be evaluated according to the weight value corresponding to the source evaluation task and the source evaluation result, and the weight values corresponding to the at least one sub-evaluation task and the at least one sub-evaluation result.

2. The method according to claim 1, wherein obtaining the evaluation input data according to the evaluation text and the evaluation task type corresponding to the evaluation text comprises:
determining the evaluation task type corresponding to the evaluation text;
acquiring a prompt text according to the evaluation task type; and
obtaining the evaluation input data according to the evaluation text and the prompt text.

3. The method according to claim 1, further comprising:
acquiring an evaluation text corresponding to a target classification scenario;
obtaining initial input data according to the evaluation text and different evaluation task types, inputting the initial input data into a candidate large language model, and obtaining annotated answers, corresponding to the different evaluation task types, output by the candidate large language model for the evaluation text; and
obtaining evaluation data sets corresponding to the different evaluation task types according to the evaluation text and the annotated answers of the evaluation text corresponding to the different evaluation task types.

4. The method according to claim 1, wherein taking the output result of the large language model to be evaluated as the predicted answers of the evaluation text corresponding to the different evaluation task types comprises:
constructing a label syntax tree according to the classification labels corresponding to a target classification scenario;
in the case where the evaluation task type is determined to be the text classification task, performing text parsing on the output result of the large language model to be evaluated to obtain a parsing result; and
extracting, from the output result, the part of the parsing result that matches the label syntax tree as the predicted answer.

5. The method according to claim 1, further comprising:
in the case where the large language model to be evaluated is determined, according to the text classification performance, to be a target large language model, acquiring a text to be classified corresponding to a target classification scenario; and
inputting the text to be classified into the target large language model for text classification, and taking the output result of the target large language model as the classification result of the text to be classified.

6. A device for evaluating the text classification performance of a large language model, comprising:
an acquisition unit configured to acquire an evaluation data set corresponding to a source evaluation task and an evaluation data set corresponding to at least one sub-evaluation task, wherein the source evaluation task is a text classification task, the sub-evaluation tasks comprise an entity recognition task and a summary generation task, and the different evaluation data sets contain evaluation texts and annotated answers of the evaluation texts corresponding to different evaluation task types;
a prediction unit configured to obtain evaluation input data according to an evaluation text and the evaluation task type corresponding to the evaluation text, input the evaluation input data into a large language model to be evaluated, and take the output result of the large language model to be evaluated as the predicted answers of the evaluation text corresponding to the different evaluation task types;
a processing unit configured to obtain, according to the annotated answers and the predicted answers of the evaluation text corresponding to the same evaluation task type, a source evaluation result corresponding to the source evaluation task and at least one sub-evaluation result corresponding to the at least one sub-evaluation task; and
an evaluation unit configured to obtain the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result;
wherein, when obtaining the text classification performance of the large language model to be evaluated according to the source evaluation result and the at least one sub-evaluation result, the evaluation unit specifically performs:
acquiring weight values corresponding to the different evaluation task types, wherein the weight value of the source evaluation task is greater than the weight values of the sub-evaluation tasks; and
obtaining the text classification performance of the large language model to be evaluated according to the weight value corresponding to the source evaluation task and the source evaluation result, and the weight values corresponding to the at least one sub-evaluation task and the at least one sub-evaluation result.

7. The device according to claim 6, wherein, when obtaining the evaluation input data according to the evaluation text and the evaluation task type corresponding to the evaluation text, the prediction unit specifically performs:
determining the evaluation task type corresponding to the evaluation text;
acquiring a prompt text according to the evaluation task type; and
obtaining the evaluation input data according to the evaluation text and the prompt text.

8. The device according to claim 6, further comprising a construction unit configured to perform:
acquiring an evaluation text corresponding to a target classification scenario;
obtaining initial input data according to the evaluation text and different evaluation task types, inputting the initial input data into a candidate large language model, and obtaining annotated answers, corresponding to the different evaluation task types, output by the candidate large language model for the evaluation text; and
obtaining evaluation data sets corresponding to the different evaluation task types according to the evaluation text and the annotated answers of the evaluation text corresponding to the different evaluation task types.

9. The device according to claim 6, wherein, when taking the output result of the large language model to be evaluated as the predicted answers of the evaluation text corresponding to the different evaluation task types, the prediction unit specifically performs:
constructing a label syntax tree according to the classification labels corresponding to a target classification scenario;
in the case where the evaluation task type is determined to be the text classification task, performing text parsing on the output result of the large language model to be evaluated to obtain a parsing result; and
extracting, from the output result, the part of the parsing result that matches the label syntax tree as the predicted answer.

10. The device according to claim 6, further comprising a classification unit configured to perform:
in the case where the large language model to be evaluated is determined, according to the text classification performance, to be a target large language model, acquiring a text to be classified corresponding to a target classification scenario; and
inputting the text to be classified into the target large language model for text classification, and taking the output result of the target large language model as the classification result of the text to be classified.

11. An electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-5.

12. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1-5.

13. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN202410578939.XA 2024-05-10 2024-05-10 Method and device for evaluating text classification performance of large language model Active CN118568256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410578939.XA CN118568256B (en) 2024-05-10 2024-05-10 Method and device for evaluating text classification performance of large language model

Publications (2)

Publication Number Publication Date
CN118568256A CN118568256A (en) 2024-08-30
CN118568256B true CN118568256B (en) 2025-03-28

Family

ID=92475380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410578939.XA Active CN118568256B (en) 2024-05-10 2024-05-10 Method and device for evaluating text classification performance of large language model

Country Status (1)

Country Link
CN (1) CN118568256B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118861166B (en) * 2024-09-24 2025-01-24 苏州元脑智能科技有限公司 A large model evaluation system, large model evaluation method and product
CN119889697A (en) * 2025-03-20 2025-04-25 武汉大学中南医院 Intelligent cognitive function evaluation method and device, electronic equipment and medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
WO2023224672A1 (en) * 2022-05-20 2023-11-23 Google Llc Deep learning system for navigating feedback
CN117216544A (en) * 2023-05-24 2023-12-12 腾讯科技(深圳)有限公司 Model training method, natural language processing method, device and storage medium
CN117743590B (en) * 2023-11-30 2024-07-26 北京汉勃科技有限公司 Legal assistance method and system based on large language model

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN116737881A (en) * 2023-07-14 2023-09-12 上海墨百意信息科技有限公司 Model evaluation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

Title
Chen, QJ, et al. An extensive benchmark study on biomedical text generation and mining with ChatGPT. 2024-04-02; Vol. 39, No. 9; pp. 3-6. *

Also Published As

Publication number Publication date
CN118568256A (en) 2024-08-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant