Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Existing conference abstract generation models focus almost exclusively on the textual information in the speech transcript, while paralinguistic information produced during the meeting, such as intonation, body language, and facial expression, is ignored. These paralinguistic cues, however, can convey richer information and aid in understanding the context and perceiving critical information. The present invention is the first to explore generating the conference abstract using acoustic modality information: it extracts key speech-level information from the meeting recording using highly interpretable machine learning methods, for example spoken content emphasized through higher intonation, deliberate pauses, and faster speech rate, and the approach adapts well to different speakers.
In general, referring to fig. 1, the provided method for generating a meeting abstract based on paralinguistic acoustic features first transcribes a segment of a meeting recording (A) into text (B), and then processes (C) the recording and the transcribed text to extract the acoustic features of the speech. This processing yields text with key-phrase labels (including importance levels), which is then processed (D) by a large model to output the final meeting abstract (E).
Specifically, as shown in fig. 1 and fig. 2, the provided conference abstract generating method based on the acoustic features of the secondary language comprises the following steps:
Step S1, converting conference audio into text, and extracting key information at the text level using a deep learning model.
A key phrase is the core of a text segment and is important to the overall semantics of the text. Research shows that keywords are the core parts of sentences and can serve as indicators for selecting important sentences. To avoid, as far as possible, the resource waste caused by training a model, and to achieve efficient meeting abstract generation, an unsupervised model can be adopted to extract key phrases.
In one embodiment, extracting text key information includes first transcribing the audio into text and then applying an unsupervised keyword extraction model to obtain the key information. For example, the keyword extraction model may adopt the MDERank model (a BERT-based keyword extraction method), which ranks candidate keywords by the similarity between the BERT embeddings of the source document and of masked documents, giving strong performance and generalization capability. Herein, for each speaker, K_i denotes that speaker's set of text keywords.
It should be noted that other deep learning models, such as KeyBERT or TextRank, may also be used to extract key information from the text.
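The MDERank-style ranking described above can be sketched as follows. This is a minimal illustration only: a bag-of-words cosine similarity stands in for the BERT embedding similarity, and all function names and data are hypothetical.

```python
from collections import Counter
from math import sqrt

def bow_vec(text):
    """Bag-of-words vector; a crude stand-in for a BERT document embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_keyphrases(document, candidates):
    """MDERank-style ranking: mask each candidate phrase and score it by how
    much the document representation changes; a larger drop in similarity
    means the phrase carried more of the document's meaning."""
    doc_vec = bow_vec(document)
    sim_after_mask = {c: cosine(doc_vec, bow_vec(document.replace(c, "[MASK]")))
                      for c in candidates}
    return sorted(candidates, key=lambda c: sim_after_mask[c])

document = "budget review budget review meeting notes"
candidates = ["budget review", "meeting notes", "notes"]
ranked = rank_keyphrases(document, candidates)
# masking "budget review" changes the document the most, so it ranks first
```

In the actual model the masked and unmasked documents would be embedded with BERT rather than bag-of-words counts; the ranking logic is otherwise the same.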
Step S2, extracting key information at the acoustic level from the conference audio, which mainly serves to highlight text content with large intonation and speech-rate changes and obvious pauses.
For example, the fundamental frequency (f0) information of the speech is extracted from the meeting recording, the speech rate of each sentence is calculated, and a clustering algorithm is used to select, for each speaker, the parts with large intonation and speech-rate variation and obvious pauses as the key information at the acoustic level.
As shown in fig. 3, the left graph shows a normal intonation and the right graph an emphasized intonation; the fundamental frequency f0 of the right graph is clearly higher overall and fluctuates more.
In one embodiment, extracting key information for an acoustic layer includes the steps of:
In step S201, the fundamental frequency is extracted.
The fundamental frequency (f0) of speech is a key indicator of intonation change. It reflects the lowest frequency component of the sound and is used to analyze the rise and fall of intonation. For example, for all speech segments of each speaker, the fundamental frequency values of each segment are first extracted with the pYIN fundamental frequency extraction algorithm, and the average fundamental frequency of each segment is then calculated for subsequent intonation analysis and processing.
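In practice the pYIN algorithm (e.g. `librosa.pyin`) would perform this extraction; as a self-contained illustration of per-frame f0 estimation, the following sketch substitutes a crude autocorrelation estimator, with all names and parameters hypothetical.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=65.0, fmax=500.0):
    """Crude autocorrelation-based f0 estimate for one voiced frame
    (a simplified stand-in for pYIN, which also handles voicing decisions)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_lo = int(sr / fmax)              # shortest pitch period considered
    lag_hi = int(sr / fmin)              # longest pitch period considered
    best_lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
    return sr / best_lag

sr = 16000
t = np.arange(4000) / sr                 # a quarter second of audio
tone = np.sin(2 * np.pi * 220.0 * t)     # synthetic 220 Hz test tone
f0 = estimate_f0(tone, sr)               # close to 220 Hz
```

Averaging such per-frame estimates over a segment gives the segment's mean f0 used in the intonation analysis.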
Step S202, calculating the speech rate.
The speech rate is an important indicator of the fluency and understandability of speech. For example, the speech rate is calculated as the ratio of the number of words in a speech segment to the segment's duration. Specifically, for each speech segment, the number of words in its transcribed text is first counted, the segment's duration is then computed, and the speech rate is finally obtained as the ratio of the two.
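The speech-rate calculation above amounts to a one-line ratio; the following sketch (with illustrative names and timestamps) shows it for a single segment.

```python
def speech_rate(transcript, start, end):
    """Words per second for one speech segment, given its transcribed text
    and its start/end times in seconds."""
    n_words = len(transcript.split())
    duration = end - start
    return n_words / duration

rate = speech_rate("the quarterly numbers look very strong", 10.0, 12.0)
# 6 words over 2.0 seconds -> 3.0 words per second
```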
In step S203, a pause time is calculated.
Pauses play an important role in speech and can effectively reinforce the expression of important content. For example, pauses are handled with a word-unit-based pause-time calculation: for each speech segment, its transcribed text is first obtained, and the pause time before each word is then calculated from the start time and end time of each word.
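Given word-level timestamps from the transcription, the word-unit pause computation can be sketched as follows (the data structure and names are illustrative).

```python
def pauses_before_words(words):
    """Pause duration before each word of a segment, given a list of
    (word, start, end) timestamps in seconds; the first word gets 0.0."""
    gaps, prev_end = [], None
    for _word, start, end in words:
        gaps.append(0.0 if prev_end is None else max(0.0, start - prev_end))
        prev_end = end
    return gaps

segment = [("so", 0.0, 0.3), ("the", 0.35, 0.5), ("deadline", 1.4, 1.9)]
gaps = pauses_before_words(segment)
# roughly [0.0, 0.05, 0.9]: the long gap before "deadline" marks a pause
```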
Step S204, extracting speech segments with significant features based on intonation, speech rate, and pause duration.
To automatically identify which speech features (e.g., high intonation, fast speech, long pauses) meet the criteria of "high", "fast", and "long", a clustering method from unsupervised learning may be employed. Because the natural variation across speakers makes a fixed threshold unsuitable in general, in one embodiment a K-Means clustering algorithm is employed to automatically identify and extract speech segments with significant features. For example, for each speaker S_i, three feature sets are obtained: a high-intonation segment set, a fast-speech-rate segment set, and a long-pause segment set.
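A minimal sketch of this per-speaker clustering step, using a one-dimensional two-cluster K-Means on mean f0 values (scikit-learn's `KMeans` could equally be used; the data here are illustrative):

```python
def high_cluster(values, iters=20):
    """1-D K-Means with K=2: returns the indices of the values belonging to
    the cluster with the higher centroid, so "high" is defined relative to
    this speaker rather than by any fixed global threshold."""
    lo, hi = min(values), max(values)        # centroid initialization
    assign = [False] * len(values)
    for _ in range(iters):
        assign = [abs(v - hi) < abs(v - lo) for v in values]
        hi_vals = [v for v, a in zip(values, assign) if a]
        lo_vals = [v for v, a in zip(values, assign) if not a]
        if hi_vals:
            hi = sum(hi_vals) / len(hi_vals)
        if lo_vals:
            lo = sum(lo_vals) / len(lo_vals)
    return [i for i, a in enumerate(assign) if a]

# illustrative per-segment mean f0 values (Hz) for one speaker
f0_means = [118.0, 122.0, 120.0, 180.0, 175.0, 119.0]
high_pitch_segments = high_cluster(f0_means)   # segments 3 and 4 stand out
```

Running the same procedure on speech-rate and pause-duration values yields the fast-speech and long-pause segment sets for the speaker.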
In summary, by focusing on acoustic modality information, a number of important acoustic features are extracted. These features help in understanding the different emotions, attitudes, and key information of conference participants. When people want to stress a point of view or emphasize specific information, they typically raise their intonation, making keywords or key phrases more prominent in the speech and thus more important within the whole sentence or paragraph. Such intonation variation helps listeners better understand and remember the information conveyed.
Step S3, guiding the large language model to answer designed open questions, so as to mine the clues and important information of the meeting.
In order to generate a more specialized meeting summary with a large language model, a chain-of-thought design is used to optimize the prompts fed to the large language model.
For example, the LLM is guided to answer multiple questions so as to mine the clues and important information of the meeting; analysis shows that open questions elicit information more effectively than structured questions. Thus, in one embodiment, four open questions (Q) are designed to help the LLM mine the critical information and clues in the meeting. The four designed questions are as follows:
Q1: Who are the participants of the meeting?
Q2: Which topics does this meeting discuss?
Q3: Which key events related to the topics occurred during the meeting?
Q4: Which consensus was reached on the relevant topics during the meeting?
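A minimal sketch of how the four guiding questions might be assembled into a probe prompt for the LLM; the function name, wording, and layout are all hypothetical.

```python
GUIDING_QUESTIONS = [
    "Q1: Who are the participants of the meeting?",
    "Q2: Which topics does this meeting discuss?",
    "Q3: Which key events related to the topics occurred during the meeting?",
    "Q4: Which consensus was reached on the relevant topics?",
]

def build_probe_prompt(transcript):
    """Splice the meeting transcript and the four open questions into one
    prompt, so the LLM's answers can be collected for later integration."""
    return ("Meeting transcript:\n" + transcript +
            "\n\nAnswer the following open questions about the meeting:\n" +
            "\n".join(GUIDING_QUESTIONS))

prompt = build_probe_prompt("A: Let's review the budget. B: Agreed, item one is hiring.")
```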
Designing open questions in this way gives the method cross-content applicability and effectiveness.
Step S4, integrating the text-level key information, the acoustic-level key information, the transcribed text, and the guiding questions to generate the abstract.
In this step, the final summary is generated by integrating the text key information, the speech key information, the transcribed text, and the guiding questions. For example, the LLM is provided with the text keywords, together with the content showing larger intonation and speech-rate fluctuation and long pauses, as key information, and different weights are assigned to these pieces of key information. The key information is then weighted by its overlap in the time dimension to obtain the final importance of the text, which is embedded into the prompt.
Specifically, the LLM's answers (R) to the four questions are used for further information integration. The key information (covering both the text level and the speech level), the meeting transcript, the guiding questions, and the LLM answers are then spliced to form a new input (K, D, Q, R) and fed into the LLM to generate the final summary; note that the text-level and acoustic-level key information are assigned different weights.
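The splicing and weighting step can be sketched as follows. The weight values, prompt wording, and all names are hypothetical; the point illustrated is that a phrase flagged by both the text level and the acoustic level accumulates both weights and is therefore ranked first.

```python
def build_summary_prompt(text_keys, acoustic_keys, transcript, questions, answers,
                         w_text=1.0, w_acoustic=0.5):
    """Splice weighted key information (K), transcript (D), guiding
    questions (Q), and LLM answers (R) into the final summarization prompt."""
    weight = {}
    for k in text_keys:
        weight[k] = weight.get(k, 0.0) + w_text
    for k in acoustic_keys:
        weight[k] = weight.get(k, 0.0) + w_acoustic
    ranked = sorted(weight, key=weight.get, reverse=True)
    return "\n".join([
        "Key information (most important first): " + "; ".join(ranked),
        "Transcript: " + transcript,
        "Guiding questions: " + " ".join(questions),
        "Answers: " + " ".join(answers),
        "Using all of the above, write the final meeting summary.",
    ])

prompt = build_summary_prompt(
    text_keys=["budget cut", "Q3 roadmap"],
    acoustic_keys=["budget cut", "launch delay"],   # stressed or paused content
    transcript="A: The budget cut means the launch delay is unavoidable.",
    questions=["Q1: Who are the participants?"],
    answers=["A1: Alice and Bob."],
)
# "budget cut" is flagged at both levels, so it is listed first
```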
To sum up, in order to improve the LLM's adaptability to different meeting content and reduce training cost, in keeping with the original purpose of using an LLM, a weighting method is adopted to integrate the text key information and the speech key information, and guiding questions related to the meeting elements are embedded to constrain and guide the LLM toward generating higher-quality meeting summaries. It should be noted that this pre-LLM constraint approach can transfer seamlessly to more advanced LLMs.
Further, the entire meeting summary generation flow may be packaged, and an application, the Paralinguistic-SUM application, may be developed based on the human-machine collaboration concept. Throughout the summary generation process, the user can participate in and interact with the system.
Because artificial intelligence (AI) can make mistakes, and each user's interpretation of the results is subjective, a user may not be fully satisfied with the labeling results given by the AI. If users are given no opportunity to adjust the labeling results, the quality of the generated meeting summary and user satisfaction may suffer directly. For these reasons, the labeling results are kept open to adjustment by the user. Providing the user with the ability to adjust the annotation results ensures the quality of the meeting abstract and high user satisfaction.
In order to reduce the interaction burden between the user and the system and improve ease of use, the interaction interface of the Paralinguistic-SUM application includes synchronized display of audio playback and transcribed text, key-phrase annotation adjustment, and answer modification functions, among others. For example, interface A provides browsing of the conference audio, supports variable-speed playback, and shows the synchronized transcription. Interfaces B and C respectively let the user adjust the key phrases and the large language model's replies to the guiding questions.
Specifically, for the synchronized audio-playback and transcript display function, the application allows the user to browse the conference audio and supports variable-speed playback; the transcript is synchronized with audio playback, with timestamps added. For the key-phrase annotation adjustment function, the user may modify the AI-generated key phrases; annotated key phrases are distinguished by background color, e.g., red for highest importance, yellow for medium importance, and green for lowest importance. The user may change a phrase's importance level where adjustment is needed, add new key phrases, or remove existing key-phrase labels. For the answer modification function, the user can modify the AI's answers to the guiding questions, ensuring that the generated summary better matches the actual content. The application tags the AI-generated annotations to help distinguish them from the user-modified portions.
In summary, since the invention is a tool that cooperates with humans and can make errors, users can exercise their own judgment by participating in the meeting abstract generation process, thereby improving the quality of the generated abstract. The designed application can be embedded in a smartphone or other smart device, allowing users to easily inject their own understanding and knowledge; it provides the user with a channel to modify all content and offers friendly interaction.
To further verify the effect of the present invention, for the same piece of text, the result generated using the paralinguistic acoustic features is compared with the result generated without them. Analysis of the comparison shows that, because it takes acoustic features such as speech rate and intonation into account, the abstract generated using the paralinguistic acoustic features conveys the speaker's intent more faithfully.
In summary, the invention provides a method, based on paralinguistic acoustic features, for enhancing the meeting summarization capability of a large language model; by integrating paralinguistic acoustic feature information into the meeting abstract generation process, it makes full use of the emotion, attitude, importance, and other information carried by the speech. Using the acoustic modality information to assist in generating the meeting abstract effectively improves the accuracy and readability of the abstract.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or Python, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.