
CN119807407B - A conference summary generation method based on paralinguistic acoustic features - Google Patents

A conference summary generation method based on paralinguistic acoustic features

Info

Publication number
CN119807407B
CN119807407B
Authority
CN
China
Prior art keywords
information
text
conference
key information
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411880573.8A
Other languages
Chinese (zh)
Other versions
CN119807407A (en)
Inventor
王璐
李奕龙
陈凯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202411880573.8A priority Critical patent/CN119807407B/en
Publication of CN119807407A publication Critical patent/CN119807407A/en
Application granted granted Critical
Publication of CN119807407B publication Critical patent/CN119807407B/en


Landscapes

  • Machine Translation (AREA)

Abstract

This invention discloses a method for generating conference summaries based on paralinguistic acoustic features. The method includes: converting conference audio into text and extracting key information at the text level using a deep learning model; extracting key acoustic information from the conference audio to obtain content with significant speech features that meet set standards, where the significant speech features are determined based on intonation, speech rate, and pause duration; designing open-ended questions with the goal of uncovering clues about the conference to guide a large language model in answering and obtaining answer information; and integrating the key textual information, key acoustic information, transcribed text, open-ended questions, and answer information, and inputting the information into the large language model to generate a conference summary. This invention improves the quality and efficiency of conference summary generation.

Description

Conference summary generation method based on paralinguistic acoustic features
Technical Field
The invention relates to the technical field of acoustic feature recognition, and in particular to a conference summary generation method based on paralinguistic acoustic features.
Background
Conferences are a primary means of information exchange in the workplace. During a meeting, a speaker's conversation is typically unstructured and unedited, often produced in real time without prior preparation. Important information in a meeting is therefore usually sparse, and the wording may appear scattered, repetitive, and inconsistent. In other words, meeting speech may contain a great deal of "noise". Under these conditions the meeting summary becomes very important: it not only gives participants a basis for reviewing and tracing the meeting history, but also lets people who did not attend quickly learn the meeting's important content and conclusions. Compiling a meeting summary, however, is a challenging task. The recorder must capture key information rapidly while coping with difficulties such as participants' varying speech rates, specialized domain knowledge, and unexpected events. After the meeting, the summary must be written from the notes taken during the meeting together with an understanding and recollection of the meeting content.
In the prior art, automatic summary generation schemes fall mainly into two categories: extractive summarization and abstractive summarization. Early studies focused on extractive meeting summaries, for example by excerpting portions of the text to compose the summary. However, meeting summaries obtained by extraction read poorly and are difficult for third-party readers to understand. Abstractive meeting summaries, by contrast, are clearly and concisely worded and far more readable. The advent of Transformer-based pre-trained models such as BERT, BART, and T5 brought significant improvements to Natural Language Generation (NLG) tasks. Despite their great success on generation tasks, they share a drawback: a complete corpus is required. Current work on meeting summarization focuses mainly on supervised approaches, which makes model training inevitably dependent on large corpora.
One potential way to address the data-limitation problem is to turn to meeting summarization techniques that rely on weaker supervision. In recent years, the popularity of large language models (LLMs) and their powerful generalization capabilities have created new opportunities for meeting summarization. An LLM can understand world knowledge and context and can be used directly to produce meeting summaries without any additional training. However, if the meeting transcript is submitted directly to an LLM, the generated result is based entirely on the text modality, meaning that key information carried by other modalities may be ignored; the resulting summary then lacks emphasis and is of poor quality.
In view of the foregoing, improvements are needed in the art to further raise the quality and readability of automatic summary generation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a conference summary generation method based on paralinguistic acoustic features. The method comprises the following steps:
Converting conference audio into text to obtain a transcript, and extracting text-level key information using a deep learning model;
Extracting acoustic-level key information from the conference audio to obtain content with salient speech features that meet set criteria, the salient speech features being determined based on intonation, speech rate, and pause duration;
Designing open-ended questions, aimed at mining clues about the conference, to guide a large language model to answer and obtain answer information;
Integrating the text-level key information, the acoustic-level key information, the transcript, the open-ended questions, and the answer information, and inputting the result into the large language model to generate a meeting summary.
Compared with the prior art, the conference summary generation method based on paralinguistic acoustic features first extracts text-level keyword information with a deep learning method, then extracts acoustic feature information such as intonation, speech rate, and pauses from the conference audio to identify speech segments with salient features, further designs open-ended questions to guide a large language model in mining key information and clues in the conference, and finally integrates the text keywords, the acoustic key information, the transcript, and the large language model's answers to generate the final meeting summary. By integrating paralinguistic acoustic feature information into the summary generation process, the method makes full use of the emotion, attitude, and importance cues carried in the speech signal, effectively improving the accuracy and readability of the meeting summary.
Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a process diagram of a method for generating a meeting summary based on paralinguistic acoustic features, according to one embodiment of the invention;
FIG. 2 is a flow chart of a method of meeting summary generation based on paralinguistic acoustic features in accordance with one embodiment of the present invention;
FIG. 3 is a diagram comparing the variation of the fundamental frequency in a normal intonation versus an emphasized intonation according to one embodiment of the present invention;
In the drawings: Speech Keyphrase denotes speech key segments; Intonation Recognition denotes intonation recognition; Long Stop and High Speed Recognition denotes long-pause and fast-speech recognition; and Keyphrase Extraction denotes key segment extraction.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Existing conference summary generation models focus almost entirely on the textual information in the speech transcript and ignore other paralinguistic information present during the meeting, such as intonation, body language, and facial expressions. Yet these paralinguistic cues can convey richer information and aid in understanding context and perceiving key information. The invention is the first to explore generating conference summaries with acoustic modality information: it uses highly interpretable machine learning methods to extract speech-level key information from the meeting recording, such as stressed spoken content with higher intonation, deliberate pauses, and faster speech rate, and these methods adapt well to different speakers.
In general, referring to fig. 1, the provided method for generating a meeting summary based on paralinguistic acoustic features first transcribes a meeting recording (A) into text (B), and then processes (C) the recording and the transcript to extract the paralinguistic acoustic features. After processing, text with key-phrase labels (including importance levels) is obtained, which is then processed (D) by a large model to output the final meeting summary (E).
Specifically, as shown in fig. 1 and fig. 2, the provided conference summary generation method based on paralinguistic acoustic features comprises the following steps:
Step S1: convert conference audio into text and extract text-level key information using a deep learning model.
A key phrase is the core of a text segment and is important to the overall semantics of the text. Research shows that keywords are the core parts of sentences and can serve as indicators for selecting important sentences. To avoid, as far as possible, the resource waste of training a model and to achieve efficient meeting summary generation, an unsupervised model can be adopted to extract key phrases.
In one embodiment, extracting text-level key information includes first transcribing the audio into text and then applying an unsupervised keyword extraction model to obtain the key information. For example, the keyword extraction model may adopt the MDERank model (a BERT-based keyword extraction method), which ranks candidate keywords by the BERT embedding similarity between the source document and a masked document and therefore has strong performance and generalization capability. Herein, for each speaker, K_i denotes that speaker's set of text keywords.
It should be noted that other deep learning models, such as KeyBERT or TextRank, may also be used to extract text-level key information.
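As a minimal sketch of the unsupervised-extraction idea, the following ranks words by frequency after stop-word removal; it is a deliberately tiny stand-in, not the actual MDERank or KeyBERT ranking, and the stop-word list and transcript are illustrative:

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "is", "are", "be", "we", "for", "that", "this", "it"}

def extract_keywords(transcript: str, top_n: int = 5) -> list[str]:
    """Rank words by frequency after stop-word removal.

    A tiny, unsupervised stand-in for models such as MDERank or
    KeyBERT, which instead rank multi-word candidate phrases by
    embedding similarity.
    """
    words = re.findall(r"[a-z]+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(top_n)]

# Hypothetical per-speaker transcript.
speaker_text = ("The budget review is overdue. The budget must be approved "
                "before the release. Release planning depends on the budget.")
keywords = extract_keywords(speaker_text, top_n=3)  # 'budget' ranks first
```

The output of this step plays the role of the per-speaker keyword set K_i described above.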
Step S2: extract acoustic-level key information from the conference audio, which mainly serves to highlight text content with large intonation changes, large speech-rate changes, and pronounced pauses.
For example, the fundamental frequency (f0) information of the speech is extracted from the meeting recording, the speech rate of each sentence is calculated, and a clustering algorithm is used to select, for each speaker, the parts with large intonation and speech-rate variation and pronounced pauses as the acoustic-level key information.
As shown in fig. 3, the left graph is a normal intonation and the right graph an emphasized intonation; the fundamental frequency f0 of the right graph is clearly higher overall and fluctuates more.
In one embodiment, extracting acoustic-level key information includes the following steps:
In step S201, the fundamental frequency is extracted.
The fundamental frequency (f0) of speech is a key indicator describing intonation changes. The fundamental frequency reflects the lowest frequency component of the sound and is used to analyze the rise and fall of the intonation. For example, for all speech segments u_i^1, …, u_i^m of each speaker (the subscript i indexing the speaker and the superscripts 1 to m indexing the segments), the fundamental frequency values of each segment are first extracted with the pYIN fundamental frequency extraction algorithm, and the average fundamental frequency of each segment is then calculated for subsequent intonation analysis and processing.
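A sketch of the segment-versus-speaker comparison described above, assuming the per-segment f0 contours have already been produced by a pitch tracker such as pYIN (the contour values below are hypothetical):

```python
import numpy as np

def intonation_deviation(segment_f0: list) -> list:
    """Each segment's mean f0 relative to the speaker's overall mean f0.

    segment_f0: one array of voiced-frame f0 values (Hz) per segment,
    as a pitch tracker such as pYIN would produce (unvoiced frames
    removed). Ratios above 1 indicate a raised, possibly emphasized,
    intonation for that segment.
    """
    seg_means = np.array([f0.mean() for f0 in segment_f0])
    speaker_mean = np.concatenate(segment_f0).mean()
    return (seg_means / speaker_mean).tolist()

# Hypothetical contours: the second segment is spoken noticeably higher.
contours = [np.array([110.0, 115.0, 112.0]),
            np.array([180.0, 190.0, 185.0]),
            np.array([118.0, 116.0, 117.0])]
ratios = intonation_deviation(contours)  # second ratio well above 1
```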
Step S202, calculating the speech rate.
The speech rate is an important indicator of a language's fluency and understandability. For example, the speech rate is calculated as the ratio of the number of words in a speech segment to the segment's duration. Specifically, for each speech segment u_i^j, the word count W of its transcribed text is first calculated, and dividing by the segment duration T yields the speech rate v = W / T.
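The W / T computation above can be sketched directly (the example sentence and duration are made up):

```python
def speech_rate(transcript: str, duration_s: float) -> float:
    """Speech rate of one segment: word count W divided by duration T."""
    if duration_s <= 0:
        raise ValueError("segment duration must be positive")
    return len(transcript.split()) / duration_s

# Hypothetical segment: 7 words spoken in 2 seconds -> 3.5 words/s.
rate = speech_rate("we need to finalise the budget today", 2.0)
```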
In step S203, a pause time is calculated.
Pauses play an important role in speech and can effectively reinforce the expression of important content. For example, pauses are handled with a pause-time calculation method based on word units: for each speech segment u_i^j, its transcribed text is first obtained, and the pause time before each word is then calculated based on the start time and end time of each word.
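A sketch of the word-unit pause computation, assuming word-level timestamps of the shape many ASR toolkits emit (the dictionary keys and example values here are illustrative, not the patent's own data format):

```python
def pauses_before_words(words: list) -> list:
    """Pause (seconds) preceding each word, from word-level ASR timestamps.

    words: [{"word": str, "start": float, "end": float}, ...].
    The first word's pause is measured from t = 0 of the segment.
    """
    pauses = []
    prev_end = 0.0
    for w in words:
        pauses.append(max(0.0, w["start"] - prev_end))
        prev_end = w["end"]
    return pauses

# Hypothetical timestamps: a long pause precedes "deadline".
ws = [{"word": "so", "start": 0.2, "end": 0.4},
      {"word": "the", "start": 0.5, "end": 0.6},
      {"word": "deadline", "start": 1.8, "end": 2.3}]
pauses = pauses_before_words(ws)  # roughly [0.2, 0.1, 1.2]
```

The long pause before "deadline" is exactly the kind of cue the clustering step below flags as salient.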
Step S204, based on the intonation, the speech speed and the pause duration, a speech segment with significant features is extracted.
To automatically identify which speech features (e.g., high intonation, fast speech, long pauses) meet the criteria of "high", "fast", and "long", a clustering method from unsupervised learning may be employed. Because a fixed threshold is not universally applicable given the natural variance among speakers, in one embodiment a K-Means clustering algorithm is employed to automatically identify and extract speech segments with salient features. For example, for each speaker S_i, three feature sets are obtained: a high-intonation segment set, a fast-speech segment set, and a long-pause segment set.
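The per-speaker clustering step can be illustrated with a deliberately tiny one-dimensional 2-means (the embodiment uses K-Means over each speaker's feature sets; the rate values below are invented):

```python
def two_means_1d(values: list, iters: int = 50) -> list:
    """Minimal 1-D 2-means: label each value 0 (low) or 1 (high).

    A per-speaker stand-in for the K-Means step in the text: clustering
    each speaker's own feature values avoids a fixed global threshold.
    Centroids are initialised at the min and max of the data.
    """
    lo, hi = min(values), max(values)
    for _ in range(iters):
        # Assign each value to the nearer centroid, then recompute means.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        lo_pts = [v for v, l in zip(values, labels) if l == 0]
        hi_pts = [v for v, l in zip(values, labels) if l == 1]
        if lo_pts:
            lo = sum(lo_pts) / len(lo_pts)
        if hi_pts:
            hi = sum(hi_pts) / len(hi_pts)
    return labels

# Per-sentence speech rates (words/s) for one speaker: the last two are "fast".
rates = [2.1, 2.3, 2.0, 2.2, 4.8, 5.1]
labels = two_means_1d(rates)
fast_idx = [i for i, l in enumerate(labels) if l == 1]
```

The indices in `fast_idx` select the fast-speech segment set for this speaker; high-intonation and long-pause sets are obtained the same way from their respective feature values.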
In summary, by focusing on the acoustic modality, a number of important acoustic features are extracted. These features help in understanding the different emotions, attitudes, and key information of conference participants: when people want to stress a viewpoint or emphasize specific information, they typically raise their intonation, making keywords or key phrases more prominent in the speech and more important within the sentence or paragraph. Such intonation variation helps listeners better understand and remember the information conveyed.
Step S3: guide the large language model to answer designed open-ended questions, mining clues and important information about the conference.
To generate a more specialized meeting summary with a large language model, a chain-of-thought design is used to optimize the prompts fed to the large language model.
For example, the LLM is guided to answer multiple questions to mine clues and important information about the meeting; analysis shows that the information elicited by open-ended questions is more effective than that elicited by structured questions. Thus, in one embodiment, four open-ended questions (Q) are designed to help the LLM mine key information and clues in the meeting. The four questions are as follows:
Q1: Who are the participants of the meeting?
Q2: Which topics does this meeting discuss?
Q3: During the meeting, which key events related to these topics occurred?
Q4: What consensus was reached on the relevant topics?
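A sketch of assembling the four questions into a prompt; only the four questions themselves come from the method, while the surrounding instruction text is an illustrative assumption:

```python
OPEN_QUESTIONS = [
    "Q1: Who are the participants of the meeting?",
    "Q2: Which topics does this meeting discuss?",
    "Q3: During the meeting, which key events related to these topics occurred?",
    "Q4: What consensus was reached on the relevant topics?",
]

def build_clue_prompt(transcript: str) -> str:
    """Wrap the transcript and the four open-ended questions into one prompt."""
    questions = "\n".join(OPEN_QUESTIONS)
    return (f"Meeting transcript:\n{transcript}\n\n"
            f"Answer the following questions about the meeting:\n{questions}")

prompt = build_clue_prompt("Alice: Let's review the budget. Bob: Agreed.")
```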
Designing open-ended questions gives the method cross-content applicability and effectiveness.
Step S4: integrate the text-level key information, the acoustic-level key information, the transcript, and the guiding questions to generate the summary.
In this step, the final summary is generated by integrating the text key information, the speech key information, the transcript, and the guiding questions. For example, the LLM is provided with the text keywords, together with content exhibiting larger intonation and speech-rate fluctuations or long pauses, as key information, and different weights are assigned to these pieces of key information. The key information is then weighted by overlap in the time dimension to obtain the final importance of the text, which is embedded into the prompt.
Specifically, the LLM's answers (R) to the four questions are used for further information integration. The key information (at both the text level and the speech level), the conference transcript, the guiding questions, and the LLM answers are then spliced to form a new input K, D, Q, R, which is fed into the LLM to generate the final summary; note that the text-level and acoustic-level key information are assigned different weights.
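A sketch of splicing K, D, Q, and R into a single prompt; the specific weight values (`w_text`, `w_acoustic`) and the prompt wording are illustrative assumptions, since the text states only that the two kinds of key information receive different weights and that overlapping key information is up-weighted:

```python
def build_summary_prompt(text_keys, acoustic_keys, transcript,
                         questions, answers,
                         w_text=1.0, w_acoustic=1.5):
    """Splice key information (K), transcript (D), questions (Q) and
    answers (R) into one summary prompt.

    Keys found in both modalities accumulate weight, so overlapping
    key information ends up with the highest final importance.
    """
    weighted = {}
    for k in text_keys:
        weighted[k] = weighted.get(k, 0.0) + w_text
    for k in acoustic_keys:
        weighted[k] = weighted.get(k, 0.0) + w_acoustic
    key_lines = "\n".join(
        f"- {k} (importance {w:.1f})"
        for k, w in sorted(weighted.items(), key=lambda kv: -kv[1]))
    qa = "\n".join(f"{q}\n{a}" for q, a in zip(questions, answers))
    return (f"Key information:\n{key_lines}\n\n"
            f"Transcript:\n{transcript}\n\nClues:\n{qa}\n\n"
            "Write a concise meeting summary, emphasising the "
            "highest-weighted key information.")

summary_prompt = build_summary_prompt(
    text_keys=["budget", "deadline"],
    acoustic_keys=["budget"],          # emphasised in speech as well
    transcript="Alice: The budget deadline is Friday.",
    questions=["Q1: Who are the participants of the meeting?"],
    answers=["Alice."])
```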
In summary, to improve the LLM's adaptability to different conference content and to keep training costs low, in line with the original motivation for using an LLM, a weighting method is adopted to integrate the text key information and the speech key information, and guiding questions related to the conference elements are embedded to constrain and guide the LLM toward generating higher-quality conference summaries. Notably, this prompt-level constraint approach can transfer seamlessly to more advanced LLMs.
Further, the entire meeting summary generation flow may be packaged, and an application, the Paralinguistic-SUM application, may be developed around a human-machine collaboration concept. Throughout the summary generation process, the user can participate in and interact with the system.
Because artificial intelligence (AI) can make mistakes and each user's interpretation of the results is subjective, a user may not be fully satisfied with the labeling results given by the AI. If users have no opportunity to adjust the labeling results, the quality of the generated meeting summary and user satisfaction may suffer directly. For these reasons, the labeling results are kept open to editing, allowing the user to make adjustments. Providing the user with the ability to adjust the annotation results ensures the quality of the meeting summary and high user satisfaction.
To reduce the interaction burden between user and system and improve ease of use, the Paralinguistic-SUM application's interface includes synchronized display of audio playback and the transcript, key-phrase annotation adjustment, and answer modification functions. For example, interface A provides browsing of the conference audio, supports variable-speed playback, and shows the synchronized transcript. Interfaces B and C let the user adjust the key phrases and the large language model's replies to the guiding questions, respectively.
Specifically, for the synchronized audio playback and transcript display, the application allows the user to browse the conference audio with variable-speed playback; the transcript is synchronized with the audio playback and time-stamped. For key-phrase annotation adjustment, the user may modify the AI-generated key phrases; annotated key phrases are distinguished by background color, e.g., red for highest importance, yellow for medium importance, and green for lowest importance. The user may change a phrase's importance level where adjustment is needed, add new key phrases, or remove existing key-phrase labels. For answer modification, the user can edit the AI's answers to the guiding questions, ensuring the generated summary better matches the actual content. The application tags AI-generated annotations to help distinguish them from user-modified portions.
In summary, since the invention is a tool that cooperates with humans and can make mistakes, users can exercise their own judgment by participating in the meeting summary generation process, thereby improving the quality of the generated summary. The application can be embedded in a smartphone or other smart device, letting users easily inject their own understanding and knowledge; it offers a channel for modifying all content and provides friendly interaction.
To further verify the effect of the invention, the summaries generated with and without the paralinguistic acoustic features were compared for the same piece of text. Analysis shows that, because it considers acoustic features such as speech rate and intonation, the summary generated with the paralinguistic acoustic features better conveys the speaker's intent.
In summary, the invention provides a method, based on paralinguistic acoustic features, for enhancing large-language-model meeting summarization; by integrating paralinguistic acoustic feature information into the summary generation process, it makes full use of the emotion, attitude, and importance cues contained in the speech signal. Using acoustic modality information to assist summary generation effectively improves the accuracy and readability of the meeting summary.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, and Python, and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information for computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (7)

1. A method for generating a conference summary based on paralinguistic acoustic features, comprising the following steps:
converting conference audio into text to obtain a transcript, and extracting text-level key information using a deep learning model;
extracting acoustic-level key information from the conference audio to obtain content with salient speech features that meet set criteria, the salient speech features being determined from intonation, speech rate, and pause duration;
designing open-ended questions, with the goal of uncovering clues about the conference, to guide a large language model to answer them and obtain answer information;
integrating the text-level key information, the acoustic-level key information, the transcript, the open-ended questions, and the answer information, and inputting the result into the large language model to generate the conference summary;
wherein the content with salient speech features that meet the set criteria is obtained by the following steps:
for each speech segment of a speaker, where the subscript i indexes the speaker and the superscripts 1 to m index the speech segments, extracting the fundamental frequency of the segment and comparing it with the average fundamental frequency to obtain the degree of intonation variation;
for each speech segment, counting the words in its transcript and dividing by the segment's duration to obtain the speech rate;
for each speech segment, obtaining its transcript and computing the pause before each word from the start time and end time of each word;
based on the obtained intonation variation, speech rate, and pause times, extracting the content with salient speech features that meet the set criteria, comprising a set of high-intonation segments, a set of fast-speech segments, and a set of long-pause segments;
wherein the open-ended questions comprise: the participants of the conference, the topics discussed in the conference, the key events related to the topics, and the consensus reached on the topics;
wherein the input to the large language model is expressed in a set form, in which K denotes the text-level key information and the acoustic-level key information, Q denotes the open-ended questions, and further symbols denote the transcript and the answer information; and the text-level key information and the acoustic-level key information are assigned different set weights.
2. The method according to claim 1, wherein the content with salient speech features that meet the set criteria is obtained using K-Means clustering.
3. The method according to claim 1, wherein the text-level key information is extracted with the MDERank model, comprising: ranking candidate keywords by the embedding similarity between the source document and masked documents; and selecting a set number of the top-ranked keywords as the text-level key information.
4. The method according to claim 1, wherein the conference summary generation process further comprises providing an interface for user modification.
5. The method according to claim 1, wherein the text-level key information, the acoustic-level key information, the transcript, the open-ended questions, and the answer information are weighted and integrated by overlap in the time dimension.
6. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
7. A computer device comprising a memory and a processor, the memory storing a computer program runnable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 5 when executing the computer program.
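The salient-feature steps of claim 1 can be sketched in code. This is a minimal illustration, not the patented implementation: the `segments` record layout, the field names, and the thresholds (`pitch_ratio`, `rate_thresh`, `pause_thresh`) are all assumptions, since the claim only states that intonation, speech rate, and pause duration are compared against set criteria.

```python
from statistics import mean

def salient_segments(segments, pitch_ratio=1.2, rate_thresh=3.5, pause_thresh=0.8):
    """Flag segments whose paralinguistic features meet hypothetical set criteria.

    Each entry in `segments` is assumed to carry a mean fundamental frequency
    "f0" (Hz) and a word list of (text, start, end) timestamps in seconds.
    """
    avg_f0 = mean(s["f0"] for s in segments)  # speaker-level baseline pitch
    high_pitch, fast_rate, long_pause = [], [], []
    for s in segments:
        # Intonation: segment F0 relative to the speaker's average F0
        if s["f0"] / avg_f0 >= pitch_ratio:
            high_pitch.append(s["id"])
        # Speech rate: word count divided by the segment's duration
        duration = s["words"][-1][2] - s["words"][0][1]
        if len(s["words"]) / duration >= rate_thresh:
            fast_rate.append(s["id"])
        # Pause: gap between a word's start and the previous word's end
        for (_, start, _), (_, _, prev_end) in zip(s["words"][1:], s["words"]):
            if start - prev_end >= pause_thresh:
                long_pause.append(s["id"])
                break
    return high_pitch, fast_rate, long_pause
```

The three returned lists correspond to the claim's sets of high-intonation, fast-speech, and long-pause segments.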
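Claim 2 obtains the salient content with K-Means clustering rather than fixed thresholds. Below is a minimal one-dimensional, two-cluster sketch (Lloyd's algorithm); the cluster count and the choice of feature are assumptions, as the claim names only K-Means.

```python
def kmeans_1d(values, iters=20):
    """Two-cluster 1-D K-Means separating 'salient' from 'ordinary' feature
    values (e.g. per-segment pause durations). Returns the high-cluster
    centroid and the values assigned to it."""
    lo, hi = min(values), max(values)  # initialize centroids at the extremes
    for _ in range(iters):
        # Assign each value to the nearer centroid
        low_pts = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high_pts = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not low_pts or not high_pts:
            break
        # Update centroids to the cluster means
        lo, hi = sum(low_pts) / len(low_pts), sum(high_pts) / len(high_pts)
    return hi, [v for v in values if abs(v - lo) > abs(v - hi)]
```

Values falling in the high cluster would mark their segments as salient, replacing the fixed cut-offs of the threshold-based sketch.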
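The integration step of claim 1 (key information, transcript, open-ended questions, answers) might be assembled into a single prompt as follows. The section labels, the weight values, and the function name are illustrative assumptions; the claim specifies only that the text-level and acoustic-level key information receive different set weights.

```python
def build_prompt(text_keys, acoustic_keys, transcript, questions, answers,
                 w_text=0.6, w_acoustic=0.4):
    """Assemble the weighted input for the summarization LLM.

    The concrete weights and labels are hypothetical; the claim only
    requires that the two kinds of key information be weighted differently.
    """
    k_block = (
        f"[Key information | text weight={w_text}] {'; '.join(text_keys)}\n"
        f"[Key information | acoustic weight={w_acoustic}] {'; '.join(acoustic_keys)}"
    )
    # Pair each open-ended question with the model's earlier answer
    qa_block = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return (f"{k_block}\n[Transcript] {transcript}\n{qa_block}\n"
            "Generate a meeting summary from the above.")
```

The resulting string would then be passed to the large language model to produce the conference summary.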
CN202411880573.8A 2024-12-19 2024-12-19 A conference summary generation method based on paralinguistic acoustic features Active CN119807407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411880573.8A CN119807407B (en) 2024-12-19 2024-12-19 A conference summary generation method based on paralinguistic acoustic features


Publications (2)

Publication Number Publication Date
CN119807407A CN119807407A (en) 2025-04-11
CN119807407B true CN119807407B (en) 2025-10-17

Family

ID=95257750


Country Status (1)

Country Link
CN (1) CN119807407B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119003759A (en) * 2024-08-22 2024-11-22 航天物联网技术有限公司 Conference summary generation method based on large language model

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US11929074B2 (en) * 2021-02-11 2024-03-12 Dell Products L.P. Automatically generating a meeting summary for an information handling system
CN117786098B (en) * 2024-02-26 2024-05-07 深圳波洛斯科技有限公司 Telephone recording abstract extraction method and device based on multi-mode large language model


Non-Patent Citations (1)

Title
Sameer Maskey et al. Comparing Lexical, Acoustic/Prosodic, Structural and Discourse Features for Speech Summarization. INTERSPEECH 2005, 2005, pp. 621-624. *


Similar Documents

Publication Publication Date Title
US11990132B2 (en) Automated meeting minutes generator
US11545156B2 (en) Automated meeting minutes generation service
US9501470B2 (en) System and method for enriching spoken language translation with dialog acts
Vinnarasu et al. Speech to text conversion and summarization for effective understanding and documentation
WO2019100350A1 (en) Providing a summary of a multimedia document in a session
CN101075435A (en) Intelligent chatting system and its realizing method
US20210056956A1 (en) Data-driven and rule-based speech recognition output enhancement
CN110765270B (en) Training method and system of text classification model for spoken language interaction
Blache et al. The corpus of interactional data: A large multimodal annotated resource
US20210264812A1 (en) Language learning system and method
CN103885924A (en) Field-adaptive automatic open class subtitle generating system and field-adaptive automatic open class subtitle generating method
CN119961482B (en) Method and system for audio semantic retrieval of law enforcement recorder based on retrieval enhancement
CN114503100A (en) Method and device for labeling emotion related metadata to multimedia file
CN119807407B (en) A conference summary generation method based on paralinguistic acoustic features
Marklynn et al. A framework for abstractive summarization of conversational meetings
CN119088952A (en) Meeting minutes generation method, electronic device and readable storage medium
Mai et al. Mnv-17: A high-quality performative mandarin dataset for nonverbal vocalization recognition in speech
CN119011750A (en) Method for editing video, related device and computer program product
Muppidi et al. Automatic meeting minutes generation using Natural Language processing
Raut et al. An extensive survey on audio-to-text and text summarization for video content
CN120340475A (en) Meeting minutes generation method, device, computer readable medium and electronic device
Wang et al. Incorporating contextual paralinguistic understanding in large speech-language models
Svongoro et al. Some methodological issues in language research: Dealing with transcribed interpreted courtroom data
Minker et al. Spoken dialogue systems technology and design
Islam et al. Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant