Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Existing conference abstract generation models focus almost exclusively on the textual information in the speech transcript, while paralinguistic information produced during the meeting, such as intonation, body language, and facial expression, is ignored. These paralinguistic cues, however, can convey richer information and aid in understanding the context and perceiving critical information. The present invention is the first to explore generating the conference abstract using acoustic modality information: it extracts key speech-level information from the meeting recording using highly interpretable machine learning methods, for example spoken content emphasized through higher intonation, deliberate pauses, and faster speech rate, and the approach adapts well to different speakers.
In general, referring to fig. 1, the provided method for generating a meeting abstract based on paralinguistic acoustic features first transcribes a segment of a meeting recording (A) into text (B), and then processes (C) the recording and the transcribed text to extract the acoustic features of the speech. This processing yields text with key-phrase labels (including importance levels), which is then processed (D) by a large model to output the final meeting abstract (E).
Specifically, as shown in fig. 1 and fig. 2, the provided conference abstract generating method based on the acoustic features of the secondary language comprises the following steps:
Step S1, converting conference audio into text, and extracting key information at the text level using a deep learning model.
A key phrase is the core of a text segment and is important to the overall semantics of the text. Research shows that keywords are the core parts of sentences and can serve as indicators for selecting important sentences. To avoid, as far as possible, the resource waste caused by training a model, and to achieve efficient meeting abstract generation, an unsupervised model can be adopted to extract key phrases.
In one embodiment, extracting text key information includes first transcribing the audio into text and then applying an unsupervised keyword extraction model to obtain the key information. For example, the keyword extraction model may adopt the MDERank model (a BERT-based keyword extraction method), which ranks candidate keywords by the similarity between the BERT embeddings of the source document and of masked documents, giving strong performance and generalization capability. Herein, for each speaker, K_i denotes that speaker's set of text keywords.
It should be noted that other deep learning models, such as KeyBERT or TextRank, may also be used to extract key information from the text.
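The MDERank-style ranking described above can be sketched as follows. This is a minimal illustration only: a bag-of-words cosine similarity stands in for the BERT embedding similarity, and all function names and data are hypothetical.

```python
from collections import Counter
from math import sqrt

def bow_vec(text):
    """Bag-of-words vector; a crude stand-in for a BERT document embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_keyphrases(document, candidates):
    """MDERank-style ranking: mask each candidate phrase and score it by how
    much the document representation changes; a larger drop in similarity
    means the phrase carried more of the document's meaning."""
    doc_vec = bow_vec(document)
    sim_after_mask = {c: cosine(doc_vec, bow_vec(document.replace(c, "[MASK]")))
                      for c in candidates}
    return sorted(candidates, key=lambda c: sim_after_mask[c])

document = "budget review budget review meeting notes"
candidates = ["budget review", "meeting notes", "notes"]
ranked = rank_keyphrases(document, candidates)
# masking "budget review" changes the document the most, so it ranks first
```

In the actual model the masked and unmasked documents would be embedded with BERT rather than bag-of-words counts; the ranking logic is otherwise the same.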
Step S2, extracting key information at the acoustic level from the conference audio, which mainly serves to highlight text content with large intonation and speech-rate changes and obvious pauses.
For example, the fundamental frequency (f0) information of the speech is extracted from the meeting recording, the speech rate of each sentence is calculated, and a clustering algorithm is used to select, for each speaker, the parts with large intonation and speech-rate variation and obvious pauses as the key information at the acoustic level.
As shown in fig. 3, the left graph shows a normal intonation and the right graph an emphasized intonation; the fundamental frequency f0 of the right graph is clearly higher overall and fluctuates more.
In one embodiment, extracting key information for an acoustic layer includes the steps of:
In step S201, the fundamental frequency is extracted.
The fundamental frequency (f0) of speech is a key indicator of intonation change. It reflects the lowest frequency component of the sound and is used to analyze the rise and fall of intonation. For example, for all speech segments of each speaker, the fundamental frequency values of each segment are first extracted with the pYIN fundamental frequency extraction algorithm, and the average fundamental frequency of each segment is then calculated for subsequent intonation analysis and processing.
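In practice the pYIN algorithm (e.g. `librosa.pyin`) would perform this extraction; as a self-contained illustration of per-frame f0 estimation, the following sketch substitutes a crude autocorrelation estimator, with all names and parameters hypothetical.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=65.0, fmax=500.0):
    """Crude autocorrelation-based f0 estimate for one voiced frame
    (a simplified stand-in for pYIN, which also handles voicing decisions)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_lo = int(sr / fmax)              # shortest pitch period considered
    lag_hi = int(sr / fmin)              # longest pitch period considered
    best_lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
    return sr / best_lag

sr = 16000
t = np.arange(4000) / sr                 # a quarter second of audio
tone = np.sin(2 * np.pi * 220.0 * t)     # synthetic 220 Hz test tone
f0 = estimate_f0(tone, sr)               # close to 220 Hz
```

Averaging such per-frame estimates over a segment gives the segment's mean f0 used in the intonation analysis.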
Step S202, calculating the speech rate.
The speech rate is an important indicator of the fluency and understandability of speech. For example, the speech rate is calculated as the ratio of the number of words in a speech segment to the segment's duration. Specifically, for each speech segment, the number of words in its transcribed text is first counted, the segment's duration is then computed, and the speech rate is finally obtained as the ratio of the two.
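The speech-rate calculation above amounts to a one-line ratio; the following sketch (with illustrative names and timestamps) shows it for a single segment.

```python
def speech_rate(transcript, start, end):
    """Words per second for one speech segment, given its transcribed text
    and its start/end times in seconds."""
    n_words = len(transcript.split())
    duration = end - start
    return n_words / duration

rate = speech_rate("the quarterly numbers look very strong", 10.0, 12.0)
# 6 words over 2.0 seconds -> 3.0 words per second
```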
In step S203, a pause time is calculated.
Pauses play an important role in speech and can effectively reinforce the expression of important content. For example, pauses are handled with a word-unit-based pause-time calculation: for each speech segment, its transcribed text is first obtained, and the pause time before each word is then calculated from the start time and end time of each word.
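Given word-level timestamps from the transcription, the word-unit pause computation can be sketched as follows (the data structure and names are illustrative).

```python
def pauses_before_words(words):
    """Pause duration before each word of a segment, given a list of
    (word, start, end) timestamps in seconds; the first word gets 0.0."""
    gaps, prev_end = [], None
    for _word, start, end in words:
        gaps.append(0.0 if prev_end is None else max(0.0, start - prev_end))
        prev_end = end
    return gaps

segment = [("so", 0.0, 0.3), ("the", 0.35, 0.5), ("deadline", 1.4, 1.9)]
gaps = pauses_before_words(segment)
# roughly [0.0, 0.05, 0.9]: the long gap before "deadline" marks a pause
```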
Step S204, extracting speech segments with significant features based on intonation, speech rate, and pause duration.
To automatically identify which speech features (e.g., high intonation, fast speech, long pauses) meet the criteria of "high", "fast", and "long", a clustering method from unsupervised learning may be employed. Because the natural variation across speakers makes a fixed threshold unsuitable in general, in one embodiment a K-Means clustering algorithm is employed to automatically identify and extract speech segments with significant features. For example, for each speaker S_i, three feature sets are obtained: a high-intonation segment set, a fast-speech-rate segment set, and a long-pause segment set.
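A minimal sketch of this per-speaker clustering step, using a one-dimensional two-cluster K-Means on mean f0 values (scikit-learn's `KMeans` could equally be used; the data here are illustrative):

```python
def high_cluster(values, iters=20):
    """1-D K-Means with K=2: returns the indices of the values belonging to
    the cluster with the higher centroid, so "high" is defined relative to
    this speaker rather than by any fixed global threshold."""
    lo, hi = min(values), max(values)        # centroid initialization
    assign = [False] * len(values)
    for _ in range(iters):
        assign = [abs(v - hi) < abs(v - lo) for v in values]
        hi_vals = [v for v, a in zip(values, assign) if a]
        lo_vals = [v for v, a in zip(values, assign) if not a]
        if hi_vals:
            hi = sum(hi_vals) / len(hi_vals)
        if lo_vals:
            lo = sum(lo_vals) / len(lo_vals)
    return [i for i, a in enumerate(assign) if a]

# illustrative per-segment mean f0 values (Hz) for one speaker
f0_means = [118.0, 122.0, 120.0, 180.0, 175.0, 119.0]
high_pitch_segments = high_cluster(f0_means)   # segments 3 and 4 stand out
```

Running the same procedure on speech-rate and pause-duration values yields the fast-speech and long-pause segment sets for the speaker.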
In summary, by focusing on acoustic modality information, a number of important acoustic features are extracted. These features help in understanding the different emotions, attitudes, and key information of conference participants. When people want to stress a point of view or emphasize specific information, they typically raise their intonation, making keywords or key phrases more prominent in the speech and thus more important within the whole sentence or paragraph. Such intonation variation helps listeners better understand and remember the information conveyed.
Step S3, guiding the large language model to answer designed open questions, so as to mine the clues and important information of the meeting.
In order to generate a more specialized meeting summary with a large language model, a chain-of-thought design is used to optimize the prompts fed to the large language model.
For example, the LLM is guided to answer multiple questions so as to mine the clues and important information of the meeting; analysis shows that open questions elicit information more effectively than structured questions. Thus, in one embodiment, four open questions (Q) are designed to help the LLM mine the critical information and clues in the meeting. The four designed questions are as follows:
Q1: Who are the participants of the meeting?
Q2: Which topics does this meeting discuss?
Q3: Which key events related to the topics occurred during the meeting?
Q4: Which consensus was reached on the relevant topics during the meeting?
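A minimal sketch of how the four guiding questions might be assembled into a probe prompt for the LLM; the function name, wording, and layout are all hypothetical.

```python
GUIDING_QUESTIONS = [
    "Q1: Who are the participants of the meeting?",
    "Q2: Which topics does this meeting discuss?",
    "Q3: Which key events related to the topics occurred during the meeting?",
    "Q4: Which consensus was reached on the relevant topics?",
]

def build_probe_prompt(transcript):
    """Splice the meeting transcript and the four open questions into one
    prompt, so the LLM's answers can be collected for later integration."""
    return ("Meeting transcript:\n" + transcript +
            "\n\nAnswer the following open questions about the meeting:\n" +
            "\n".join(GUIDING_QUESTIONS))

prompt = build_probe_prompt("A: Let's review the budget. B: Agreed, item one is hiring.")
```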
Designing open questions in this way gives the method cross-content applicability and effectiveness.
Step S4, integrating the text-level key information, the acoustic-level key information, the transcribed text, and the guiding questions to generate the abstract.
In this step, the final summary is generated by integrating the text key information, the speech key information, the transcribed text, and the guiding questions. For example, the LLM is provided with the text keywords, together with the content showing larger intonation and speech-rate fluctuation and long pauses, as key information, and different weights are assigned to these pieces of key information. The key information is then weighted by its overlap in the time dimension to obtain the final importance of the text, which is embedded into the prompt.
Specifically, the LLM's answers (R) to the four questions are used for further information integration. The key information (covering both the text level and the speech level), the meeting transcript, the guiding questions, and the LLM answers are then spliced to form a new input (K, D, Q, R) and fed into the LLM to generate the final summary; note that the text-level and acoustic-level key information are assigned different weights.
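The splicing and weighting step can be sketched as follows. The weight values, prompt wording, and all names are hypothetical; the point illustrated is that a phrase flagged by both the text level and the acoustic level accumulates both weights and is therefore ranked first.

```python
def build_summary_prompt(text_keys, acoustic_keys, transcript, questions, answers,
                         w_text=1.0, w_acoustic=0.5):
    """Splice weighted key information (K), transcript (D), guiding
    questions (Q), and LLM answers (R) into the final summarization prompt."""
    weight = {}
    for k in text_keys:
        weight[k] = weight.get(k, 0.0) + w_text
    for k in acoustic_keys:
        weight[k] = weight.get(k, 0.0) + w_acoustic
    ranked = sorted(weight, key=weight.get, reverse=True)
    return "\n".join([
        "Key information (most important first): " + "; ".join(ranked),
        "Transcript: " + transcript,
        "Guiding questions: " + " ".join(questions),
        "Answers: " + " ".join(answers),
        "Using all of the above, write the final meeting summary.",
    ])

prompt = build_summary_prompt(
    text_keys=["budget cut", "Q3 roadmap"],
    acoustic_keys=["budget cut", "launch delay"],   # stressed or paused content
    transcript="A: The budget cut means the launch delay is unavoidable.",
    questions=["Q1: Who are the participants?"],
    answers=["A1: Alice and Bob."],
)
# "budget cut" is flagged at both levels, so it is listed first
```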
To sum up, in order to improve the LLM's adaptability to different meeting content and reduce training cost, in keeping with the original purpose of using an LLM, a weighting method is adopted to integrate the text key information and the speech key information, and guiding questions related to the meeting elements are embedded to constrain and guide the LLM toward generating higher-quality meeting summaries. It should be noted that this pre-LLM constraint approach can transfer seamlessly to more advanced LLMs.
Further, the entire meeting summary generation flow may be packaged, and an application, the Paralinguistic-SUM application, may be developed based on the human-machine collaboration concept. Throughout the summary generation process, the user can participate in and interact with the system.
Because artificial intelligence (AI) can make mistakes, and each user's interpretation of the results is subjective, a user may not be fully satisfied with the labeling results given by the AI. If users are given no opportunity to adjust the labeling results, the quality of the generated meeting summary and user satisfaction may suffer directly. For these reasons, the labeling results are kept open to adjustment by the user. Providing the user with the ability to adjust the annotation results ensures the quality of the meeting abstract and high user satisfaction.
In order to reduce the interaction burden between the user and the system and improve ease of use, the interaction interface of the Paralinguistic-SUM application includes synchronized display of audio playback and transcribed text, key-phrase annotation adjustment, and answer modification functions, among others. For example, interface A provides browsing of the conference audio, supports variable-speed playback, and shows the synchronized transcription. Interfaces B and C respectively let the user adjust the key phrases and the large language model's replies to the guiding questions.
Specifically, for the synchronized audio-playback and transcript display function, the application allows the user to browse the conference audio and supports variable-speed playback; the transcript is synchronized with audio playback, with timestamps added. For the key-phrase annotation adjustment function, the user may modify the AI-generated key phrases; annotated key phrases are distinguished by background color, e.g., red for highest importance, yellow for medium importance, and green for lowest importance. The user may change a phrase's importance level where adjustment is needed, add new key phrases, or remove existing key-phrase labels. For the answer modification function, the user can modify the AI's answers to the guiding questions, ensuring that the generated summary better matches the actual content. The application tags the AI-generated annotations to help distinguish them from the user-modified portions.
In summary, since the invention is a tool that cooperates with humans and can make errors, users can exercise their own judgment by participating in the meeting abstract generation process, thereby improving the quality of the generated abstract. The designed application can be embedded in a smartphone or other smart device, allowing users to easily inject their own understanding and knowledge; it provides the user with a channel to modify all content and offers friendly interaction.
To further verify the effect of the present invention, for the same piece of text, the result generated using the paralinguistic acoustic features is compared with the result generated without them. Analysis of the comparison shows that, because it takes acoustic features such as speech rate and intonation into account, the abstract generated using the paralinguistic acoustic features conveys the speaker's intent more faithfully.
In summary, the invention provides a method, based on paralinguistic acoustic features, for enhancing the meeting summarization capability of a large language model; by integrating paralinguistic acoustic feature information into the meeting abstract generation process, it makes full use of the emotion, attitude, importance, and other information carried by the speech. Using the acoustic modality information to assist in generating the meeting abstract effectively improves the accuracy and readability of the abstract.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or Python, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.