Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is to be understood that the described embodiments are merely some, and not all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
Example 1
In accordance with the present embodiment, there is provided an embodiment of a method of recording voice call content. It is noted that the steps illustrated in the flowchart of the figure may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from that described herein.
The method embodiments provided by the present embodiment may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Fig. 1 illustrates a block diagram of a hardware architecture of a computing device for implementing a method of recording voice call content. As shown in Fig. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in Fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in Fig. 1, or have a different configuration than shown in Fig. 1.
It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the disclosed embodiments, the data processing circuitry may serve as a processor control (e.g., controlling selection of a variable-resistance termination path coupled to an interface).
The memory may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the method for recording voice call content in the embodiments of the present disclosure; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the above method for recording voice call content. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by the communication provider of the computing device. In one example, the transmission device includes a Network Interface Controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The display may be, for example, a touch-screen Liquid Crystal Display (LCD) that may enable a user to interact with the user interface of the computing device.
It should be noted here that, in some alternative embodiments, the computing device shown in Fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that Fig. 1 is only one particular example and is intended to illustrate the types of components that may be present in a computing device as described above.
In the operating environment described above, according to a first aspect of the present embodiment, a method of recording voice call content is provided. Fig. 2 shows a flow diagram of the method, which, with reference to fig. 2, comprises:
S202: during a voice call between a robot program and a target user, determining text information corresponding to a part of the audio of the voice call, where the part of the audio includes voice broadcast by the robot program during at least one voice broadcast period and/or voice received from the target user during at least one voice reception period;
S204: determining start time information corresponding to a start point of the text information and end time information corresponding to an end point of the text information according to time information corresponding to the voice of the at least one voice broadcast period and/or the at least one voice reception period; and
S206: recording the start time information and the end time information of the text information.
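Steps S202 to S206 above can be sketched as follows. This is an illustrative Python sketch only; in the embodiment these operations are performed by a Lua script, and the record structure is an assumption introduced for illustration:

```python
def record_text_timing(segments):
    """Illustrative sketch of steps S202-S206.

    `segments` is a list of (text, start_ms, end_ms) tuples, one per voice
    broadcast or voice reception period; the timestamps are the absolute
    millisecond times at which that voice starts and ends (S204).
    """
    records = []
    for text, start_ms, end_ms in segments:
        # S202: the text corresponding to this part of the audio
        # S206: record its start and end time information
        records.append({"text": text, "start": start_ms, "end": end_ms})
    return records

records = record_text_timing([("Hello, this is the robot.", 1000, 3000)])
```

Each recorded entry thereby associates one party's utterance in a round of dialogue with its time information.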
As described in the background art, if a user wants to listen to the recording corresponding to a certain piece of text (for example, because the recording is long and the user does not want to listen to all of it), the user can only drag the progress bar manually, which is inefficient and inaccurate. Secondly, if the whole call is recorded in segments, the recording, storage and playback of the audio depend on the Java application and FS components of the call center, so the modification is large and costly, and there are problems of compatibility with historical recordings and of storing and matching multiple recording segments. In addition, the whole recording could be cut afterwards: manual cutting is far too costly to be practical, while automatic processing with a speech algorithm is expensive to develop, tune and bring online, has a long cycle, and cannot guarantee 100% robustness and accuracy; the cut multi-segment recordings also raise the same problems of compatibility with historical recordings and of storing and matching multiple segments.
In view of this, when the user needs to play part of the audio information of an intelligent voice call, the computing device may, while recording the audio of the call, associate part of the text information in the call with the audio information corresponding to that text, so that the position of the corresponding audio within the whole call audio can later be found and played by clicking the text. First, during the voice call between the robot program and the target user, the computing device determines text information corresponding to a part of the audio of the call (the content of one round of the dialogue between the robot program and the target user), where the part of the audio includes voice broadcast by the robot program during at least one voice broadcast period and/or voice received from the target user during at least one voice reception period (corresponding to step S202). For example, during an intelligent voice call, the robot and the target user may have multiple rounds of dialogue, and the computing device may determine the text information corresponding to the audio information of one of those rounds, thereby associating the audio information with its corresponding text information.
Further, the computing device determines start time information corresponding to a start point of the text information and end time information corresponding to an end point of the text information, based on time information corresponding to the voice of the at least one voice broadcast period and/or the at least one voice reception period (corresponding to step S204). For example, the computing device may determine the start time information and the end time information of the text information corresponding to the voice by recording the start and end millisecond timestamps of the voice of the at least one voice broadcast period and/or the at least one voice reception period. Here, the start time information and the end time information may be timestamp information.
Further, the computing device may record the start time information and the end time information of the text information (corresponding to step S206). The time position, within the voice call, of the audio information corresponding to the text information can then conveniently be determined later from the start time information and the end time information.
Further, the start time information and the end time information of the text information may be in the form of timestamps, and the start timestamp of the voice call may also be recorded. Later, when the audio information corresponding to the text is to be determined, the time position of that audio information can be determined from these timestamps.
Therefore, according to the above manner, during the voice call, the computing device determines the text information of a section of voice of a broadcast period or a reception period, that is, it associates the voice with its corresponding text information. The computing device then records the start time information and the end time information of the text information by recording the time information of the voice. Hence, the time position, within the voice call, of the voice corresponding to the text information can conveniently be determined later from the start time information and the end time information, and the user can locate the corresponding voice by clicking the text information. The method achieves the technical effects of not modifying the recording, storage and playback code of the call center, not performing post-hoc cutting with an algorithm, and ensuring system stability. Meanwhile, the method reduces the cost of the intelligent voice modification, is compatible with historical recordings, and makes the correspondence and storage of voice and text easy to realize, so it is easier to popularize and deploy. It thereby solves the technical problems of the existing way of associating recordings with call records in intelligent calls in the prior art, namely high cost, poor compatibility with historical recordings, and difficulty in storing and matching multi-segment recordings.
In addition, the part of the audio and the voice mentioned above each refer to the content spoken by one party in a round of dialogue.
Further, the computing device may perform the operations of the present invention using the Lua scripting language. For example, when the target user answers a call, the Lua script issues an instruction to start recording.
Optionally, this embodiment further includes: recording call audio information during the voice call; and the recording of the start time information and the end time information of the text information further includes: recording a start time node and an end time node, within the call audio information, of the voice corresponding to the text information.
Specifically, the computing device may record the entire voice call and determine the call audio information corresponding to it. Secondly, the computing device can record the start time node and the end time node, within the call audio information, of the voice corresponding to the text information (the content spoken by one party in a round of dialogue). In this way, when the text information is clicked later, the time position of the corresponding voice within the call audio information is easily located, so that the voice can be played accurately.
Optionally, the operation of determining start time information corresponding to a start point of the text information and end time information corresponding to an end point of the text information according to time information corresponding to the voice of at least one voice broadcast period and/or at least one voice reception period includes: determining the start time information and the end time information of the text information according to a first start timestamp and a first end timestamp corresponding to the voice of the at least one voice broadcast period and/or the at least one voice reception period.
Specifically, the computing device may determine the start and end timestamps of the text information from the first start timestamp at which the voice begins and the first end timestamp at which it ends, so that the start time information and the end time information of the text information can be determined. For example, the start time information may be Beijing time 9:05 and the end time information Beijing time 9:06. Therefore, in this way, the start time information and the end time information of the text information can be determined, and the voice corresponding to the text information can be accurately located later.
Optionally, the operation of recording a start time node and an end time node, within the call audio information, of the voice corresponding to the text information includes: recording a second start timestamp of the voice call; and determining the start time node and the end time node of the voice within the call audio information according to the second start timestamp, the first start timestamp and the first end timestamp.
In particular, the computing device may record a second start timestamp for the start of the voice call (i.e., a start timestamp of the call audio information, e.g., 9:00). Then, the computing device can determine the time position of the voice within the call audio information from the first start timestamp and the first end timestamp of the voice; that is, clicking the text information finds the time position of the corresponding voice within the call audio information, so the voice can be played accurately. The technical effect of accurately locating, within the call audio information, the time position of the voice corresponding to the text information is thereby achieved.
Optionally, the method further includes calculating the start time node and the end time node by the following formulas: start time node = first start timestamp - second start timestamp; and end time node = first end timestamp - second start timestamp.
Specifically, for example, the start timestamp of the voice call is 9:00, the first start timestamp of the voice is 9:05, and the first end timestamp of the voice is 9:07. The computing device can obtain, through the above formulas, the time node position of the voice within the call audio information; that is, the fifth through seventh minutes of the call audio information are this voice. Recording the time nodes of a voice segment within the call audio information in this manner thus achieves the technical effect of accurately locating the time position of the voice corresponding to the text information.
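Using the example figures above (call starts at 9:00, voice from 9:05 to 9:07), the relative time nodes can be computed exactly as the formulas state. A minimal Python sketch, with the timestamps expressed in milliseconds measured from 9:00 for simplicity:

```python
def relative_time_nodes(first_start_ts, first_end_ts, second_start_ts):
    """start node = first start timestamp - second start timestamp;
    end node   = first end timestamp   - second start timestamp."""
    return first_start_ts - second_start_ts, first_end_ts - second_start_ts

call_start = 0              # second start timestamp (9:00)
voice_start = 5 * 60_000    # first start timestamp (9:05)
voice_end = 7 * 60_000      # first end timestamp (9:07)
start_node, end_node = relative_time_nodes(voice_start, voice_end, call_start)
# the voice occupies minutes 5 through 7 of the call audio information
```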
Optionally, the method further comprises the following operations for realizing the voice broadcast: recognizing preset robot interaction text information using a preset speech synthesis algorithm and converting it into speech; and broadcasting the speech through the robot program.
Specifically, the computing device recognizes preset robot interaction text information using a preset speech synthesis algorithm and converts it into speech. The speech synthesis algorithm may be TTS (Text To Speech) synthesis, which converts the robot's language text into speech output. The speech is then broadcast through the robot program. The conversion from text to speech is thus completed in the above manner.
Optionally, the operation of determining text information corresponding to a part of audio in the voice call process includes: and performing voice recognition on the voice received from the target user by using a preset voice recognition algorithm and generating text information.
Specifically, the computing device performs speech recognition on the speech received from the target user using a preset speech recognition algorithm and generates text information. The speech recognition algorithm may be ASR (Automatic Speech Recognition), which converts the target user's speech signal into text for processing by the flow engine. The technical effect of converting the outbound-call target user's speech into text is thus achieved.
In addition, the computing device associates each part of the audio information with text information during the call by calling TTS and ASR through Lua. When the robot program needs to broadcast voice, TTS is called; after the robot program finishes a piece of content to broadcast, the computing device calls ASR through Lua to start recognizing the voice of the target user. The computing device calls TTS and ASR alternately in a loop, thereby realizing the associated storage of text information and audio information throughout the call.
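The alternating TTS/ASR loop described above can be sketched as follows. In the embodiment this loop is a Lua script; here it is an illustrative Python sketch in which `tts_broadcast`, `asr_recognize`, and `flow_engine_next` are hypothetical blocking stand-ins, not real APIs:

```python
import time

def now_ms():
    """Current millisecond timestamp (absolute time)."""
    return int(time.time() * 1000)

def call_loop(tts_broadcast, asr_recognize, flow_engine_next, first_text):
    """Alternate TTS and ASR calls, timestamping each segment (sketch).

    `tts_broadcast` and `asr_recognize` are hypothetical blocking stand-ins
    for the real TTS/ASR calls; `flow_engine_next` maps the user's words to
    the robot's next sentence, or None when the flow reaches an end node.
    """
    t_record = now_ms()              # recording start absolute time
    rounds, text = [], first_text
    while text is not None:
        t_start_tts = now_ms()       # just before the blocking TTS call
        tts_broadcast(text)          # returns only when the broadcast ends
        t_end_tts = now_ms()
        t_start_asr = now_ms()       # just before the blocking ASR call
        user_text = asr_recognize()  # returns when the user stops speaking
        t_end_asr = now_ms()
        rounds.append({
            "tts_text": text,
            "t_st": t_start_tts - t_record, "t_et": t_end_tts - t_record,
            "asr_text": user_text,
            "t_sa": t_start_asr - t_record, "t_ea": t_end_asr - t_record,
        })
        text = flow_engine_next(user_text)
    return rounds
```

Because the TTS and ASR calls block, subtracting `t_record` from the timestamps taken around them yields each segment's relative position in the recording.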
Further, referring to Fig. 1, according to a second aspect of the present embodiment, there is provided a storage medium. The storage medium comprises a stored program, wherein any of the methods described above is performed by a processor when the program runs.
In addition, fig. 3 shows another flowchart of recording voice call content according to the embodiment of the present application, and referring to fig. 3, specific steps of the flowchart are as follows:
1) After the intelligent voice robot program starts interacting with a target user (whether the call is outbound or inbound, and whether it uses telephone or network voice; the design is universal), Lua issues an instruction to start recording and simultaneously records the current millisecond timestamp T_Record as the absolute start time of the recording;
2) Lua obtains, through the flow engine, the opening text TTS_TEXT_1 the robot is to speak and calls TTS to broadcast it, recording the current millisecond timestamp T_Start_TTS_1 just before the call as the absolute broadcast start time of TTS_1. Because the TTS call is blocking, Lua cannot execute subsequent instructions during the broadcast; after the broadcast ends, Lua records the current millisecond timestamp T_End_TTS_1 as the absolute broadcast end time of TTS_1. Referring to T_Record above, the relative time points at which the TTS broadcast starts and ends within the recording (i.e., at which millisecond of the recording the voice corresponding to the text begins and at which millisecond it ends) can be calculated; that is, the relative time points of the TTS_1 broadcast within the recording are: start: T_ST_1 = T_Start_TTS_1 - T_Record; end: T_ET_1 = T_End_TTS_1 - T_Record;
3) After the TTS broadcast, Lua calls ASR to start recognizing the target user's voice, recording the current millisecond timestamp T_Start_ASR_1 just before the call as the absolute start time of the ASR_1 recognition. When the target user finishes speaking and ASR returns the recognition result to Lua, the current millisecond timestamp T_End_ASR_1 is recorded as the absolute end time of the ASR_1 recognition. The relative time points within the recording of the ASR_1 spoken by the target user are: start: T_SA_1 = T_Start_ASR_1 - T_Record; end: T_EA_1 = T_End_ASR_1 - T_Record;
4) The text content spoken by the robot in TTS_1 is TTS_TEXT_1, and the text result of the target user's speech in ASR_1 is ASR_TEXT_1. Lua calls the flow engine api interface with the input parameters TTS_TEXT_1, T_ST_1, T_ET_1, ASR_TEXT_1, T_SA_1 and T_EA_1. The flow engine records the recording start and end points corresponding to the two texts, calls the intention recognition algorithm interface with ASR_TEXT_1 (i.e., the words spoken by the target user), and, according to the target user's intention and the pre-configured flow, returns to Lua the next sentence TTS_TEXT_2 to be spoken by the robot;
5) Lua executes steps 2) to 4) in a loop, with the parameters correspondingly becoming TTS_TEXT_i, T_ST_i, T_ET_i, ASR_TEXT_i, T_SA_i and T_EA_i, until the flow engine reaches an end node and tells Lua to hang up after TTS_TEXT_n is played, or the target user actively hangs up the telephone;
6) Because the start and end time points of the two texts of one round of dialogue (one sentence by the robot and one by the target user) are passed to the flow engine together after the target user finishes speaking, at the hang-up node, where there is no target user speech to recognize, one additional request to the flow engine api is needed to pass back T_ST_n and T_ET_n of the last sentence spoken by the machine;
7) If the target user actively hangs up in the m-th round of interaction, one additional request to the flow engine api is likewise needed to pass back the machine's last TTS_TEXT_m, T_ST_m and T_ET_m, together with ASR_TEXT_m, T_SA_m and T_EA_m (the last three items may be omitted if the target user did not speak);
8) After the whole call ends, the intelligent voice Java backend generates a complete call record (including the chat transcript, the start and end times of each piece of text within the recording, and the recording link). The front-end page displays the complete call record based on the chat transcript, the start/end times and the recording link; when the user clicks a piece of text in the chat transcript, playback jumps to the corresponding position.
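The final jump-to-position behaviour in step 8) can be sketched as a lookup over the recorded start/end nodes. This is an illustrative sketch only; the entry structure is an assumption standing in for the call record produced in step 4):

```python
def seek_offset_for_text(call_record, clicked_index):
    """Millisecond offset in the whole recording at which the audio for the
    clicked chat-record entry begins (sketch).

    `call_record` is a list of entries such as
    {"text": ..., "start_node": ms, "end_node": ms}.
    """
    return call_record[clicked_index]["start_node"]

record = [
    {"text": "Hello, this is the robot.", "start_node": 0, "end_node": 2400},
    {"text": "Hi, who is this?", "start_node": 2400, "end_node": 4100},
]
offset = seek_offset_for_text(record, 1)  # user clicks the second text
# the front-end player then seeks to this offset in the whole recording
```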
Thus, in the above manner, 1) the following problem is solved: the recording of an intelligent voice call record is a single whole segment, the corresponding audio cannot be located automatically by clicking the text of the chat record, and the user can only drag the progress bar manually, which is inefficient and inaccurate. 2) The scheme automatically records the start and end time points of each utterance (of both the target user and the robot) in the voice interaction without substantially modifying the existing recording, storage and playback of intelligent voice call records, and returns them to the flow engine through the api, so that the front-end page, through the corresponding interface, can automatically locate the corresponding audio when a piece of text in the chat record is clicked. The method is precise, convenient to implement, and conducive to rapid deployment in related service scenarios. It avoids the defects of the prior art: segmented recording of the whole call (since recording, storage and playback depend on the Java application and FS components of the call center, the modification is large and costly, and there are problems of compatibility with historical recordings and of storing and matching multi-segment recordings); or post-hoc cutting of the whole recording (manual cutting is far too costly to be practical, while automatic processing with a speech algorithm is expensive to develop, tune and bring online, has a long cycle, and cannot guarantee 100% robustness and accuracy, and the cut multi-segment recordings raise the same compatibility, storage and matching problems).
Fig. 4 is a functional block diagram illustrating recording voice call content according to an embodiment of the present application, and the following is described with reference to fig. 4:
1) The target user is connected to the intelligent voice Lua script through the call center + FS;
2) Upon entering Lua, Lua starts recording and records the recording start time;
3) Lua requests the flow engine api to obtain the robot's utterance to broadcast, calls TTS, and records the call's start and end times;
4) Lua calls ASR, obtains the transcribed text of the user's speech, and records the ASR start and end times;
5) Lua calls the time processing module to convert the TTS and ASR start/end times into relative times within the recording, and returns to step 3), unless the flow reaches the hang-up node or the target user actively hangs up;
6) Lua calls the post-hang-up processing module to pass to the flow engine api the start/end times of any utterances not yet passed (e.g., those of the hang-up round).
In addition, the start and end time points of each utterance (of both the target user and the robot) in the voice interaction are recorded directly by Lua; the recording, storage and playback code of the call center need not be modified, no algorithm is needed for post-hoc cutting, and system stability is ensured. This reduces the cost of the intelligent voice modification and makes the scheme easier to popularize and deploy. The scheme adopted by the invention is robust, considers and covers a variety of scenarios, and is easy to implement and convenient to deploy.
In addition, FreeSWITCH (FS for short) is open-source telephony softswitch platform software widely used in call centers and intelligent voice. Lua is a lightweight scripting language; FreeSWITCH commonly uses Lua scripts to implement business logic. TTS: Text To Speech, speech synthesis that converts the robot's language text into speech output. ASR: Automatic Speech Recognition, which converts the target user's speech signal into text for processing by the flow engine. Flow engine: a module of the intelligent chat system that edits flows and scripts, provides the corresponding services, and gives the robot's next utterance according to the words spoken by the target user. Timestamp: generally the number of seconds (or milliseconds) elapsed from 00:00:00 on 1 January 1970 to the current time point, which can be used to represent the current absolute time.
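As a concrete illustration of the millisecond timestamp defined above (a minimal Python sketch, not part of the embodiment):

```python
import time

# an example millisecond timestamp and its human-readable UTC time
ms_ts = 1_600_000_000_000          # milliseconds since 1 January 1970 UTC
seconds = ms_ts // 1000            # second-level timestamp, as noted below
readable = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(seconds))
# 1_600_000_000_000 ms after the epoch is 2020-09-13 12:26:40 UTC
```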
In addition, this scheme uses millisecond timestamps to record the start/end times; second-level timestamps may also be used if high precision is not required. In this scheme, the start and end time points of TTS and ASR are returned together after one round of dialogue (one sentence spoken by the robot and one by the person); they could instead be returned separately, which is more flexible but increases the number of api calls. This scheme does not consider barge-in during the TTS broadcast: when the barge-in function is enabled, TTS and ASR are called in a merged fashion and the target user can interrupt the TTS broadcast by speaking, so Lua cannot simply distinguish the TTS end time from the ASR start time; in that case, ASR must return, in addition to the recognized text, the timestamp at which recognition started, so that Lua can handle this situation. At present this function is implemented by self-developed ASR. This scheme uses Lua to determine the start/end time points directly; alternatively, a speech algorithm could post-process the recording to determine the start/end points of each segment (without splitting the recording), with the subsequent processing logic the same as here, but developing, tuning and bringing such an algorithm online is costly, takes a long cycle, and cannot guarantee 100% robustness and accuracy. The code relating to recording, storage and playback in the call center's Java application and FS components could also be modified to implement segmented recording of each voice segment in one call, but the modification is large and costly and raises the problems of compatibility with historical recordings and of storing and matching multi-segment recordings.
The whole recording could also be cut into multiple segments afterwards. If done manually, the cost is far too high to be feasible; if a voice algorithm is used for automatic processing, developing, optimizing, and deploying the algorithm is costly, takes a long time, and its robustness and accuracy cannot be guaranteed at 100%, and the resulting multi-segment recordings raise the same problems of compatibility with historical recordings and of storage and correspondence. FreeSWITCH may be replaced with other softswitch software, such as Asterisk; Lua may likewise be replaced by any scripting language that the softswitch platform supports running, such as Python.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or a combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules involved are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
Fig. 5 shows an apparatus 500 for recording voice call content according to the present embodiment, the apparatus 500 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 5, the apparatus 500 includes: a first determining module 510, configured to determine, during a voice call between the robot program and a target user, text information corresponding to a part of the audio in the voice call process, where the part of the audio includes voice broadcast by the robot program in at least one voice broadcasting period and/or voice received from the target user in at least one voice receiving period; a second determining module 520, configured to determine, according to time information corresponding to the voice of the at least one voice broadcasting period and/or the at least one voice receiving period, start time information corresponding to a start point of the text information and end time information corresponding to an end point of the text information; and a first recording module 530, configured to record the start time information and the end time information of the text information.
Optionally, the apparatus 500 further comprises: a second recording module, configured to record the call audio information during the voice call; and the first recording module 530 further comprises: a first recording submodule, configured to record, in the call audio information, a start time node and an end time node of the voice corresponding to the text information.
Optionally, the second determining module 520 includes: a determining submodule, configured to determine the start time information and the end time information of the text information according to a first start timestamp and a first end timestamp corresponding to the voice of the at least one voice broadcasting period and/or the at least one voice receiving period.
Optionally, the second recording module includes: a second recording submodule, configured to record a second start timestamp of the voice call; and a first determining submodule, configured to determine, in the call audio information, a start time node and an end time node of the voice corresponding to the text information according to the second start timestamp, the first start timestamp, and the first end timestamp.
Optionally, the apparatus 500 further comprises a module for calculating the start time node and the end time node by the following formulas: start time node = first start timestamp - second start timestamp; and end time node = first end timestamp - second start timestamp.
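As a worked illustration of these formulas (the function name is an assumption of this sketch, not taken from the disclosure), the relative position of a voice segment within the full call recording is obtained by subtracting the recording's start timestamp from the segment's own timestamps:

```python
def relative_offsets(first_start_ts, first_end_ts, second_start_ts):
    """Compute a voice segment's position inside the full call recording.

    first_start_ts / first_end_ts: millisecond timestamps at which the
    segment (a TTS broadcast or an ASR reception) started and ended.
    second_start_ts: millisecond timestamp at which the call recording began.
    Returns (start time node, end time node) in milliseconds from the
    beginning of the recording.
    """
    start_node = first_start_ts - second_start_ts
    end_node = first_end_ts - second_start_ts
    return start_node, end_node

# Example: the call recording starts at 1700000000000 ms and a segment
# runs from 1700000002500 ms to 1700000004000 ms, so the segment lies
# 2500-4000 ms into the recording.
assert relative_offsets(1700000002500, 1700000004000, 1700000000000) == (2500, 4000)
```

These offsets are what later allows a user who clicks a piece of text information to seek directly to the corresponding voice within the single, uncut call recording.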
Optionally, the apparatus 500 further includes means for implementing voice broadcasting by: a first recognition submodule, configured to perform text recognition on preset robot interaction text information using a preset speech synthesis algorithm and convert the preset robot interaction text information into speech; and a broadcasting submodule, configured to broadcast the speech through the robot program.
Optionally, the first determining module 510 includes: a second recognition submodule, configured to perform speech recognition on the voice received from the target user using a preset speech recognition algorithm and generate the text information.
Therefore, according to the present embodiment, during the voice call the computing device determines the text information of the voice in each broadcasting period or receiving period, that is, it associates the voice with the corresponding text information. The computing device may then record, for each piece of text information, the time node information of the corresponding voice in the call audio information. The time and position of the voice corresponding to the text information within the call audio information can thus be conveniently determined later through the time node information, and a user can later locate the position of the corresponding voice by clicking the text information. The method achieves the technical effects of not modifying the recording, storage, and playback code of the call center, not using an algorithm for post-processing cutting, and ensuring the stability of the system. Meanwhile, the method reduces the cost of modifying the intelligent voice system, is compatible with historical recordings, and makes the correspondence and storage of voice and text easy to realize, so that it is easier to popularize and deploy. It thereby solves the technical problems of the prior art, in which the existing ways of associating recordings with call records in intelligent conversations are costly, difficult to make compatible with historical recordings, and make the storage and correspondence of multi-segment recordings difficult.
Example 3
Fig. 6 shows an apparatus 600 for recording voice call content according to the present embodiment, the apparatus 600 corresponding to the method according to the first aspect of embodiment 1. Referring to fig. 6, the apparatus 600 includes: a processor 610; and a memory 620 coupled to the processor 610 and configured to provide the processor 610 with instructions for processing the following steps: during a voice call between the robot program and a target user, determining text information corresponding to a part of the audio in the voice call process, where the part of the audio includes voice broadcast by the robot program in at least one voice broadcasting period and/or voice received from the target user in at least one voice receiving period; determining, according to time information corresponding to the voice of the at least one voice broadcasting period and/or the at least one voice receiving period, start time information corresponding to a start point of the text information and end time information corresponding to an end point of the text information; and recording the start time information and the end time information of the text information.
Optionally, the memory 620 is further configured to provide the processor 610 with instructions to process the following processing steps: recording the call audio information during the voice call; and recording the start time information and the end time information of the text information further comprises: recording, in the call audio information, a start time node and an end time node of the voice corresponding to the text information.
Optionally, the operation of determining, according to time information corresponding to the voice of at least one voice broadcasting period and/or at least one voice receiving period, start time information corresponding to a start point of the text information and end time information corresponding to an end point of the text information includes: determining the start time information and the end time information of the text information according to a first start timestamp and a first end timestamp corresponding to the voice of the at least one voice broadcasting period and/or the at least one voice receiving period.
Optionally, the operation of recording, in the call audio information, a start time node and an end time node of the voice corresponding to the text information includes: recording a second start timestamp of the voice call; and determining, in the call audio information, the start time node and the end time node of the voice corresponding to the text information according to the second start timestamp, the first start timestamp, and the first end timestamp.
Optionally, the memory 620 is further configured to provide the processor 610 with instructions to process the following processing steps: calculating the start time node and the end time node by the following formulas: start time node = first start timestamp - second start timestamp; and end time node = first end timestamp - second start timestamp.
Optionally, the memory 620 is further configured to provide the processor 610 with instructions to process the following processing steps: implementing voice broadcasting through the following operations: performing text recognition on preset robot interaction text information using a preset speech synthesis algorithm and converting the preset robot interaction text information into speech; and broadcasting the speech through the robot program.
Optionally, the operation of determining text information corresponding to a part of audio in the voice call process includes: and performing voice recognition on the voice received from the target user by using a preset voice recognition algorithm and generating text information.
Therefore, according to the present embodiment, during the voice call the computing device determines the text information of the voice in each broadcasting period or receiving period, that is, it associates the voice with the corresponding text information. The computing device may then record, for each piece of text information, the time node information of the corresponding voice in the call audio information. The time and position of the voice corresponding to the text information within the call audio information can thus be conveniently determined later through the time node information, and a user can later locate the position of the corresponding voice by clicking the text information. The method achieves the technical effects of not modifying the recording, storage, and playback code of the call center, not using an algorithm for post-processing cutting, and ensuring the stability of the system. Meanwhile, the method reduces the cost of modifying the intelligent voice system, is compatible with historical recordings, and makes the correspondence and storage of voice and text easy to realize, so that it is easier to popularize and deploy. It thereby solves the technical problems of the prior art, in which the existing ways of associating recordings with call records in intelligent conversations are costly, difficult to make compatible with historical recordings, and make the storage and correspondence of multi-segment recordings difficult.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that various modifications and improvements can be made by those skilled in the art without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.