Detailed Description
In order to enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, the technical solutions will be clearly and completely described below with reference to the drawings in one or more embodiments of the present specification. It is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on one or more embodiments of the present specification without inventive effort shall fall within the protection scope of the present specification.
The embodiments of the present specification provide a subtitle processing method, apparatus, device, and storage medium, which are used to solve the problem of cross-language communication in a video conference scene and to meet the requirement of users to understand conference content in the video conference scene.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present specification are used, the user should be informed of the types, scope of use, usage scenarios, etc. of the personal information involved in the embodiments of the present specification in an appropriate manner according to the relevant laws and regulations, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the operation the user requests to perform will require the acquisition and use of the user's personal information. Based on the prompt information, the user may then autonomously choose whether to provide personal information to the software or hardware, such as an electronic device, an application, a server, or a storage medium, that performs the operations of the technical solution in the embodiments of the present specification.
As an optional but non-limiting implementation, the prompt information may be sent to the user by, for example, a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control allowing the user to choose "agree" or "disagree" to providing personal information to the electronic device.
It will be appreciated that the above notification and user-authorization process is merely illustrative and does not limit the implementation of the embodiments of the present specification; other ways of satisfying the relevant laws and regulations may also be applied to the implementation of the embodiments of the present specification.
Fig. 1 is a schematic flow chart of a subtitle processing method according to an embodiment of the present disclosure. As shown in fig. 1, the flow includes the following steps:
Step S102, detecting that a user enters a video conference scene, acquiring main language information of a current speaker of the video conference as first language information, and acquiring the main language information of the user as second language information;
Step S104, when the first language information is different from the second language information, obtaining the voice data of the speaker, and generating corresponding subtitles according to the voice data;
Step S106, determining the display style of the subtitle according to the historical translation related operation of the user;
Step S108, pushing the subtitle to the video conference scene of the user in real time according to the display style of the subtitle.
In the embodiments of the present disclosure, it is detected that a user enters a video conference scene; the main language information of the current speaker of the video conference is obtained as first language information, and the main language information of the user is obtained as second language information. When the first language information is different from the second language information, the voice data of the speaker is obtained and a corresponding subtitle is generated according to the voice data; the display style of the subtitle is determined according to the user's historical translation related operation, and the subtitle is pushed to the user's video conference scene in real time according to that display style, so that the subtitle is displayed in real time in the user's video conference scene. Therefore, through this embodiment, subtitles can be generated in real time according to the speaking content of the speaker in the video conference scene, the display style of the subtitles is determined according to the user's own situation, and the subtitles are displayed in real time in the user's video conference scene based on that display style, thereby solving the problem of cross-language communication in the video conference scene and meeting the user's requirement to understand the conference content.
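For readability, the following is a minimal, self-contained Python sketch of how steps S102 to S108 could be orchestrated. Every helper function here is a hypothetical placeholder standing in for the components described later in this section; none of them names a real conferencing API.

```python
# Hypothetical placeholders for the components described in the following sections.
def get_speaker_main_language(conference_id: str) -> str:
    return "en"          # placeholder: see the audio-amount analysis sketched later

def get_user_main_language(user_id: str) -> str:
    return "zh"          # placeholder: see the session-message statistics sketched later

def transcribe_speech(conference_id: str) -> str:
    return "Hello everyone, let's begin."   # placeholder for a speech-recognition result

def determine_display_style(user_id: str, first_language: str) -> str:
    return "full_translation"               # placeholder: see the decision rules sketched later

def push_subtitle(user_id: str, subtitle: str, style: str) -> None:
    print(f"[{style}] -> {user_id}: {subtitle}")   # placeholder for the real-time push

def handle_user_joining_conference(user_id: str, conference_id: str) -> None:
    first_language = get_speaker_main_language(conference_id)    # S102
    second_language = get_user_main_language(user_id)            # S102
    if first_language == second_language:
        return  # subtitles stay off when speaker and user share a main language
    subtitle = transcribe_speech(conference_id)                  # S104
    style = determine_display_style(user_id, first_language)     # S106
    push_subtitle(user_id, subtitle, style)                      # S108

handle_user_joining_conference("user-42", "conf-7")
```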
The subtitle processing method in this embodiment may be applied to and executed by a server, where the server may be a server of collaborative office software that integrates a video conference application. Alternatively, the subtitle processing method in this embodiment may be applied to and executed by a user terminal of the user, where the user terminal may be an intelligent terminal such as a mobile phone, a computer, a tablet computer, a notebook computer, a vehicle-mounted computer, or a wearable device. The user terminal may run collaborative office software that integrates the video conference application. The collaborative office software integrates a plurality of office applications such as instant messaging, cloud documents, and audio and video conferences, and can greatly improve collaborative office efficiency among staff.
In the step S102, it is detected whether the user enters the video conference scene, and if it is detected that the user clicks the button for joining the video conference, it is determined that the user enters the video conference scene. After the user is detected to enter the video conference scene, the main language information of the current speaker of the video conference is obtained as the first language information, and the main language information of the user is obtained as the second language information.
Wherein the current speaker of the video conference is the person speaking in the video conference at the current moment. The speaker of a video conference may change over time. In one embodiment, the main language information of the current speaker of the video conference is obtained, specifically:
(a1) Acquiring voice data sent by a speaker;
(a2) When the voice data corresponds to one kind of language information, determining that language information as the main language information of the speaker;
(a3) When the voice data corresponds to multiple kinds of language information, determining the audio data amount corresponding to each kind of language information in the voice data, and determining the language information with the largest corresponding audio data amount as the main language information of the speaker.
In the video conference scene, each participant can communicate by voice. Based on this, the voice data sent by the current speaker in the video conference can be obtained, and it is judged how many languages the voice data corresponds to, that is, how many languages the voice data includes. For example, voice data consisting only of a Chinese greeting includes one language, Chinese, and thus corresponds to one kind of language information, Chinese; voice data mixing an English greeting with Chinese speech includes two languages, Chinese and English, and thus corresponds to two kinds of language information, Chinese and English.
Then, if the speaker's voice data is determined to correspond to one kind of language information, that language information is determined to be the speaker's main language information. If the speaker's voice data corresponds to multiple kinds of language information, the audio data amount corresponding to each kind of language information in the voice data is determined, and the language information with the largest corresponding audio data amount is determined as the speaker's main language information.
For example, if the speaker's voice data is a sentence that starts with the English word "hello" followed by a question about today's weather spoken in Chinese, the audio data amounts corresponding to the two kinds of language information (Chinese and English) in the voice data can be analyzed, and the language information with the largest corresponding audio data amount (Chinese) can be determined as the speaker's main language information.
It can be seen that, according to the embodiment, when the voice data of the speaker corresponds to multiple languages, the main language information of the speaker can be determined according to the audio data amount corresponding to the various languages in the voice data of the speaker, so that the main language information of the speaker can be accurately determined in the video conference scene.
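As an illustration of steps (a1) to (a3), the following sketch assumes the speaker's voice data has already been segmented and labeled by language by some upstream language-identification step (the segment format is an assumption); it then sums the audio amount per language and picks the largest.

```python
from collections import defaultdict

def main_language_from_segments(segments: list[tuple[str, float]]) -> str:
    """segments: (language, seconds_of_audio) pairs detected in the speaker's voice data."""
    amount_per_language: dict[str, float] = defaultdict(float)
    for language, seconds in segments:
        amount_per_language[language] += seconds
    # A single language is trivially the main language; with several languages,
    # the one with the largest audio data amount wins.
    return max(amount_per_language, key=amount_per_language.get)

# Example: mostly Chinese speech preceded by a short English greeting -> "zh".
print(main_language_from_segments([("en", 0.6), ("zh", 7.4)]))
```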
In one example, the speaker's main language information may be determined periodically, for example every seven days, based on the voice messages and/or text messages sent by the speaker within those seven days in the applications integrated in the collaborative office software, such as the instant messaging application and the video conference application. In another example, the language of the session messages historically sent by the speaker in each application integrated in the collaborative office software, such as the instant messaging application and the video conference application, may be identified by a pre-trained machine learning model: the session messages sent by the speaker in the instant messaging application and the video conference application are input into the pre-trained machine learning model, the machine learning model outputs the language information of each session message, and the output language information is used as the language label of the session message. The language labels of the session messages sent by the speaker in each application integrated in the collaborative office software over a recent period (such as 7 days) can then be counted to determine the speaker's main language information. The session messages may include voice messages and text messages. Of course, if the speaker has set a label of the main language information for himself or herself, the label may be directly acquired to determine the speaker's main language information.
In step S102, after the speaker's main language information is determined, it is taken as the first language information. In step S102, the main language information of the user is also obtained as the second language information. In one embodiment, the user's main language information is obtained specifically as follows: the languages of the session messages sent by the user in the applications integrated in the collaborative office software, such as the instant messaging application and the video conference application, are identified by a machine learning model. For example, the session messages sent by the user in the instant messaging application and the video conference application are input into a pre-trained machine learning model, the machine learning model outputs the language information of each session message, and the output language information is used as the language label of the session message; the user's main language information can then be determined by counting the language labels of the session messages sent by the user in the applications integrated in the collaborative office software over a recent period (such as 3 months). Of course, if the user has set a label of the main language information for himself or herself, the label may be directly acquired to determine the user's main language information. In this embodiment, after the user's main language information is acquired, it is used as the second language information.
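A hedged sketch of this label-counting step follows: the language labels of session messages sent over a recent window (7 days for the speaker, 3 months for the user, per the examples above) are tallied and the most frequent label is taken as the main language information. The classify_language function below is only a crude stand-in for the pre-trained machine learning model.

```python
from collections import Counter
from datetime import datetime, timedelta

def classify_language(message: str) -> str:
    # Placeholder for the pre-trained language-identification model.
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in message) else "en"

def main_language_from_messages(messages: list[tuple[datetime, str]],
                                days: int = 7) -> str | None:
    """messages: (sent_at, text) pairs from the apps integrated in the collaborative office software."""
    cutoff = datetime.now() - timedelta(days=days)
    labels = Counter(classify_language(text)
                     for sent_at, text in messages
                     if sent_at >= cutoff)
    return labels.most_common(1)[0][0] if labels else None
```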
In step S104, it is determined whether the first language information is the same as the second language information. If so, the subtitle in the video conference is kept in a non-open state; if not, the voice data of the speaker speaking during the video conference is obtained, and the corresponding subtitle is generated according to the voice data. The voice data may be converted into the corresponding subtitle by speech recognition technology.
In step S106, the display style of the subtitle is determined according to the history translation related operation of the user. The display style of the subtitle includes displaying a translation of a part of text in the subtitle, displaying a translation of all text in the subtitle, and not displaying a translation of the subtitle.
Fig. 2 is a schematic diagram of display styles of a subtitle according to an embodiment of the present disclosure. As shown in fig. 2, taking the same sentence of an English subtitle as an example, the present embodiment provides three display styles, which are, from top to bottom, displaying a translation of part of the text in the subtitle, displaying a translation of all the text in the subtitle, and not displaying a translation of the subtitle.
A user with strong cross-language ability may only need the subtitle of the speaker of the video conference, so the display style may be determined as not displaying a translation of the subtitle. A user with some cross-language ability may need translations of the uncommon words spoken by the speaker of the video conference, so the display style may be determined as displaying a translation of part of the text in the subtitle. A user with weak cross-language ability may need both the subtitle of the speaker and the corresponding full translation, so the display style may be determined as displaying a translation of all the text in the subtitle.
In one embodiment, the display style of the subtitle is determined according to a user's history translation related operation, including at least one of:
(b1) If it is determined, according to the historical translation related operation of the user, that the user has turned on the subtitle translation function for the video conference, determining that the display style of the subtitle is displaying a translation of all text in the subtitle;
(b2) If it is determined, according to the historical translation related operation of the user, that the user has not turned on the subtitle translation function for the video conference, and the frequency with which the user has historically performed the specified translation operation is greater than or equal to a preset frequency, determining that the display style of the subtitle is displaying a translation of part of the text in the subtitle;
(b3) If it is determined, according to the historical translation related operation of the user, that the user has not turned on the subtitle translation function for the video conference, and the frequency with which the user has historically performed the specified translation operation is less than the preset frequency, determining that the display style of the subtitle is not displaying a translation of the subtitle;
wherein the specified translation operation is a translation operation for the first language information.
The historical translation related operations of the user include turning the subtitle translation function on and off for video conferences. First, whether the user has turned on the subtitle translation function in a video conference is judged according to the user's historical translation related operations. If yes, it is determined that the user has turned on the subtitle translation function for the video conference and that the user needs full translations to understand the video conference content, so the display style of the subtitle is determined as displaying a translation of all text in the subtitle.
If it is determined that the user has not turned on the subtitle translation function in a video conference, that is, the user has not turned on the subtitle translation function for the video conference, it is then judged, according to the user's historical translation related operations, whether the frequency with which the user has historically performed the specified translation operation is greater than or equal to the preset frequency. The specified translation operation is a translation operation of the user on the first language information.
In one embodiment, the video conference application in which the video conference scene is located is integrated in the collaborative office software, and the user's historical translation related operations include the translation operations historically performed by the user in each application integrated in the collaborative office software. Whether the frequency with which the user has historically performed the specified translation operation is greater than or equal to the preset frequency can therefore be determined according to the translation operations historically performed by the user in each application integrated in the collaborative office software.
Specifically, each translation operation historically performed by the user in each application (such as the video conference application, the instant messaging application, the translation application, and the like) integrated in the collaborative office software providing the video conference function may be obtained, and the specified translation operations that translate the language corresponding to the first language information may be extracted from the obtained translation operations. For example, each translation operation performed by the user in each application integrated in the collaborative office software within one week before the current time may be obtained, and the specified translation operations that translate the language corresponding to the first language information may be extracted from them. A specified translation operation may be any operation by which the user translates the language corresponding to the first language information, for example, a translation operation on the first language information performed by the user in a single chat, a group chat, or a video conference with the speaker of this video conference, a translation operation on the first language information performed by the user in a single chat, a group chat, or a video conference with other users who use the first language, or a translation operation on the first language information performed by the user in the translation application integrated in the collaborative office software.
Whether the frequency with which the user has historically performed the specified translation operation is greater than or equal to the preset frequency, for example greater than or equal to 2 times per week, is then judged. If so, it is determined that the user's ability to read the first language information is relatively weak and that the user needs partial translations in the subtitle, so the display style of the subtitle is determined as displaying a translation of part of the text in the subtitle.
Conversely, if it is determined that the user has not turned on the subtitle translation function for the video conference and the frequency with which the user has historically performed the specified translation operation is less than the preset frequency, the user is considered to have a strong ability to read the first language information, and the display style of the subtitle is determined as not displaying a translation of the subtitle.
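The decision rules (b1) to (b3) reduce to a small amount of branching. The sketch below assumes the two inputs (whether the user has turned on subtitle translation, and how often per week the user has historically performed the specified translation operation) have already been gathered from the collaborative office software; the threshold of 2 times per week mirrors the example given above.

```python
FULL_TRANSLATION = "translate_all_text"         # (b1)
PARTIAL_TRANSLATION = "translate_part_of_text"  # (b2)
NO_TRANSLATION = "no_translation"               # (b3)

def decide_display_style(translation_enabled: bool,
                         translations_per_week: float,
                         threshold_per_week: float = 2.0) -> str:
    if translation_enabled:
        return FULL_TRANSLATION          # user relies on full translations
    if translations_per_week >= threshold_per_week:
        return PARTIAL_TRANSLATION       # user often looks up this language
    return NO_TRANSLATION                # user reads the first language well

print(decide_display_style(False, 3.0))  # -> translate_part_of_text
```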
In one embodiment, if, in a video conference, the user performs a subtitle closing operation after a subtitle has been automatically opened for the user, the subtitle is not automatically opened for the user through the flow in fig. 1 in subsequent video conferences.
Therefore, according to the embodiment, the display style of the subtitle corresponding to the user can be determined in a targeted manner according to the historical translation related operation of the user, so that the display style of the subtitle is matched with the language capability of the user, and better video conference experience is brought to the user.
In one embodiment, after the display style of the subtitle is determined as displaying a translation of part of the text in the subtitle, the part of the subtitle whose translation needs to be displayed is determined as follows:
(d1) Acquiring a target word stock of a user;
(d2) Comparing the caption with words in the target word stock to determine a first target word in the target word stock in the caption;
(d3) Determining to display the translation of the first target word in the subtitle.
First, the target word stock of the user is obtained. The target word stock records the words in the first language that are uncommon for the user. Then, the subtitle is compared with the words in the target word stock one by one to determine the first target word in the subtitle that appears in the target word stock; it can be understood that the first target word is a word in the subtitle that is uncommon for the user in the first language. Finally, it is determined to display the translation of the first target word in the subtitle, where the translation is in the second language corresponding to the user.
Therefore, by comparing the subtitle with the uncommon-word stock corresponding to the user, the words in the subtitle that are uncommon for the user in the first language can be accurately determined, and their translations in the second language can be determined and displayed, which helps the user understand the content of the video conference and improves the user's communication efficiency in the video conference.
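Steps (d1) to (d3) amount to a word-by-word lookup of the subtitle against the user's target word stock. The sketch below uses a plain set and a simple tokenizer as assumptions; a real system would additionally fetch the second-language translation of each hit from its own translation service.

```python
import re

def first_target_words(subtitle: str, target_word_stock: set[str]) -> list[str]:
    """Return the words in the subtitle that appear in the user's target word stock."""
    words = re.findall(r"[A-Za-z']+", subtitle.lower())
    return [word for word in words if word in target_word_stock]

target_word_stock = {"ubiquitous", "ephemeral"}            # illustrative uncommon words
subtitle = "The ephemeral nature of these ubiquitous signals matters."
print(first_target_words(subtitle, target_word_stock))     # -> ['ephemeral', 'ubiquitous']
```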
In one embodiment, the target word stock of the user is obtained, specifically:
(d11) Acquiring a first word stock which is established in advance, and taking the first word stock as a target word stock of a user;
or
(d12) Determining, from a plurality of pre-established second word stocks, a word stock matching the language features of the user according to the historical translation related operation of the user, and using the matched word stock as the target word stock of the user;
or
(d13) Determining, according to the historical translation related operation of the user, the words historically translated by the user, acquiring a pre-established third word stock, and adding the historically translated words into the third word stock to obtain the target word stock of the user.
In the first case, a first word stock is established in advance. The first word stock is a general uncommon-word stock applicable to any user; for example, it may be a word stock containing words of a first level (inclusive) and above in the English field. This first word stock is used as the target word stock of the user.
In another case, according to the translation operations performed by the user in each application integrated in the collaborative office software (which are included in the user's historical translation related operations), the level to which the words historically translated by the user belong is determined, and a word stock matching that level is selected from a plurality of pre-established second word stocks as the word stock matching the language features of the user, which is then used as the target word stock of the user. For example, if the words historically translated by the user belong to the first level of the English field and levels above it, a word stock containing the words of the first level of the English field and levels above it is selected from the plurality of pre-established second word stocks as the target word stock of the user.
In yet another case, the words historically translated by the user are determined based on the translation operations historically performed by the user in the respective applications integrated in the collaborative office software. A pre-established third word stock is then obtained; the third word stock is also a general uncommon-word stock applicable to any user and may be the same as the first word stock. The words historically translated by the user are added to the third word stock to obtain the target word stock of the user.
It can be seen that, in this embodiment, in one case a word stock matching the level of the words historically translated by the user may be selected as the target word stock of the user, and in another case the words historically translated by the user may be added to a general word stock to form a personalized uncommon-word stock for the user. A personalized uncommon-word stock is thus obtained in various ways, so that when the translation of part of the text in the subtitle is determined, the words unfamiliar to the user can be accurately identified and translated.
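Of the three ways above, the third (d13) is the simplest to sketch: a general uncommon-word stock is merged with the words the user has historically translated, yielding a personalized target word stock. The set-based representation is an assumption.

```python
def build_personalized_word_stock(general_word_stock: set[str],
                                  historically_translated_words: set[str]) -> set[str]:
    """(d13): add the user's historically translated words to a general uncommon-word stock."""
    return set(general_word_stock) | set(historically_translated_words)

general = {"ubiquitous", "ephemeral"}      # pre-established third word stock (illustrative)
history = {"synergy", "ephemeral"}         # words the user has historically translated
print(sorted(build_personalized_word_stock(general, history)))
# -> ['ephemeral', 'synergy', 'ubiquitous']
```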
In one embodiment, on the basis of the above process, a second target word in the user's target word stock whose number of translations in the user's video conference scenes has reached a preset threshold is determined, and the second target word is marked as a word for which translation is paused in the user's video conference scene. Because the second target word has been translated many times, it is presumed that the user has mastered the word and its translation; the second target word is therefore marked as a word whose translation is paused in the user's video conference scene, so that its translation is no longer provided to the user, which saves system workload.
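A sketch of this pausing mechanism follows: each time a word's translation is shown to the user in a conference, a counter is incremented, and once the counter reaches the preset threshold the word is marked so that its translation is no longer provided. The threshold value of 5 is an assumption.

```python
from collections import Counter

class TranslationPauseTracker:
    """Marks 'second target words' whose translations have been shown often enough to pause."""

    def __init__(self, pause_threshold: int = 5):
        self.pause_threshold = pause_threshold
        self.translation_counts: Counter[str] = Counter()
        self.paused_words: set[str] = set()

    def record_translation(self, word: str) -> None:
        self.translation_counts[word] += 1
        if self.translation_counts[word] >= self.pause_threshold:
            self.paused_words.add(word)   # presumed mastered: pause its translation

    def should_translate(self, word: str) -> bool:
        return word not in self.paused_words
```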
In step S108, after the display style of the subtitle is determined, the subtitle is pushed to the user's video conference scene in real time according to that display style, so that the subtitle is displayed to the user in real time in the video conference scene and helps the user understand the conference content. Pushing the subtitle to the user's video conference scene in real time means pushing the subtitle to the interface of the user's video conference so that the subtitle is displayed in that interface.
Fig. 3a is a schematic view of a subtitle being opened in a video conference according to an embodiment of the present disclosure. As shown in fig. 3a, the subtitle may be automatically opened for the user in the video conference and its display style determined automatically; fig. 3a illustrates the example in which the display style is displaying all translations.
Fig. 3b is a schematic view of a scene of a subtitle opened in a video conference according to another embodiment of the present disclosure, corresponding to fig. 3a, and in fig. 3b, after the subtitle is opened, all translations of the subtitle may be displayed to a user in real time.
Therefore, according to this embodiment, the subtitle of the video conference can be automatically opened for the user in the video conference scene, which improves the user's video conference experience. Before this embodiment is executed, the user may be asked whether to agree to authorize the automatic opening of the video conference subtitle recommendation function, and the user may be notified that, when the function is automatically started, the user's necessary personal information needs to be acquired to help analyze the display style of the subtitle when it is opened. If the user clicks to agree to automatically start the video conference subtitle recommendation function, the subtitle can be automatically opened for the user in the video conference scene and its display style determined through the above process, without violating the user's privacy.
In one embodiment, if the display style of the subtitle is determined as displaying a translation of part of the text in the subtitle, then pushing the subtitle to the user's video conference scene in real time according to that display style may mean displaying the subtitle, together with the translation of part of its text, in a subtitle display area in the user's video conference scene. If the same word to be translated appears at least twice in the subtitle currently shown in the subtitle display area, the translation is displayed only for the occurrence of the word in the frontmost position in the subtitle.
Specifically, when the subtitle and the translation of part of its text are displayed to the user, if a word to be translated appears multiple times in the subtitle shown in the subtitle display area, the occurrence in the frontmost position among all occurrences in the currently displayed subtitle is determined, the translation is displayed for that occurrence, and no translation is displayed for the later occurrences, which saves space in the subtitle display area. The frontmost occurrence is the one spoken first in the speaker's speaking order, that is, the one located furthest to the left in the left-to-right order of the currently displayed subtitle text.
For example, if the subtitle currently displayed in the subtitle display area of a video conference scene is "apple, pear, apple", where "apple" is a word to be translated and appears twice, only the first occurrence of "apple" is followed by its translation and the second occurrence is shown without one, so the subtitle with its translation reads "apple (translation of apple), pear, apple".
It should be noted that the subtitle displayed in the subtitle display area of the video conference scene changes as the speaker continues to speak. If, according to the above procedure, a word to be translated appears twice in the currently displayed subtitle and the translation of its frontmost occurrence has been displayed, and the subtitle in the display area then changes and the word appears again in the changed subtitle, the translation of its frontmost occurrence is displayed again according to the above procedure, so that the user can still know the meaning of the word in the switched subtitle.
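The first-occurrence rule can be sketched as follows: within the words of the currently displayed subtitle, only the frontmost occurrence of each word to be translated gets its translation attached, and the bookkeeping is reset whenever the displayed subtitle switches. The rendering format and the translation mapping are assumptions for illustration.

```python
def render_with_first_occurrence_translation(words: list[str],
                                             translations: dict[str, str]) -> str:
    already_translated: set[str] = set()   # reset each time the displayed subtitle changes
    rendered = []
    for word in words:
        if word in translations and word not in already_translated:
            rendered.append(f"{word} ({translations[word]})")   # frontmost occurrence
            already_translated.add(word)
        else:
            rendered.append(word)                               # later occurrences: no translation
    return ", ".join(rendered)

# Mirrors the "apple, pear, apple" example above.
print(render_with_first_occurrence_translation(
    ["apple", "pear", "apple"], {"apple": "translation of apple"}))
# -> apple (translation of apple), pear, apple
```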
In one embodiment, the conference caption of the video conference may also be displayed in the video conference scene of the user in response to the caption opening operation of the user, and the translation of the third target word is displayed in real time in the video conference scene in response to the translation operation of the user for the third target word in the conference caption.
The conference subtitle here is the subtitle generated from the speaker's voice data as described above; it can be generated from the speaker's voice data by speech recognition technology according to the foregoing flow. Specifically, the process in fig. 1 automatically opens the subtitle for the user and determines its display style; in other embodiments, the user may also open the subtitle manually. After the user manually opens the subtitle, it displays no translation by default, and the user may trigger a translation of a third target word in the subtitle, so that the translation of the third target word is displayed in the video conference scene, thereby providing the translation function the user needs according to the user's operations in the video conference scene.
In summary, through the embodiment, the subtitle can be generated in real time according to the speaking content of the speaker in the video conference scene, the display style of the subtitle is determined according to the user condition, and the subtitle is displayed in real time in the video conference scene of the user based on the display style of the subtitle, so that the problem of cross-language communication in the video conference scene is solved, and the requirement of the user for knowing the conference content in the video conference scene is met.
Fig. 4 is a schematic structural diagram of a caption processing device according to an embodiment of the present disclosure. As shown in fig. 4, the device includes:
An information obtaining unit 41, configured to detect that a user enters a video conference scene, obtain main language information of a current speaker of the video conference as first language information, and obtain main language information of the user as second language information;
A caption generating unit 42, configured to obtain voice data of the speaker when the first language information is different from the second language information, and generate a corresponding caption according to the voice data;
a style determining unit 43 for determining a display style of the subtitle according to the history translation related operation of the user;
and a caption display unit 44, configured to push the caption to the video conference scene of the user in real time according to the display style of the caption.
Optionally, the display style of the subtitle includes displaying a translation of a part of text in the subtitle, displaying a translation of all text in the subtitle, and not displaying a translation of the subtitle.
Optionally, the information obtaining unit 41 is specifically configured to:
Acquiring voice data of the speaker;
When the voice data corresponds to one language information, determining the corresponding language information as main language information of the speaker;
when the voice data corresponds to multiple kinds of language information, determining the audio data amount corresponding to each kind of language information in the voice data, and determining the language information with the largest corresponding audio data amount as the main language information of the speaker.
Optionally, the style determining unit 43 specifically includes at least one of the following:
If the user is determined to start the subtitle translation function for the video conference according to the historical translation related operation of the user, determining the display style of the subtitle as a translation for displaying all texts in the subtitle;
If the user does not start the subtitle translation function for the video conference according to the historical translation related operation of the user, and the frequency of executing the specified translation operation by the user history is greater than or equal to a preset frequency, determining that the display style of the subtitle is a translation for displaying part of text in the subtitle;
If the user is determined not to start the subtitle translation function for the video conference according to the historical translation related operation of the user, and the frequency of executing the specified translation operation by the user history is smaller than the preset frequency, determining that the display style of the subtitle is not to display the translation of the subtitle;
The specified translation operation is a translation operation aiming at the first language information.
Optionally, the video conference application in which the video conference scene is located is integrated in collaborative office software, and the device further comprises a judging unit configured to judge, according to the translation operations historically performed by the user in each application integrated in the collaborative office software, whether the frequency with which the user has historically performed the specified translation operation is greater than or equal to the preset frequency.
Optionally, the method further comprises a word stock comparison unit for:
After determining that the display style of the subtitle is a translation of a part of text in the subtitle, acquiring a target word stock of the user;
comparing the caption with words in the target word stock to determine a first target word in the target word stock in the caption;
and determining the translation of the first target word in the caption.
Optionally, the word stock comparison unit is specifically configured to:
acquiring a first word stock which is established in advance, and taking the first word stock as a target word stock of the user;
Or alternatively
Determining a word stock matched with the language characteristics of the user from a plurality of pre-established second word stocks according to the historical translation related operation of the user, and taking the matched word stock as a target word stock of the user;
Or alternatively
And determining the words which are historically translated by the user according to the historical translation related operation of the user, and adding the words which are historically translated by the user into a third word stock which is pre-established to obtain a target word stock of the user.
Optionally, the device further comprises a marking unit for:
determining a second target word of which the translation times in the video conference scene of the user reach a preset time threshold value in the target word library;
marking the second target word as a word that pauses translation in the user's video conferencing scene.
Optionally, if it is determined that the display style of the subtitle is a translation for displaying a part of text in the subtitle, the subtitle display unit 44 is specifically configured to:
displaying the subtitles and the translations of partial texts in the subtitles in a subtitle display area in a video conference scene of the user;
and if the same word to be translated appears at least twice in the caption currently displayed in the caption display area, displaying the translation of the word to be translated in the caption with the front position.
Optionally, the device further comprises a manual translation unit for:
responding to the subtitle opening operation of the user, and displaying conference subtitles of the video conference in a video conference scene of the user;
and responding to the translation operation of the user for the third target word in the conference subtitle, and displaying the translation of the third target word in the video conference scene in real time.
Note that, the subtitle processing apparatus in this embodiment may implement the processes of the foregoing embodiments of the subtitle processing method, and achieve the same effects and functions, which are not repeated here.
An embodiment of the present disclosure further provides a subtitle processing apparatus. Fig. 5 is a schematic structural diagram of the subtitle processing apparatus provided in an embodiment of the present disclosure. As shown in fig. 5, the subtitle processing apparatus may vary considerably depending on configuration or performance, and may include one or more processors 801 and a memory 802, where the memory 802 may store one or more applications or data. The memory 802 may be transient storage or persistent storage. The application program stored in the memory 802 may include one or more modules (not shown in the figure), each of which may include a series of computer-executable instructions for the caption processing device. Further, the processor 801 may be configured to communicate with the memory 802 and execute the series of computer-executable instructions in the memory 802 on the caption processing device. The caption processing device may also include one or more power supplies 803, one or more wired or wireless network interfaces 804, one or more input or output interfaces 805, one or more keyboards 806, etc.
The caption processing device may be connected to a network, such as the internet or a local area network, through a wired or wireless network interface 804, to communicate with other devices in the network. The caption processing device may receive data input from or output data to an external device through the input or output interface 805, and in one example, the input or output interface 805 includes, but is not limited to, a touch display screen. Through the keypad 806, the caption processing device may receive information input by a user and perform corresponding processing.
In a specific embodiment, the caption processing device may be a server or a user terminal, including a processor, and a memory configured to store computer executable instructions that, when executed, cause the processor to implement the following:
Detecting that a user enters a video conference scene, acquiring main language information of a current speaker of the video conference as first language information, and acquiring main language information of the user as second language information;
when the first language information is different from the second language information, acquiring voice data of the speaker, and generating corresponding subtitles according to the voice data;
Determining the display style of the caption according to the history translation related operation of the user;
And pushing the caption to the video conference scene of the user in real time according to the display style of the caption.
Note that, the subtitle processing apparatus in this embodiment may implement the processes of the foregoing embodiments of the subtitle processing method, and achieve the same effects and functions, which are not repeated here.
Another embodiment of the present specification also provides a computer-readable storage medium for storing computer-executable instructions that, when executed by a processor, implement the following flow:
Detecting that a user enters a video conference scene, acquiring main language information of a current speaker of the video conference as first language information, and acquiring main language information of the user as second language information;
when the first language information is different from the second language information, acquiring voice data of the speaker, and generating corresponding subtitles according to the voice data;
Determining the display style of the caption according to the history translation related operation of the user;
And pushing the caption to the video conference scene of the user in real time according to the display style of the caption.
The storage medium in this embodiment may implement the processes of the foregoing embodiments of the subtitle processing method, and achieve the same effects and functions, which are not repeated here.
The computer-readable storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD) (e.g., a Field Programmable Gate Array (FPGA)) is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development and writing; the original code to be compiled must also be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely programming the method flow slightly into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, Application-Specific Integrated Circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each unit may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present specification.
One skilled in the relevant art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is by way of example only and is not intended to limit the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present document are intended to be included within the scope of the claims of the present document.