
WO2023219556A1 - A system and method to manage a plurality of language audio streams - Google Patents


Info

Publication number
WO2023219556A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
stream
streams
user device
translated
Application number
PCT/SG2022/050321
Other languages
French (fr)
Inventor
Peng SONG
Original Assignee
Song Peng
Application filed by Song Peng filed Critical Song Peng
Priority to PCT/SG2022/050321 priority Critical patent/WO2023219556A1/en
Publication of WO2023219556A1 publication Critical patent/WO2023219556A1/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4852End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4856End-user interface for client configuration for language selection, e.g. for the menu or subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/152Multipoint control units therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4396Processing of audio elementary streams by muting the audio signal

Definitions

  • the user device 320 of any of the examples herein may be a laptop computer or a desktop computer, being configured with a capability to access the internet and/or download and operate web browsers, while being connectable to the communications network 350.
  • the user device 320 should be able to run the graphical user interfaces of methods 100/200 when the methods are being carried out.
  • the user device 320 includes the following components in electronic communication via a bus 411: non-volatile memory 403; random access memory (RAM); and a transceiver component 405 that includes a transceiver(s).
  • software 409 is stored in the non-volatile memory 403 to enable the user device 320 to operate a web browser. Once the user device 320 is able to operate the web browser, plug-ins can then be enabled to enable the carrying out of the methods 100/200.
  • FIG 4 is not intended to be a hardware diagram; thus many of the components depicted in FIG 4 may be realized by common constructs or distributed among additional physical components. Moreover, it is contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to FIG 4.
  • the central server 360 is a hardware and software suite comprising preprogrammed logic, algorithms and other means of processing incoming information, in order to send out information which is useful to the objective of the system 300 in which the central server 360 resides.
  • hardware which can be used by the central server 360 will be described briefly herein.
  • the central server 360 can broadly comprise a database which stores pertinent information, and processes information packets from the user devices 320.
  • the central administrator, which provides the translation service for online meetings/events, runs the central server 360.
  • the central server 360 can be operated from a commercially hosted service such as, for example, Amazon Web Services, Facebook Cloud and so forth.
  • the central server 360 is represented in a form as shown in FIG 5.
  • the central server 360 is in communication with a communications network 350, as shown in FIG 3.
  • the central server 360 is able to communicate with the user devices 320, the content generators 380 and/or other processing devices, as required, over the communications network 350.
  • the user devices 320 communicate via a direct communication channel (LAN or WIFI) with the central server 360.
  • the components of the central server 360 can be configured in a variety of ways.
  • the components can be implemented entirely by software to be executed on standard computer server hardware, which may comprise one hardware unit or different computer hardware units distributed over various locations, some of which may require the communications network 350 for communication.
  • the central server 360 is a commercially available computer system based on a 32 bit or a 64 bit Intel architecture, and the processes and/or methods executed or performed by the central server 360 are implemented in the form of programming instructions of one or more software components or modules 502 stored on non-volatile computer-readable storage 503 associated with the central server 360.
  • the server 360 includes at least one or more of the following standard, commercially available computer components, all interconnected by a bus 505: random access memory (RAM); and a central processing unit (CPU).
  • FIG 5 is not intended to be a hardware diagram; thus many of the components depicted in FIG 5 may be realized by common constructs or distributed among additional physical components. Moreover, it is contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to FIG 5.
  • each content generator 380 should be capable of providing at least one audio recording stream, so as to ensure that there is content to be translated.
  • each content generator 380 should use a device which is configured to carry out at least the following tasks:
  • the system 300 enables the methods 100/200 to be carried out in a desired manner as described in earlier paragraphs. However, it should also be noted that the methods 100/200 need not be carried out only using the system 300, although the system 300 is able to enable the same advantages.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a system and method to manage a plurality of language audio streams. The system and method broadly provide users with an interface to optimise the output from a plurality of language audio streams. Optimising the output includes being able to calibrate respective volume levels of the plurality of language audio streams. This provides advantages such as, for example, discerning verbal cues of a speaker/performer, auditory hearing comfort for the user, and reduction of destructive interference between the multiple audio streams, leading to better understanding by the user.

Description

A SYSTEM AND METHOD TO MANAGE A PLURALITY OF LANGUAGE AUDIO STREAMS
Field of the Invention
The present invention relates to a system and method to manage a plurality of language audio streams, particularly during a live translation instance for a video stream.
Background
The world has recently embraced online meetings and events as a viable alternative to in-person meetings and events. The Covid-19 pandemic has substantially hastened the digitization of events and meetings, and correspondingly, the audience reach is no longer constrained by geography, with language typically being the only barrier to effective communication.
Currently, many online meetings and event platforms are based on WebRTC protocols to enable browser-based versions of the platforms. However, many of such platforms do not include built-in simultaneous translation capabilities to facilitate multilingual communication.
In many instances, machine implemented translations are less desirable than human translations, for example, in relation to contextual nuances. Thus, the need for human simultaneous interpretation for online meetings has increased substantially.
However, a system and method to provide a desirable experience and functionality for users when delivering human translations on WebRTC platforms is currently lacking.
Summary
In a first aspect, there is provided a system to manage a plurality of language audio streams, the system comprising at least one data processing device configured to: transmit, from a content generator, an original language audio stream; receive, at the user device, the original language audio stream and at least two translated language streams; activate, at the user device, a browser extension; select, at the user device, an “on” state for a first translated language stream; toggle, at the user device, an “off” state for at least a second translated language stream; and adjust, at the user device, a volume level of the original language stream and the first translated language stream.
In a second aspect, there is provided a data processor implemented method to manage a plurality of language audio streams, the method comprising: transmitting, from a content generator, an original language audio stream; receiving, at the user device, the original language audio stream and at least two translated language streams; activating, at the user device, a browser extension; selecting, at the user device, an “on” state for a first translated language stream; toggling, at the user device, an “off” state for at least a second translated language stream; and adjusting, at the user device, a volume level of the original language stream and the first translated language stream.
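The selecting, toggling and volume-adjusting steps recited in the two aspects above can be sketched as plain stream-state logic. The following is a minimal illustrative sketch, not the patented implementation; the class name, method names and channel labels are all assumptions introduced for illustration.

```javascript
// Hypothetical sketch of the stream-selection logic recited above; all
// identifiers are illustrative, not taken from the patent.
class LanguageStreamManager {
  constructor(translatedChannels) {
    // Every translated language stream starts in the "off" state.
    this.states = new Map(translatedChannels.map((c) => [c, "off"]));
    // Independent volume levels for the floor stream and the active translation.
    this.volumes = { floor: 1.0, translation: 1.0 };
  }

  // Selecting a channel switches it "on" and toggles every other channel
  // "off", so at most one translated stream is consumable at a time.
  select(channel) {
    if (!this.states.has(channel)) throw new Error(`unknown channel: ${channel}`);
    for (const c of this.states.keys()) {
      this.states.set(c, c === channel ? "on" : "off");
    }
  }

  // Channels in the "off" state remain available for selection.
  selectable() {
    return [...this.states].filter(([, s]) => s === "off").map(([c]) => c);
  }

  setVolume(stream, level) {
    // Clamp to a 0..1 range, as a volume slider in a GUI would.
    this.volumes[stream] = Math.min(1, Math.max(0, level));
  }
}
```

In a browser extension these states and levels would be bound to the graphical user interface described later; here the logic is kept pure so the on/off invariant is easy to see.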
It will be appreciated that the broad forms of the invention and their respective features can be used in conjunction, interchangeably and/or independently, and reference to separate broad forms is not intended to be limiting.
Brief Description of the Drawings
A non-limiting example of the present invention will now be described with reference to the accompanying drawings, in which:
FIG 1 provides a schematic view of a first embodiment of the present invention;
FIG 2 provides a schematic view of a second embodiment of the present invention;
FIG 3 provides a schematic view of an example of a system for managing a plurality of language audio streams;
FIG 4 is a schematic diagram showing components of an example user device of the system shown in FIG 3; and
FIG 5 is a schematic diagram showing components of an example central server shown in FIG 3.
Detailed Description
The present invention provides a system and method to manage a plurality of language audio streams. The system and method broadly provide users with an interface to optimise the output from a plurality of language audio streams. Optimising the output includes being able to calibrate respective volume levels of the plurality of language audio streams. This provides advantages such as, for example, discerning verbal cues of a speaker/performer, auditory hearing comfort for the user, and reduction of destructive interference between the multiple audio streams, leading to better understanding by the user.
In the present method, browser extension (add-on) based approaches are implemented to enable any online meeting and event platform on desktop and laptop computers (either Windows or Mac) with on-demand delivery of simultaneous interpretation by human interpreters. There are typically two types of video conference or live streaming platforms that can be enabled in browsers (Chrome, Firefox or Microsoft Edge, etc.), namely those which rely on:
1) a hidden media stream that cannot be controlled, such as Zoom, Webex, and so forth; and
2) an exposed media stream that can be controlled, such as Teams, Google Meet, YouTube Live, and so forth.
The method and system of the present invention are applicable to both types of platforms, correspondingly offering the same advantages.
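The two platform categories above lend themselves to a simple lookup that decides which handling path applies. The mapping and function name below are illustrative assumptions for a sketch, grounded only in the platform examples named in this description.

```javascript
// Illustrative mapping of the two platform categories described above; the
// platform list and function name are assumptions, not part of the patent.
const PLATFORM_STREAM_TYPES = {
  zoom: "hidden",          // media stream cannot be controlled in the browser
  webex: "hidden",
  teams: "exposed",        // media stream can be controlled in the browser
  "google meet": "exposed",
  "youtube live": "exposed",
};

// Returns which handling path a given platform would take: "hidden" (the
// first embodiment) or "exposed" (the second embodiment). Unknown platforms
// default to the hidden path, since opening a separate translation tab works
// for both types.
function streamTypeFor(platform) {
  return PLATFORM_STREAM_TYPES[platform.toLowerCase()] ?? "hidden";
}
```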
An example of a broad overview of a first embodiment of the present invention, which is applicable to the instance of hidden media streams from video conference/live streaming platforms, is shown in FIG 1.
A method 100 for managing a plurality of audio language streams is provided. As a language stream of the meeting/event is hidden, there is provided a browser extension (add-on) to open a new tab with interpretation audio streams.
At step 105, there are various language streams for consumption. In this example, the original language (floor) stream and three other language streams are available for selection by the user. Even though it is envisaged that the translation for each language stream is carried out by human translators, the present invention does not preclude the use of machine translators for any of the language streams.
Subsequently, at step 110, the user is able to toggle a desired language stream between an “on” state and an “off” state. In the “on” state, that language stream will be consumable by the user. In the “off” state, that language stream will not be consumable by the user. However, the state of the language stream is dependent on a selection of the user at step 115. At step 115, the desired language stream in the “on” state will be the language stream being consumed and not available for selection, while the other language streams in the “off” state which are not being consumed are then available for selection. FIG 1 shows the plurality of language streams being “channel 1”, “channel 2” and “channel 3”. It should be appreciated that the plurality of language streams can be indicated differently as well.
It should be appreciated that the method 100 enables seamless real-time switching of channels between floor (original meeting audio) and a plurality of translations.
During use of the method 100, for example, a translation streaming URL can be provided to a browser extension, which enables opening of a new tab. This new tab can include a graphical user interface that is configured to provide, for example:
- an audio volume display;
- a channel/language selection list; and
- an exit selector.
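The tab-opening step described above can be sketched as follows. The URL shape, query parameter and function names are hypothetical assumptions; the sketch only illustrates how a browser extension could pass a translation streaming URL and a chosen channel into a new tab.

```javascript
// Hypothetical sketch of how a browser extension could open the translation
// tab of method 100. The "channel" query parameter is an illustrative choice.
function buildTranslationTabUrl(streamingUrl, channel) {
  const url = new URL(streamingUrl);
  url.searchParams.set("channel", channel);
  return url.toString();
}

function openTranslationTab(streamingUrl, channel) {
  const tabUrl = buildTranslationTabUrl(streamingUrl, channel);
  // In a real extension this would run in the background script; the guard
  // lets the pure URL logic above be exercised outside a browser as well.
  if (typeof chrome !== "undefined" && chrome.tabs) {
    chrome.tabs.create({ url: tabUrl });
  }
  return tabUrl;
}
```

The graphical user interface listed above (volume display, channel list, exit selector) would then live in the page loaded into that tab.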
Once the channel/language is selected from the graphical user interface in the tab, the user then returns to the meeting/event video, and proceeds to appreciate the meeting/event in the desired language.
For the purpose of illustration, it is assumed that the method 100 can be performed at least in part amongst one or more data processing devices such as, for example, a laptop, a desktop computer, a central server, or the like. Typically, the central server will be configured to carry out a majority of the processing tasks given the processing load required by the method 100.
Referring to FIG 2, there is shown an example of a second embodiment of the present invention which is applicable to the instance of exposed media streams from video conference/live streaming platforms.
A method 200 for managing a plurality of audio language streams is provided. As a language stream of the meeting/event is exposed, there is also provided a browser extension (add-on) to open a new tab with translation audio streams, and the browser extension being configured to provide a graphical user interface.
At step 205, there are various language streams available for consumption. In this example, the original language (floor) stream and three other language streams are available for selection by the user. Even though it is envisaged that the translation for each language stream is carried out by human translators, the present invention does not preclude the use of machine translators for any of the language streams.
Subsequently, at step 210, the user is able to toggle a desired language stream between an “on” state and an “off” state. In the “on” state, that language stream will be consumable by the user. In the “off” state, that language stream will not be consumable by the user. However, the state of the language stream is dependent on a selection of the user at step 215.
At step 215, the desired language stream in the “on” state will be the language stream being consumed and not available for selection, while the other language streams in the “off” state which are not being consumed are then available for selection. FIG 2 shows the plurality of language streams being “channel 1”, “channel 2” and “channel 3”. It should be appreciated that the plurality of language streams can be indicated differently as well.
At step 220, the exposed language streams enable volume control of the respective streams. This allows the user to listen to an actual language of the meeting/event together with a desired translation language, each at a different respective volume level.
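The concurrent-volume behaviour of step 220 can be sketched as two independent levels, one for the floor audio and one for the chosen translation. This is an illustrative sketch with assumed names; in a browser the resulting values would drive the `volume` property of two audio elements (or two Web Audio gain nodes), but the logic is kept pure here.

```javascript
// Illustrative sketch of step 220: independent, clamped volume levels for
// the floor stream and the selected translation stream (names assumed).
function mixLevels({ floor, translation }) {
  const clamp = (v) => Math.min(1, Math.max(0, v));
  return { floor: clamp(floor), translation: clamp(translation) };
}

// For example, a user may keep the floor audible at a low level to catch the
// speaker's verbal cues while listening to the translation at full volume.
const levels = mixLevels({ floor: 0.2, translation: 1.0 });
```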
It should be appreciated that the method 200 enables seamless concurrent real-time consumption of channels between floor (original meeting audio) and a plurality of translations.
During use of the method 200, for example, translated language streams are transmitted (via WebRTC) separately from the meeting stream, and shown in a semitransparent movable pop-up frame window floating on top of the meeting/event video, with a graphical user interface that is configured to provide, for example:
- a language/channel selection list;
- language audio penetration functionality;
- volume control functionality for all language streams;
- refresh connector toggle; and
- an exit selector.
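The state behind the pop-up frame’s controls listed above can be sketched as follows. The shape and names (PopupState, connectorNonce, and so forth) are illustrative assumptions, not from the source: the “penetration” flag models floor audio passing through alongside the translation, and the refresh connector toggle is modelled as a counter that a hypothetical WebRTC layer would watch in order to tear down and re-establish its streams.

```typescript
interface PopupState {
  channels: string[];              // language/channel selection list
  activeChannel: string | null;    // translated stream currently consumed, if any
  penetration: boolean;            // floor audio audible together with translation
  volumes: Record<string, number>; // per-stream volume level, 0.0 to 1.0
  connectorNonce: number;          // bumped by the refresh connector toggle
  exited: boolean;                 // set by the exit selector
}

function initialPopupState(channels: string[]): PopupState {
  // Every stream, including the floor, starts at full volume.
  const volumes: Record<string, number> = { floor: 1.0 };
  for (const c of channels) volumes[c] = 1.0;
  return {
    channels,
    activeChannel: null,
    penetration: false,
    volumes,
    connectorNonce: 0,
    exited: false,
  };
}

// The refresh connector toggle bumps the nonce without touching other state.
function refreshConnector(s: PopupState): PopupState {
  return { ...s, connectorNonce: s.connectorNonce + 1 };
}
```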
Once the channel/language is selected from the graphical user interface in the tab, the user then returns to the meeting/event video and proceeds to appreciate the meeting/event in the desired language and, if desired, concurrently with the original language of the meeting/event. Consuming the meeting/event video in the original language concurrently with a translated language provides advantages such as, for example, discerning the verbal cues of a speaker/performer, auditory comfort for the user, and reduction of destructive interference between the multiple audio streams, leading to better understanding by the user.
It should be appreciated that the flexibility provided by the volume control of the respective language audio streams is a long-desired feature in relation to online meeting/event videos. This capability contributes substantially towards the appreciation of online meeting/event videos, and can also enhance user engagement. This is important in an era when online live selling and online auctions are growing in popularity. During online live selling and online auctions, enhanced user engagement leads directly to the beneficial consequence of improved sales and increased revenue.
For the purpose of illustration, it is assumed that the method 200 can be performed at least in part amongst one or more data processing devices such as, for example, a laptop, a desktop computer, a central server, or the like. Typically, the central server will be configured to carry out a majority of the processing tasks given the processing load required by the method 200.
An example of a system 300 to manage a plurality of language audio streams will now be described with reference to FIG 3.
In this example, the system 300 includes one or more user devices 320, a communications network 350, one or more content generators 380 (for example, a broadcaster, a live auctioneer, an event provider and so forth, who need not be based at the same physical location), and a central server 360 (e.g. a central administrator providing a translation service for online meetings/events). The one or more user devices 320 and the one or more content generators 380 communicate with the central server 360 via the communications network 350. The communications network 350 can be of any appropriate form, such as the Internet and/or a number of local area networks (LANs). Further details of the respective components of the system 300 will be provided in a following portion of the description. It will be appreciated that the configuration shown in FIG 3 is for the purpose of illustration only and is not limiting.
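The topology of FIG 3 can be sketched as a relay: content generators publish language streams to the central server, and user devices subscribe to the channels they want. The in-memory CentralRelay class below is a hypothetical stand-in for the central server 360; a real deployment would route chunks over the communications network 350 (for example via WebRTC, as mentioned for the method 200), and the names used here are assumptions.

```typescript
type StreamChunk = { channel: string; payload: string };

class CentralRelay {
  // Maps each channel name to the callbacks of its subscribed user devices.
  private subscribers: Map<string, Array<(chunk: StreamChunk) => void>> =
    new Map();

  // A user device registers interest in one channel (floor or a translation).
  subscribe(channel: string, onChunk: (chunk: StreamChunk) => void): void {
    const list = this.subscribers.get(channel) ?? [];
    list.push(onChunk);
    this.subscribers.set(channel, list);
  }

  // A content generator (or translator) publishes audio for its channel;
  // the return value is the number of user devices that received the chunk.
  publish(chunk: StreamChunk): number {
    const list = this.subscribers.get(chunk.channel) ?? [];
    for (const cb of list) cb(chunk);
    return list.length;
  }
}
```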
User Device 320
The user device 320 of any of the examples herein may be a laptop computer or a desktop computer, being configured with a capability to access the internet and/or download and operate web browsers, while being connectable to the communications network 350. The user device 320 should be able to run the graphical user interfaces of methods 100/200 when the methods are being carried out.
An exemplary embodiment of the user device 320 is shown in FIG 4. As shown, the user device 320 includes the following components in electronic communication via a bus 411:
1. a display 402;
2. non-volatile memory 403;
3. random access memory ("RAM") 404;
4. data processor(s) 401;
5. a transceiver component 405 that includes one or more transceivers;
6. an image capture module 410; and
7. input controls 407.
In some embodiments, software 409 is stored in the non-volatile memory 403 to enable the user device 320 to operate a web browser. Once the user device 320 is able to operate the web browser, plug-ins can then be installed to enable the carrying out of the methods 100/200.
Although the components depicted in FIG 4 represent physical components, FIG 4 is not intended to be a hardware diagram; thus many of the components depicted in FIG 4 may be realized by common constructs or distributed among additional physical components. Moreover, it is contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to FIG 4.
Central Server 360
The central server 360 is a hardware and software suite comprising pre-programmed logic, algorithms and other means of processing incoming information, in order to send out information which is useful to the objective of the system 300 in which the central server 360 resides. For the sake of illustration, hardware which can be used by the central server 360 will be described briefly herein.
The central server 360 can broadly comprise a database which stores pertinent information, and processes information packets from the user devices 320. In some embodiments, the central administrator providing a translation service for online meetings/events runs the central server 360. The central server 360 can be operated from a commercially hosted service such as, for example, Amazon Web Services, Alibaba Cloud and so forth.
In one possible embodiment, the central server 360 is represented in a form as shown in FIG 5.
The central server 360 is in communication with a communications network 350, as shown in FIG 3. The central server 360 is able to communicate with the user devices 320, the content generators 380 and/or other processing devices, as required, over the communications network 350. In some instances, the user devices 320 communicate via a direct communication channel (LAN or Wi-Fi) with the central server 360.
The components of the central server 360 can be configured in a variety of ways. The components can be implemented entirely by software to be executed on standard computer server hardware, which may comprise one hardware unit or different computer hardware units distributed over various locations, some of which may require the communications network 350 for communication.
In the example shown in FIG 5, the central server 360 is a commercially available computer system based on a 32-bit or a 64-bit Intel architecture, and the processes and/or methods executed or performed by the central server 360 are implemented in the form of programming instructions of one or more software components or modules 502 stored on non-volatile computer-readable storage 503 associated with the central server 360.
The server 360 includes at least the following standard, commercially available computer components, all interconnected by a bus 505:
1. random access memory (RAM) 506; and
2. at least one central processing unit (CPU) 507.
Although the components depicted in FIG 5 represent physical components, FIG 5 is not intended to be a hardware diagram; thus many of the components depicted in FIG 5 may be realized by common constructs or distributed among additional physical components. Moreover, it is contemplated that other existing and yet-to-be developed physical components and architectures may be utilized to implement the functional components described with reference to FIG 5.
Content Generator 380
It should be appreciated that each content generator 380 should be capable of providing at least one audio recording stream, so as to ensure that there is content to be translated. Typically, each content generator 380 should use a device which is configured to carry out at least the following tasks:
- record an audio stream; and
- connect to the communications network 350.
It should be appreciated that a capability of capturing a video stream can be optional for each content generator 380.
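The minimum capability check implied above can be sketched as follows: a content generator’s device must record an audio stream and connect to the communications network 350, while video capture remains optional. The DeviceCapabilities shape and function name are illustrative assumptions, not from the source.

```typescript
interface DeviceCapabilities {
  audioCapture: boolean;   // can record an audio stream
  networkAccess: boolean;  // can connect to the communications network 350
  videoCapture?: boolean;  // optional, per the note above
}

// A device qualifies as a content generator 380 when it satisfies both
// mandatory tasks; video capture does not affect eligibility.
function canActAsContentGenerator(caps: DeviceCapabilities): boolean {
  return caps.audioCapture && caps.networkAccess;
}
```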
It should be appreciated that the system 300 enables the methods 100/200 to be carried out in a desired manner as described in earlier paragraphs. However, it should also be noted that the methods 100/200 need not be carried out only using the system 300. Other systems may also be used to enable the methods 100/200.
As also mentioned for the method 200, the system 300 enables the same advantages. For example, the flexibility provided by the volume control of the respective language audio streams is a long-desired feature in relation to online meeting/event videos. This capability contributes substantially towards the appreciation of online meeting/event videos, and can also enhance user engagement. This is critical in an era when online live selling and online auctions are growing in popularity. During online live selling and online auctions, enhanced user engagement leads directly to the beneficial consequence of improved sales and increased revenue.
Throughout this specification and claims which follow, unless the context requires otherwise, the word “comprise”, and variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated integer or group of integers or steps but not the exclusion of any other integer or group of integers.
Persons skilled in the art will appreciate that numerous variations and modifications will become apparent. All such variations and modifications which become apparent to persons skilled in the art should be considered to fall within the spirit and scope of the invention as broadly described hereinbefore.

Claims

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A system to manage a plurality of language audio streams, the system comprising at least one data processing device configured to:
transmit, from a content generator, an original language audio stream;
receive, at a user device, the original language audio stream and at least two translated language streams;
activate, at the user device, a browser extension;
select, at the user device, an “on” state for a first translated language stream;
toggle, at the user device, an “off” state for at least a second translated language stream; and
adjust, at the user device, a volume level of the original language stream and the first translated language stream.
2. The system of claim 1, wherein the original language audio stream is obtained from a video stream.
3. The system of claim 2, wherein the at least two translated language streams are transmitted separately from the video stream.
4. The system of any of claims 1-3, wherein the browser extension is configured to provide a graphical user interface.
5. The system of any of claims 1 to 4, wherein the at least two translated language streams are human generated.
6. The system of any of claims 1 to 4, wherein the at least two translated language streams are machine generated.
7. A data processor implemented method to manage a plurality of language audio streams, the method comprising:
transmitting, from a content generator, an original language audio stream;
receiving, at a user device, the original language audio stream and at least two translated language streams;
activating, at the user device, a browser extension;
selecting, at the user device, an “on” state for a first translated language stream;
toggling, at the user device, an “off” state for at least a second translated language stream; and
adjusting, at the user device, a volume level of the original language stream and the first translated language stream.
8. The method of claim 7, wherein the original language audio stream is obtained from a video stream.
9. The method of claim 8, wherein the at least two translated language streams are transmitted separately from the video stream.
10. The method of any of claims 7-9, wherein the browser extension is configured to provide a graphical user interface.
11. The method of any of claims 7 to 10, wherein the at least two translated language streams are human generated.
12. The method of any of claims 7 to 10, wherein the at least two translated language streams are machine generated.
PCT/SG2022/050321 2022-05-13 2022-05-13 A system and method to manage a plurality of language audio streams WO2023219556A1 (en)
Publication Number: WO2023219556A1; Publication Date: 2023-11-16.