WO2023129182A1 - System, method and computer-readable medium for video processing - Google Patents
- Publication number
- WO2023129182A1 (PCT/US2021/073183)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- message
- user
- live video
- region
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/21—Server components or server architectures
- H04N21/218—Source of audio or video content, e.g. local disk arrays
- H04N21/2187—Live feed
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1813—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
- H04L12/1827—Network arrangements for conference optimisation or adaptation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/07—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
- H04L51/10—Multimedia information
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
- H04N21/4728—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
Definitions
- FIG. 5 shows a block diagram of a user terminal according to some embodiments of the present disclosure.
- the user terminal 10S is a user terminal of a streamer or a broadcaster.
- the user terminal 10S includes a live video capturing unit 12, a message reception unit 13, an object identifying unit 14, a region determining unit 15, an enlarging unit 16, and a transmitting unit 17.
- the live video capturing unit 12 includes a camera 122 and a microphone 124, and is configured to capture live video data (including audio data) of the streamer.
- the message reception unit 13 is configured to monitor the voice stream (or, in some embodiments, the image stream) in the live video, and to recognize a predetermined word (for example, “focus” or “zoom-in”) in the voice stream.
- the object identifying unit 14 is configured to identify one or more predetermined objects in the live video, and to recognize the identified one or more objects in the image or the live video.
- the identification of objects may be done by a look-up table and the predetermined word recognized by the message reception unit 13, which will be described later. In another embodiment, the identification of objects may be done by the message reception unit 13.
- the region determining unit 15 is configured to determine a region in the live video to be enlarged.
- the region to be enlarged is a region in the vicinity of the identified or recognized object.
- the enlarging unit 16 is configured to perform video processes related to enlarging a region of a live video.
- the camera 122 may be involved in the enlarging process.
- the transmitting unit 17 is configured to transmit the enlarged live video (or a live video with a region enlarged) to a server (such as a streaming server) if the enlarging process is performed. If an enlarging process is not performed, the transmitting unit 17 transmits the live video captured by the live video capturing unit 12.
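The flow through these units can be sketched as a minimal Python pipeline. This is an illustrative sketch only, not part of the patent text: the class name `StreamerTerminal`, the callable-based wiring and the `process_frame` method are assumptions, while the unit names in the comments follow the patent's FIG. 5.

```python
class StreamerTerminal:
    """Illustrative sketch of the FIG. 5 pipeline of user terminal 10S.
    Each unit is supplied as a callable; real implementations would wrap
    speech recognition, object detection, and video scaling."""

    def __init__(self, recognize_word, identify_object, determine_region,
                 enlarge, transmit):
        self.recognize_word = recognize_word      # message reception unit 13
        self.identify_object = identify_object    # object identifying unit 14
        self.determine_region = determine_region  # region determining unit 15
        self.enlarge = enlarge                    # enlarging unit 16
        self.transmit = transmit                  # transmitting unit 17

    def process_frame(self, frame, transcript):
        # If no predetermined word is heard, the captured frame is
        # transmitted unmodified, as described for transmitting unit 17.
        word = self.recognize_word(transcript)
        if word is None:
            return self.transmit(frame)
        # Otherwise: identify the object, determine its region, enlarge it.
        obj = self.identify_object(word, frame)
        region = self.determine_region(obj, frame)
        return self.transmit(self.enlarge(frame, region))
```

The wiring mirrors the dependency order of units 13 through 17: recognition gates the rest of the pipeline.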
- FIG. 6 shows an exemplary look-up table in accordance with some embodiments of the present disclosure, which may be utilized by the object identifying unit 14 of FIG. 5.
- the column “predetermined word” indicates the words to be identified in the voice stream of the live video.
- the column “object” indicates the object corresponding to each predetermined word to be recognized. In this example, an identified “zoom-in” leads to recognition of the streamer’s hand in the live video; an identified “pan” leads to recognition of a pan in the live video; and an identified “board please” leads to recognition of a chopping board in the live video.
- the predetermined words or the objects are pre-set by a user. In some embodiments, the predetermined words or the objects may be auto-created through AI or machine learning.
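The look-up table of FIG. 6 maps each predetermined word to the object to be recognized. A minimal sketch in Python, using the example entries given above (the dictionary and the helper name `object_for` are illustrative assumptions):

```python
# Example word-to-object entries mirroring FIG. 6.
LOOKUP_TABLE = {
    "zoom-in": "streamer's hand",
    "pan": "pan",
    "board please": "chopping board",
}

def object_for(word: str):
    """Return the object the object identifying unit should recognize for a
    predetermined word, or None if the word is not in the table."""
    return LOOKUP_TABLE.get(word.lower())
```

Auto-created entries (via AI or machine learning, as the text suggests) would simply extend this mapping.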
- processing and procedures described in the present disclosure may be realized by software, hardware, or any combination of these in addition to what was explicitly described.
- the processing and procedures described in the specification may be realized by implementing a logic corresponding to the processing and procedures in a medium such as an integrated circuit, a volatile memory, a non-volatile memory, a non-transitory computer-readable medium and a magnetic disk.
- the processing and procedures described in the specification can be implemented as a computer program corresponding to the processing and procedures, and can be executed by various kinds of computers.
- the system or method described in the above embodiments may be integrated into programs stored in a computer-readable non-transitory medium such as a solid state memory device, an optical disk storage device, or a magnetic disk storage device.
- the programs may be downloaded from a server via the Internet and be executed by processors.
- a person having ordinary skill in the technical field of the present invention may still make many variations and modifications without departing from the teachings of the present disclosure. Therefore, the scope of the present invention is not limited to the embodiments already disclosed, but includes all variations and modifications that do not depart from the present invention, as covered by the scope of the claims.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Human Computer Interaction (AREA)
- General Engineering & Computer Science (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The present disclosure relates to a system, a method and a computer-readable medium for live video processing. The method includes receiving a message from a user, and enlarging a region of the live video in the vicinity of a predetermined object. The present disclosure can facilitate the presenting and focusing of a live video.
Description
SYSTEM, METHOD AND COMPUTER-READABLE MEDIUM FOR VIDEO PROCESSING
Field of the Invention
[0001] The present disclosure relates to video processing in video streaming.
Description of the Prior Art
[0002] Various technologies for enabling users to participate in mutual on-line communication are known. The applications include live streaming, live conference calls and the like. As these applications increase in popularity, user demands for improved communication efficiency and better understanding of each other’s messages during the communication are rising.
Summary of the Invention
[0003] A method according to one embodiment of the present disclosure is a method for live video processing. The method includes receiving a message from a user, and enlarging a region of the live video in the vicinity of a predetermined object.
[0004] A system according to one embodiment of the present disclosure is a system for live video processing that includes one or a plurality of processors, and the one or plurality of processors execute a machine-readable instruction to perform: receiving a message from a user, and enlarging a region of the live video in the vicinity of a predetermined object.
[0005] A computer-readable medium according to one embodiment of the present disclosure is a non-transitory computer-readable medium including a program for live video processing, and the program causes one or a plurality of computers to execute: receiving a message from a user, and enlarging a region of the live video in the vicinity of a predetermined object.
Brief description of the drawings
[0006] FIG. 1 shows an example of a live streaming.
[0007] FIG. 2A, FIG. 2B, FIG. 2C, and FIG. 2D show exemplary streamings in accordance with some embodiments of the present disclosure.
[0008] FIG. 3 shows an exemplary streaming in accordance with some embodiments of the present disclosure.
[0009] FIG. 4 shows a schematic configuration of a communication system according to some embodiments of the present disclosure.
[0010] FIG. 5 shows a block diagram of a user terminal according to some embodiments of the present disclosure.
[0011] FIG. 6 shows an exemplary look-up table in accordance with some embodiments of the present disclosure.
Detailed Description
[0012] Conventionally, compared with face-to-face communication, on-line communication has some disadvantages which may reduce the communication efficiency or increase the chances of misunderstanding. For example, during a live video or a live streaming communication, it is difficult to keep the focus on the correct region, especially when there are distractions such as comments or special effects on the display on which the live video is being shown. For another example, during a live video or a live streaming communication, it is difficult to see the details of the video content due to the limited size of the display or the limited resolution of the video.
[0013] FIG. 1 shows an example of a live streaming. S1 is a screen of a user terminal displaying the live streaming. RA is a display region within the screen S1 displaying a live video of a user A. The live video of user A may be taken and provided by a video capturing device, such as a camera, positioned in the vicinity of user A. In this example, user A may be a streamer or a broadcaster who is distributing a live video to teach how to cook.
[0014] User A would like viewers of this live video to be able to focus on the right region of the video, and to be able to see the details of the region, in order for the viewers to get the correct knowledge such as cooking steps or cooking materials. Conventionally, user A may need to bring the object of interest (such as a pan or a chopping board) closer to the camera for viewers to see it clearly. Alternatively, user A may need to adjust the direction, position or focus of the camera to show the details user A wants to emphasize. These actions are inconvenient for user A and interrupt the cooking process.
[0015] Therefore, it is desirable to have a method by which a user can indicate the region of interest in the live video and present the details of the region without having to stop the ongoing process. It is also desirable to have a method to help a viewer focus on the correct region of a live video and to see the details of the region. The present disclosure can facilitate the presenting and focusing of a live video.
[0016] FIG. 2A, FIG. 2B, FIG. 2C, and FIG. 2D show exemplary streamings in accordance with some embodiments of the present disclosure.
[0017] Referring to FIG. 2A, user A sends out a message or a signal M1. In this embodiment, the message M1 is a voice message indicating “zoom in.” In other embodiments, the message M1 may be a gesture message expressed by user A. For example, user A may use a body portion (such as a hand) to form a gesture message. In some embodiments, the message M1 may be a facial expression message expressed by user A. The message M1 is part of the video (including audio data) of user A.
[0018] The message M1 may be received by a user terminal used to capture the video of user A, such as a smartphone, a tablet, a laptop or any device with a video capturing function. In some embodiments, the message M1 is recognized by a user terminal used to produce or deliver the video of user A. In some embodiments, the message M1 is recognized by a system that provides the streaming service. In some embodiments, the message M1 is recognized by a server that supports the streaming service. In some embodiments, the message M1 is recognized by an application that supports the streaming service. In some embodiments, the message M1 is recognized by a voice recognition process, a gesture recognition process and/or a facial expression recognition process. In some embodiments, the message M1 may be an electrical signal, and can be transmitted and received over wireless connections.
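The voice-message branch of this recognition step can be sketched in a few lines. Everything here is an illustrative assumption rather than the patent's implementation: a real system would run an actual speech-recognition process on the audio stream, whereas this sketch matches predetermined words against an already-transcribed utterance, and the names `TRIGGER_WORDS` and `detect_trigger` are hypothetical.

```python
from typing import Optional

# Assumed vocabulary of predetermined voice messages.
TRIGGER_WORDS = ("zoom in", "focus")

def detect_trigger(transcript: str) -> Optional[str]:
    """Return the first predetermined word found in a transcribed
    utterance, or None if no trigger message is present."""
    text = transcript.lower()
    for word in TRIGGER_WORDS:
        if word in text:
            return word
    return None
```

The same shape applies to gesture or facial-expression messages, with the transcript replaced by the output of a gesture or expression classifier.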
[0019] Referring to FIG. 2B, objects O1 are recognized, and a region R1 is determined. The objects O1 are recognized according to the message M1. In some embodiments, the recognition of the object O1 follows the receiving of the message M1. In some embodiments, the receiving of the message M1 triggers the recognition of the object O1. In some embodiments, recognition of the message M1 is performed before the recognition of the object O1.
[0020] In this embodiment, the object O1 is set, taught or determined to be a body part (hands) of user A. In other embodiments, the object O1 may be determined to be a non-body object such as a chopping board or a pan. In some embodiments, the object O1 may be determined to be a wearable object on user A such as a watch, a bracelet or a sticker. The object O1 may be predetermined or set to be any object in the video of user A.
[0021] The region R1 is determined to be a region in the vicinity of the object O1. For example, the region R1 may be determined to be a region enclosing or surrounding all objects O1, so that user A can conveniently control the size of the region R1 by controlling the positions of the objects O1 (in this case, her hands). A distance between an edge of the region R1 and the object O1 may be set according to practical needs.
[0022] In some embodiments, different messages M1 may correspond to different predetermined objects O1. For example, user A may choose the object to be recognized, and the region to be determined, simply by sending out the corresponding message. For example, user A may speak “pan,” and then a pan (which is a predetermined object corresponding to the message “pan”) is recognized, and the region R1 is determined to be a region in the vicinity of the pan.
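The region-determination step described above — a region enclosing all recognized objects, with some margin — can be sketched as follows. The box convention `(x, y, width, height)`, the margin value and the function name `region_around` are illustrative assumptions, not from the patent.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed convention: (x, y, width, height)

def region_around(objects: List[Box], margin: int,
                  frame_w: int, frame_h: int) -> Box:
    """Smallest rectangle enclosing every detected object, padded by
    `margin` pixels and clamped to the frame, i.e. a region 'in the
    vicinity of' the recognized objects."""
    left   = min(x for x, y, w, h in objects)
    top    = min(y for x, y, w, h in objects)
    right  = max(x + w for x, y, w, h in objects)
    bottom = max(y + h for x, y, w, h in objects)
    # Pad by the margin, then clamp to the frame boundaries.
    left   = max(0, left - margin)
    top    = max(0, top - margin)
    right  = min(frame_w, right + margin)
    bottom = min(frame_h, bottom + margin)
    return (left, top, right - left, bottom - top)
```

Because the enclosing rectangle grows and shrinks with the object positions, moving the two hands apart directly enlarges the region, as described above.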
[0023] In some embodiments, an object O1 is recognized by a user terminal used to capture the live video of user A. In some embodiments, an object O1 is recognized by a user terminal used to produce or deliver the video of user A. In some embodiments, an object O1 is recognized by a system that provides the streaming service. In some embodiments, an object O1 is recognized by a server that supports the streaming service. In some embodiments, an object O1 is recognized by an application that supports the streaming service.
[0024] In some embodiments, the region R1 is determined by a user terminal used to capture the live video of user A. In some embodiments, the region R1 is determined by a user terminal used to produce or deliver the video of user A. In some embodiments, the region R1 is determined by a system that provides the streaming service. In some embodiments, the region R1 is determined by a server that supports the streaming service. In some embodiments, the region R1 is determined by an application that supports the streaming service.
[0025] Referring to FIG. 2C, the region R1 is enlarged such that details of the video content within the region R1 can be seen clearly. The enlarged region R1 may cover or overlap a portion of the video of user A that is outside the region R1. The enlarged region R1 may be displayed on any region of the screen S1.
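The enlargement itself can be illustrated with a deliberately simple nearest-neighbour scaler. In practice this would be done by a video-processing library or a hardware scaler; the frame representation here (a 2-D list of pixel values) and the function name are assumptions for illustration only.

```python
def enlarge_region(frame, region, scale):
    """Nearest-neighbour enlargement of `region` = (x, y, w, h) within
    `frame`, a 2-D list of pixel values.  Returns the enlarged crop as a
    new 2-D list; the caller decides where on the screen to display it."""
    x, y, w, h = region
    out = []
    for j in range(h * scale):
        row = []
        for i in range(w * scale):
            # Each source pixel is repeated `scale` times in each axis.
            row.append(frame[y + j // scale][x + i // scale])
        out.append(row)
    return out
```

The enlarged crop can then be composited over the original frame, covering part of the video outside R1 as the text describes.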
[0026] In some embodiments, the enlarging process is performed by a user terminal used to capture the live video of user A. In some embodiments, the enlarging process is performed by a user terminal used to produce or deliver the video of user A. In some embodiments, the enlarging process is performed by a system that provides the streaming service. In some embodiments, the enlarging process is performed by a server that supports the streaming service. In some embodiments, the enlarging process is performed by an application that supports the streaming service. In some embodiments, the enlarging process is performed by a user terminal displaying the video of user A, such as a user terminal of a viewer.
[0027] In an embodiment wherein the enlarging process is performed by a user terminal that captures the video of user A, the user terminal can be configured to capture the region R1 (which may move according to a movement of an object O1) with a higher resolution than the region outside of the region R1. The region of the live video to be enlarged therefore has a higher resolution than the rest of the live video, so the emphasized region carries more information for a viewer to see the details.
[0028] Referring to FIG. 2D, in some embodiments, regions within the display region RA other than the enlarged region R1 may be processed such that the enlarged region R1 stands out. For example, the other regions may be darkened or blurred, such that a viewer can focus more easily on the region R1.
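The darkening variant of this processing can be sketched as a per-pixel brightness scale applied outside the region. The grey-level frame representation and the `factor` value are illustrative assumptions.

```python
def dim_outside(frame, region, factor=0.4):
    """Darken every pixel outside `region` = (x, y, w, h) so the region of
    interest stands out.  `frame` is a 2-D list of grey levels; `factor`
    is an assumed attenuation (blurring would be an alternative)."""
    x, y, w, h = region
    return [
        [px if (x <= i < x + w and y <= j < y + h) else int(px * factor)
         for i, px in enumerate(row)]
        for j, row in enumerate(frame)
    ]
```

Replacing the multiplication with a local blur kernel would give the blurred variant mentioned in the same paragraph.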
[0029] FIG. 3 shows an exemplary streaming in accordance with some embodiments of the present disclosure.
[0030] Referring to FIG. 3, the object O1 is determined to be a wearable device or a wearable object on user A. The object O1 moves synchronously with a movement of user A, and the region of the live video to be enlarged moves synchronously with a movement of the object O1. Therefore, it is convenient for user A to determine which region is to be enlarged or emphasized simply by controlling the position of the object O1. In some embodiments, enlarging a region of a live video and/or moving the enlarged region are performed by video processes executed by a user terminal, a server, or an application. Therefore, the direction of the video capturing device used to capture the live video can be kept fixed while the region of the live video to be enlarged moves synchronously with the movement of the predetermined object.
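Having the enlarged region follow the wearable object in software, with the camera fixed, amounts to re-centering the crop on the tracked object each frame. A common refinement, sketched here as an assumption (the patent does not specify a smoothing method), is exponential smoothing of the region centre so the crop does not jitter with every small detection change.

```python
def follow_object(prev_center, obj_center, alpha=0.3):
    """Exponentially smoothed update of the enlarged region's centre
    toward the tracked object's centre.  `alpha` (an assumed value)
    trades responsiveness against jitter; the camera itself never moves."""
    px, py = prev_center
    ox, oy = obj_center
    return (px + alpha * (ox - px), py + alpha * (oy - py))
```

Called once per frame with the detector's latest object position, this keeps the enlarged region gliding after the wearable object rather than snapping to it.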
[0031] In some embodiments, a user may send out a first message to trigger a message recognition process, and then send out a second message to indicate which object to recognize. The recognized object then determines the region to be enlarged. The first message and/or the second message can be or can include a voice message, a gesture message, or a facial expression message. In some embodiments, the first message can be referred to as a trigger message.
[0032] For example, user A may speak “focus” or “zoom in” to indicate that whatever he or she sends out next is for recognizing the object O1. Next, user A may speak “pan” such that a pan in the video would be recognized as the object O1. Subsequently, a region in the vicinity of the pan would be enlarged.
[0033] In some embodiments, the above configuration may save the resources used in message recognition. For example, a constantly ongoing message recognition process (which may include comparing the video information with a message table) can focus only on the first message, which may be a single voice message. The second message may have more variants, each corresponding to a different object in the video. The message recognition process for the second message can be turned on only when the first message is received and/or detected.
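The gating described in paragraphs [0031]–[0033] can be sketched as a small two-stage state machine: while idle, only the trigger words are matched; after a trigger message arrives, the (larger) word-to-object table is consulted once. This is an illustrative Python sketch; the class name and the particular word/object entries are assumptions drawn from the examples in the text, not a definitive implementation:

```python
# Word-to-object table consulted by the second stage (illustrative entries
# taken from the examples in the text).
OBJECT_TABLE = {"pan": "pan", "board please": "chopping board"}
# Trigger (first) messages that arm the second-stage recognition.
TRIGGER_WORDS = {"focus", "zoom in"}

class TwoStageRecognizer:
    """Gate the more expensive object recognition behind a trigger word."""

    def __init__(self):
        self.armed = False  # True after a trigger message is detected

    def feed(self, message: str):
        message = message.lower().strip()
        if not self.armed:
            # Stage 1: while idle, only the trigger words are matched.
            if message in TRIGGER_WORDS:
                self.armed = True
            return None
        # Stage 2: match the second message against the object table,
        # then return to the idle state.
        self.armed = False
        return OBJECT_TABLE.get(message)

rec = TwoStageRecognizer()
rec.feed("pan")            # ignored: recognizer not yet triggered
rec.feed("focus")          # trigger message arms stage 2
target = rec.feed("pan")   # the pan is now recognized as the object
```

Because stage 1 matches only a handful of trigger words, the always-on part of the recognizer stays cheap, which reflects the resource saving described above.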
[0034] FIG. 4 shows a schematic configuration of a communication system according to some embodiments of the present disclosure. The communication system 1 may provide a live streaming service with interaction via a content. Here, the term “content” refers to a digital content that can be played on a computer device. The communication system 1 enables a user to participate in real-time interaction with other users on-line. The communication system 1 includes a plurality of user terminals 10, a backend server 30, and a streaming server 40. The user terminals 10, the backend server 30 and the streaming server 40 are connected via a network 90, which may be the Internet, for example. The backend server 30 may be a server for synchronizing interaction between the user terminals and/ or the streaming server 40. In some embodiments, the backend server 30 may be referred to as the origin server of an application (APP) provider. The streaming server 40 is a server for handling or providing streaming data or video data. In some embodiments, the backend server 30 and the streaming server 40 may be independent servers. In some embodiments, the backend server 30 and the streaming server 40 may be integrated into one server. In some embodiments, the user terminals 10 are client devices for the live streaming. In some embodiments, a user terminal 10 may be referred to as viewer, streamer, anchor, podcaster, audience, listener or the like. Each of the user terminals 10, the backend server 30, and the streaming server 40 is an example of an information-processing device. In some embodiments, the streaming may be live streaming or video replay. In some embodiments, the streaming may be audio streaming and/or video streaming. In some embodiments, the streaming may include contents such as online shopping, talk shows, talent shows, entertainment events, sports events, music videos, movies, comedy, concerts, group calls, conference calls or the like.
[0035] FIG. 5 shows a block diagram of a user terminal according to some embodiments of the present disclosure.
[0036] The user terminal 10S is a user terminal of a streamer or a broadcaster. The user terminal 10S includes a live video capturing unit 12, a message reception unit 13, an object identifying unit 14, a region determining unit 15, an enlarging unit 16, and a transmitting unit 17.
[0037] The live video capturing unit 12 includes a camera 122 and a microphone 124, and is configured to capture live video data (including audio data) of the streamer.
[0038] The message reception unit 13 is configured to monitor a voice stream (or an image stream in some embodiments) in the live video, and to recognize a predetermined word (for example, “focus” or “zoom-in”) in the voice stream.
[0039] The object identifying unit 14 is configured to identify one or more predetermined objects in the live video, and to recognize the identified one or more objects in the image or the live video. The identification of objects may be done using a look-up table and the predetermined word recognized by the message reception unit 13, as will be described later. In another embodiment, the identification of objects may be done by the message reception unit 13.
[0040] The region determining unit 15 is configured to determine a region in the live video to be enlarged. The region to be enlarged is a region in the vicinity of the identified or recognized object.
[0041] The enlarging unit 16 is configured to perform video processes related to enlarging a region of a live video. In an embodiment wherein the region to be enlarged is captured with a higher resolution, the camera 122 may be involved in the enlarging process.
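The video processing performed by the region determining unit 15 and the enlarging unit 16 can be illustrated by cropping a window around the recognized object and upscaling it. The sketch below uses integer nearest-neighbour repetition on a NumPy array as an illustrative stand-in for the actual video processing; the function name, window size, and scale factor are assumptions, not part of the disclosure:

```python
import numpy as np

def enlarge_region(frame: np.ndarray, cx: int, cy: int,
                   half: int, scale: int) -> np.ndarray:
    """Crop a (2*half) x (2*half) window centred on the recognized object
    at (cx, cy) and upscale it by nearest-neighbour repetition."""
    crop = frame[cy - half:cy + half, cx - half:cx + half]
    # Repeat each pixel 'scale' times along both axes.
    return np.repeat(np.repeat(crop, scale, axis=0), scale, axis=1)

frame = np.arange(64, dtype=np.uint8).reshape(8, 8)
zoomed = enlarge_region(frame, cx=4, cy=4, half=2, scale=2)
# zoomed is 8x8: the 4x4 region around the centre, doubled in size.
```

In a production pipeline the nearest-neighbour step would typically be replaced by a higher-quality interpolation, and the centre (cx, cy) would track the recognized object frame by frame, which is why the capturing direction can stay fixed.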
[0042] The transmitting unit 17 is configured to transmit the enlarged live video (or a live video with a region enlarged) to a server (such as a streaming server) if the enlarging process is performed. If an enlarging process is not performed, the transmitting unit 17 transmits the live video captured by the live video capturing unit 12.
[0043] FIG. 6 shows an exemplary look-up table in accordance with some embodiments of the present disclosure, which may be utilized by the object identifying unit 14 of FIG. 5.

[0044] The column “predetermined word” indicates the words to be identified in the voice stream of the live video. The column “object” indicates the object corresponding to each predetermined word to be recognized. In this example, an identified “zoom-in” leads to recognition of the streamer’s hand in the live video, an identified “pan” leads to recognition of a pan in the live video, and an identified “board please” leads to recognition of a chopping board in the live video.
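The look-up table of FIG. 6 can be sketched as a simple dictionary mapping each predetermined word to the object to be recognized. The entries below reproduce the examples given in the text; the function name is an illustrative assumption:

```python
# Illustrative reproduction of the FIG. 6 look-up table: predetermined
# word -> object to be recognized in the live video.
LOOKUP_TABLE = {
    "zoom-in": "streamer's hand",
    "pan": "pan",
    "board please": "chopping board",
}

def object_for_word(word: str):
    """Return the object associated with a recognized word, or None."""
    return LOOKUP_TABLE.get(word.lower().strip())
```

A recognized word outside the table simply yields no object, so unrelated speech does not trigger any enlarging.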
[0045] In some embodiments, the predetermined words or the objects are pre-set by a user. In some embodiments, the predetermined words or the objects may be auto-created through AI or machine learning.
[0046] The processing and procedures described in the present disclosure may be realized by software, hardware, or any combination of these in addition to what was explicitly described. For example, the processing and procedures described in the specification may be realized by implementing a logic corresponding to the processing and procedures in a medium
such as an integrated circuit, a volatile memory, a non-volatile memory, a non-transitory computer-readable medium and a magnetic disk. Further, the processing and procedures described in the specification can be implemented as a computer program corresponding to the processing and procedures, and can be executed by various kinds of computers.
[0047] The system or method described in the above embodiments may be integrated into programs stored in a computer-readable non-transitory medium such as a solid state memory device, an optical disk storage device, or a magnetic disk storage device. Alternatively, the programs may be downloaded from a server via the Internet and be executed by processors.

[0048] Although technical content and features of the present invention are described above, a person having common knowledge in the technical field of the present invention may still make many variations and modifications without departing from the teaching and disclosure of the present invention. Therefore, the scope of the present invention is not limited to the embodiments already disclosed, but includes variations and modifications that do not depart from the present invention, and is defined by the scope of the claims.
Description of reference numerals
S1 Screen
RA Region
O1 Object
R1 Region
1 System
10 User terminal
30 Backend server
40 Streaming server
90 Network
10S User terminal
12 Live video capturing unit
122 Camera
124 Microphone
13 Message reception unit
14 Object identifying unit
15 Region determining unit
16 Enlarging unit
17 Transmitting unit
Claims
1. A method for live video processing, comprising: receiving a message from a user while live video created by the user is being broadcasted; and enlarging a region of the live video in the vicinity of a predetermined object according to the message.
2. The method according to claim 1, further comprising recognizing the predetermined object in the live video according to the message.
3. The method according to claim 2, further comprising receiving a trigger message from the user, wherein the trigger message triggers the recognizing the predetermined object in the live video according to the message.
4. The method according to claim 1, wherein the message comprises a voice message, a gesture message, or a facial expression message.
5. The method according to claim 1, further comprising recognizing the message from the user.
6. The method according to claim 5, wherein the recognizing the message from the user comprises a voice recognition process, a gesture recognition process, or a facial expression recognition process.
7. The method according to claim 1, wherein the predetermined object comprises a body part of the user or a wearable object on the user.
8. The method according to claim 1, wherein the predetermined object moves synchronously with a movement of the user.
9. The method according to claim 1, wherein the message corresponds to the predetermined object.
10. The method according to claim 1, wherein the region of the live video to be enlarged
is captured by a video capturing device with a higher resolution than another region of the live video not to be enlarged.
11. The method according to claim 1, wherein the region of the live video to be enlarged moves synchronously with a movement of the predetermined object.
12. The method according to claim 11, wherein the live video is generated by a video capturing device in the vicinity of the user, and a direction of the video capturing device is kept fixed when the region of the live video to be enlarged moves synchronously with the movement of the predetermined object.
13. A system for live video processing, comprising one or a plurality of processors, wherein the one or plurality of processors execute a machine-readable instruction to perform: receiving a message from a user while live video created by the user is being broadcasted; and enlarging a region of the live video in the vicinity of a predetermined object according to the message.
14. A non-transitory computer-readable medium including a program for live video processing, wherein the program causes one or a plurality of computers to execute: receiving a message from a user while live video created by the user is being broadcasted; and enlarging a region of the live video in the vicinity of a predetermined object according to the message.
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2021/073183 WO2023129182A1 (en) | 2021-12-30 | 2021-12-30 | System, method and computer-readable medium for video processing |
| JP2022528663A JP7449519B2 (en) | 2021-12-30 | 2021-12-30 | Systems, methods, and computer-readable media for video processing |
| US17/881,743 US12413685B2 (en) | 2021-09-30 | 2022-08-05 | System, method and computer-readable medium for video processing |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2021/073183 WO2023129182A1 (en) | 2021-12-30 | 2021-12-30 | System, method and computer-readable medium for video processing |
Related Child Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2021/073182 Continuation-In-Part WO2023129181A1 (en) | 2021-09-30 | 2021-12-30 | System, method and computer-readable medium for image recognition |
| US17/881,743 Continuation-In-Part US12413685B2 (en) | 2021-09-30 | 2022-08-05 | System, method and computer-readable medium for video processing |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023129182A1 true WO2023129182A1 (en) | 2023-07-06 |
Family
ID=87000027
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2021/073183 Ceased WO2023129182A1 (en) | 2021-09-30 | 2021-12-30 | System, method and computer-readable medium for video processing |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP7449519B2 (en) |
| WO (1) | WO2023129182A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160205341A1 (en) * | 2013-08-20 | 2016-07-14 | Smarter Tv Ltd. | System and method for real-time processing of ultra-high resolution digital video |
| US20210099505A1 (en) * | 2013-02-13 | 2021-04-01 | Guy Ravine | Techniques for Optimizing the Display of Videos |
| US20210365707A1 (en) * | 2020-05-20 | 2021-11-25 | Qualcomm Incorporated | Maintaining fixed sizes for target objects in frames |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP4232419B2 (en) * | 2001-09-18 | 2009-03-04 | ソニー株式会社 | TRANSMISSION DEVICE, TRANSMISSION METHOD, CONTENT DISTRIBUTION DEVICE, CONTENT DISTRIBUTION METHOD, AND PROGRAM |
| CN107851334A (en) * | 2015-08-06 | 2018-03-27 | 索尼互动娱乐股份有限公司 | Information processor |
| JP2020021225A (en) * | 2018-07-31 | 2020-02-06 | 株式会社ニコン | Display control system, display control method, and display control program |
| TW202133118A (en) * | 2020-02-21 | 2021-09-01 | 四葉草娛樂有限公司 | Panoramic reality simulation system and method thereof with which the user may feel like arbitrary passing through the 3D space so as to achieve the entertainment enjoyment with immersive effect |
2021
- 2021-12-30 WO PCT/US2021/073183 patent/WO2023129182A1/en not_active Ceased
- 2021-12-30 JP JP2022528663A patent/JP7449519B2/en active Active
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210099505A1 (en) * | 2013-02-13 | 2021-04-01 | Guy Ravine | Techniques for Optimizing the Display of Videos |
| US20160205341A1 (en) * | 2013-08-20 | 2016-07-14 | Smarter Tv Ltd. | System and method for real-time processing of ultra-high resolution digital video |
| US20210365707A1 (en) * | 2020-05-20 | 2021-11-25 | Qualcomm Incorporated | Maintaining fixed sizes for target objects in frames |
Also Published As
| Publication number | Publication date |
|---|---|
| JP7449519B2 (en) | 2024-03-14 |
| JP2024501091A (en) | 2024-01-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20130205322A1 (en) | Method and system for synchronization of dial testing and audience response utilizing automatic content recognition | |
| US20150334344A1 (en) | Virtual Window | |
| CN111343476A (en) | Video sharing method and device, electronic equipment and storage medium | |
| US20180077461A1 (en) | Electronic device, interractive mehotd therefor, user terminal and server | |
| US9736518B2 (en) | Content streaming and broadcasting | |
| JP6289651B2 (en) | Method and apparatus for synchronizing playback on two electronic devices | |
| US9756373B2 (en) | Content streaming and broadcasting | |
| EP3316582B1 (en) | Multimedia information processing method and system, standardized server and live broadcast terminal | |
| US20150029342A1 (en) | Broadcasting providing apparatus, broadcasting providing system, and method of providing broadcasting thereof | |
| US20240296195A1 (en) | System, method and computer-readable medium for recommendation | |
| WO2021204139A1 (en) | Video displaying method, device, equipment, and storage medium | |
| CN114095671A (en) | Cloud conference live broadcast system, method, device, device and medium | |
| WO2014190655A1 (en) | Application synchronization method, application server and terminal | |
| CN116437147A (en) | Live broadcast task interaction method and device, electronic equipment and storage medium | |
| WO2015035247A1 (en) | Virtual window | |
| CN108401163B (en) | Method and device for realizing VR live broadcast and OTT service system | |
| US9332206B2 (en) | Frame sharing | |
| US12413685B2 (en) | System, method and computer-readable medium for video processing | |
| WO2023129182A1 (en) | System, method and computer-readable medium for video processing | |
| US11825170B2 (en) | Apparatus and associated methods for presentation of comments | |
| CN108521579A (en) | The display methods and device of barrage information | |
| TW202327366A (en) | System, method and computer-readable medium for video processing | |
| US20170094367A1 (en) | Text Data Associated With Separate Multimedia Content Transmission | |
| CN105187934A (en) | Terminal platform for television interactive system | |
| CN112714331B (en) | Information prompting method and device, storage medium and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: 2022528663; Country of ref document: JP |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21970189; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 21970189; Country of ref document: EP; Kind code of ref document: A1 |