Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" or "an" and "a plurality of" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that these terms should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a schematic diagram of an application scenario of the method of displaying an image according to some embodiments of the present disclosure.
In the application scenario of fig. 1, the electronic device 101 extracts audio information and an image to be processed from a video to be processed. The image to be processed contains two persons, and the audio information is voice information spoken by a first person to a second person. The electronic device 101 may identify the target pronunciation information 102 from the audio information and the target object image 103 corresponding to the target pronunciation information 102. The target pronunciation information 102 may be, for example, "watch a movie", "go shopping", etc. The target object image 103 may be an ear image of the second person. Then, the electronic device 101 may establish a correspondence relationship between the target pronunciation information 102 and the target object image 103, which may be used to indicate that "watch a movie" or "go shopping" acts on the second person's ear. In order to enhance the visual effect when the user views the video to be processed, the electronic device 101 may set a target image 104 for the target object image 103. The target image 104 may be a pair of rabbit ears that exaggeratedly represents that the second person is listening to the first person's voice information, thereby enhancing the user's visual impression. When the target pronunciation information 102 is played (i.e., triggered) in the video to be processed, the electronic device 101 may display the target image 104.
It should be understood that the number of electronic devices 101 in fig. 1 is merely illustrative. There may be any number of electronic devices 101, as desired for implementation.
With continued reference to fig. 2, fig. 2 illustrates a flow 200 of some embodiments of a method of displaying an image according to the present disclosure. The method for displaying the image comprises the following steps:
In step 201, audio information and an image to be processed are extracted from a video to be processed.
In some embodiments, the execution subject of the method of displaying an image (e.g., the electronic device 101 shown in fig. 1) may extract audio information and an image to be processed from a video to be processed by means of a wired connection or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G/5G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a ZigBee connection, a UWB (ultra wideband) connection, and other now known or later developed wireless connections.
The execution subject may first acquire the video to be processed. The video to be processed may be a real-time video containing a facial image, or a non-real-time video containing a facial image. The facial image may be a human facial image, an animal facial image, a statue facial image, or the like. The execution subject may then extract the audio information and the image to be processed from the video to be processed. The audio information may include at least one pronunciation information. The pronunciation information may be audio of various kinds; for example, the pronunciation information may be human voice audio, bird call audio, car engine audio, environmental noise, etc. The image to be processed may include an object image corresponding to pronunciation information in the at least one pronunciation information. The execution subject may perform image recognition on the image to be processed, first identifying a head image of a person, animal, or statue in the image to be processed, and then specifically identifying each portion (for example, eyes, nose, mouth, etc.) included in the head image. Some pronunciation information may have a corresponding object image, while other pronunciation information may not. For example, when the pronunciation information is a voice uttered by a first person to a second person, the object image may be an ear image of the second person; when the pronunciation information is environmental noise, there may be no corresponding object image.
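As an illustration of the extraction in step 201, the following minimal sketch uses the moviepy (1.x) and OpenCV packages to pull the audio track and a single frame out of a hypothetical file "video_to_process.mp4" and to locate face regions in that frame; the file name, the sampled time, and the detector choice are assumptions rather than part of the disclosed method.

```python
# A sketch only: moviepy 1.x and OpenCV are assumed to be installed, and
# "video_to_process.mp4" is a hypothetical input file.
import cv2
from moviepy.editor import VideoFileClip

clip = VideoFileClip("video_to_process.mp4")         # the video to be processed
clip.audio.write_audiofile("audio_information.wav")  # extract the audio information

frame = clip.get_frame(t=1.0)                        # one image to be processed (RGB array)
gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)

# First identify head/face regions; finer parts (eyes, mouth, ears, ...) can then
# be located inside each region with additional detectors or landmark models.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print("detected face regions:", faces)
```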
Step 202, establishing a correspondence between the target pronunciation information contained in the audio information and the target object image contained in the image to be processed.
As can be seen from the above description, some pronunciation information may have a corresponding object image. For convenience of analysis, the execution subject may set the pronunciation information for which an object image exists as target pronunciation information, and set that object image as the target object image corresponding to the target pronunciation information. After that, the execution subject may establish a correspondence relationship between the target pronunciation information and the target object image. It should be noted that one target pronunciation information may have a plurality of target object images. For example, the video to be processed is a classroom video in which one teacher gives a lesson and a plurality of students listen. Correspondingly, the target pronunciation information may be the teacher's voice, and the target object images may be the ear images of the students. A plurality of target pronunciation information may also correspond to one target object image. For example, a plurality of students recite lessons at the same time while the teacher listens. Correspondingly, the target pronunciation information is the voice of each student, and the target object image may be an ear image of the teacher. A plurality of target pronunciation information may also correspond to a plurality of target object images. For example, a plurality of people sing a song at the same time, and a plurality of listeners listen to the song. Correspondingly, the target pronunciation information is the singing voice of each person, and the target object images are the ear images of the listeners. In this way, the information transfer relationship is clarified, which improves the efficiency with which the user acquires information.
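A possible way to record such correspondences is a simple one-to-many mapping from pronunciation information to target object images, as in the sketch below; all identifiers are illustrative, and the same structure covers the one-to-many, many-to-one, and many-to-many cases described above.

```python
# A sketch only: the mapping records which target object images each target
# pronunciation information acts on. All identifiers are hypothetical.
from collections import defaultdict

correspondence = defaultdict(list)

def link(pronunciation_id: str, object_image_id: str) -> None:
    """Record that the pronunciation acts on the given object image."""
    correspondence[pronunciation_id].append(object_image_id)

# One teacher speaking, several students listening (one-to-many):
link("teacher_voice", "student_1_ear")
link("teacher_voice", "student_2_ear")

# Several students speaking, one teacher listening (many-to-one):
link("student_1_voice", "teacher_ear")
link("student_2_voice", "teacher_ear")

print(dict(correspondence))
```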
Step 203, setting a target image corresponding to each target object image contained in the image to be processed.
In order to enhance the user's experience of watching the video, the execution subject may set a target image for each target object image. The target image may be an image obtained by performing image processing on the target object image. For example, the target image may be an enlarged image, a deformed image, or the like of the target object image. The target image may also be a substitute image for the corresponding target object image; as in fig. 1, the "ear" of the second person is represented by a "rabbit ear". In addition, the target image may be an image with additional special effects (e.g., a train whistle, a flashing star, etc.).
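The target image of step 203 could, for example, be produced as in the following sketch, which either enlarges the target object image or replaces it with a substitute such as the rabbit ears of fig. 1; OpenCV, the scale factor, and the file "rabbit_ears.png" are assumptions.

```python
# A sketch only, assuming OpenCV and a cropped target object image (e.g. an ear
# region) given as a NumPy array; "rabbit_ears.png" is a hypothetical asset.
import cv2

def make_target_image(object_image, mode="enlarge", scale=1.5,
                      substitute_path="rabbit_ears.png"):
    if mode == "enlarge":
        # Enlarged version of the target object image.
        return cv2.resize(object_image, None, fx=scale, fy=scale,
                          interpolation=cv2.INTER_LINEAR)
    if mode == "substitute":
        # Substitute image (e.g. rabbit ears) resized to the object region.
        substitute = cv2.imread(substitute_path, cv2.IMREAD_UNCHANGED)
        height, width = object_image.shape[:2]
        return cv2.resize(substitute, (width, height))
    raise ValueError(f"unknown mode: {mode}")
```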
And step 204, in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed.
When the target pronunciation information in the audio information is triggered (for example, when the target pronunciation information is played), the execution subject may display the target image corresponding to the target pronunciation information in the video to be processed. In this way, the user can clearly determine the information transfer relationship between the target pronunciation information and the target object image, the visual effect is enhanced, and the accuracy and effectiveness of the information in the video to be processed are improved.
According to the method for displaying an image disclosed in some embodiments of the present disclosure, audio information and an image to be processed are first extracted from a video to be processed; then, a correspondence between target pronunciation information contained in the audio information and a target object image contained in the image to be processed is established, clarifying the information transfer relationship. Next, a target image corresponding to each target object image contained in the image to be processed is set, which helps enhance the user's experience of watching the video. Finally, when the target pronunciation information in the audio information is triggered, the target image corresponding to the target pronunciation information is displayed in the video to be processed. The method enables the user to clearly determine the information transfer relationship between the target pronunciation information and the target object image, enhances the visual effect, and helps improve the effectiveness with which the user acquires information from the video to be processed.
With continued reference to fig. 3, fig. 3 illustrates a flow 300 of some embodiments of a method of displaying an image according to the present disclosure. The method for displaying the image comprises the following steps:
In step 301, audio information and an image to be processed are extracted from a video to be processed.
The content of step 301 is the same as that of step 201, and will not be described in detail here.
In some optional implementations of some embodiments, the extracting the audio information and the image to be processed from the video to be processed may include the following steps:
First, at least one object content in the image to be processed is identified.
The execution subject may recognize at least one object content in the image to be processed by various image recognition methods. Wherein the object content includes at least one of: facial images, mouth images, ear images.
Second, a contour line is set for each object content in the at least one object content to obtain an object image.
In order to display the object content accurately, the execution subject may set a contour line for each object content. The contour line is used to identify the boundary of the object content. That is, the object image is an image obtained by adding a contour line to the object content.
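A contour line could be added to recognized object content roughly as in the sketch below, assuming OpenCV and a binary mask of the content; the color and thickness values are illustrative.

```python
# A sketch only: OpenCV is assumed, together with a binary mask marking the
# recognized object content (face, mouth, or ear region) in the image.
import cv2
import numpy as np

def outline_object_content(image, mask, color=(0, 255, 0), thickness=2):
    """Draw the boundary of the object content onto a copy of the image."""
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    object_image = image.copy()
    cv2.drawContours(object_image, contours, -1, color, thickness)
    return object_image
```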
Step 302, determining a trigger time of the pronunciation information based on the time stamp for each pronunciation information included in the audio information.
In general, the video to be processed may include a time stamp. The time stamp may be used to mark the playing time of the video to be processed, and may also be used to mark the correspondence between the audio information and the image to be processed. The pronunciation information contained in the audio information generally occurs at a corresponding time. Accordingly, the execution subject can determine the trigger time of the pronunciation information according to the time stamp. The trigger time may be used to characterize the time at which the pronunciation information is played. For example, when the trigger time of the pronunciation information corresponds to a time stamp of 1 minute and 20 seconds, the pronunciation information is played at 1 minute and 20 seconds.
Step 303, in response to an object image being deformed within a set time period after the trigger time, marking the pronunciation information as target pronunciation information and marking the object image as a target object image.
In order to establish the correspondence between the pronunciation information and the object image, the execution subject may examine the correlation of the pronunciation information and the object image over time. In general, after the pronunciation information is played, the object image corresponding to the pronunciation information changes accordingly. For example, when a first person tells a joke and a second person hears it, the target pronunciation information is the audio of the joke, and the target object image may be a mouth image and an ear image of the second person. Although the ear image shows no obvious deformation, the audio first enters the ear, and the brain then understands the audio and controls the deformation of the mouth. Therefore, the ear image can also be regarded as a target object image.
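Steps 302 and 303 could be combined roughly as follows; the sketch assumes that each pronunciation information already carries a trigger time derived from the timestamp and that a helper `is_deformed` can report whether an object image visibly changes in a given window, both of which are hypothetical.

```python
# A sketch only. Each pronunciation information is assumed to carry a trigger
# time in seconds, and `is_deformed(object_id, t0, t1)` is a hypothetical helper
# reporting whether the object image visibly changes in the window [t0, t1].
SET_PERIOD = 2.0  # seconds to look for deformation after the trigger time

def find_target_pairs(pronunciations, object_ids, is_deformed):
    """Return (target pronunciation id, target object image id) pairs."""
    pairs = []
    for pron in pronunciations:          # e.g. {"id": "joke", "trigger_time": 80.0}
        start = pron["trigger_time"]
        for object_id in object_ids:     # e.g. ["second_person_mouth", "second_person_ear"]
            if is_deformed(object_id, start, start + SET_PERIOD):
                pairs.append((pron["id"], object_id))
    return pairs
```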
Step 304, establishing a corresponding relation between the target pronunciation information and the target object image.
After determining the target pronunciation information and the target object image, the execution subject may establish a correspondence between the target pronunciation information and the target object image. For example, when the target pronunciation information has been played for a set time (for example, 2 seconds), the target object image may be displayed.
In step 305, a target image corresponding to each target object image included in the image to be processed is set.
And step 306, in response to the triggering of the target pronunciation information in the audio information, displaying a target image corresponding to the target pronunciation information in the video to be processed.
The contents of steps 305 to 306 are the same as those of steps 203 to 204, and will not be described in detail here.
In the corresponding embodiment of fig. 3, the target object image corresponding to the target pronunciation information is determined by the timestamp of the video to be processed. Therefore, the corresponding relation between the target pronunciation information and the target object image is established, and the accuracy of displaying the target image when the target pronunciation information is triggered is improved.
With continued reference to fig. 4, fig. 4 illustrates a flow 400 of some embodiments of a method of displaying an image according to the present disclosure. The method for displaying the image comprises the following steps:
in step 401, audio information and an image to be processed are extracted from a video to be processed.
Step 402, establishing a correspondence between the target pronunciation information contained in the audio information and the target object image contained in the image to be processed.
Steps 401 to 402 are the same as steps 201 to 202, and will not be described in detail here.
Step 403, amplifying the target object image according to each magnification in a preset magnification sequence to obtain a target image sequence.
The execution subject may enlarge the target object image to obtain a target image corresponding to the target object image. When the target pronunciation information is played, the user sees the enlarged target image of the target object image. This improves the accuracy and effectiveness of information acquisition when the user views the video to be processed. Specifically, the execution subject may amplify the target object image according to each magnification in the preset magnification sequence to obtain a target image sequence. The magnifications in the magnification sequence may increase in order. Further, each magnification in the magnification sequence may correspond to a volume amplitude of the pronunciation information.
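A minimal sketch of step 403 follows, assuming OpenCV; the concrete magnification values are illustrative, the only requirement from the description above being that they increase in order.

```python
# A sketch only, assuming OpenCV; the magnification values are illustrative.
import cv2

MAGNIFICATIONS = [1.0, 1.2, 1.5, 1.8, 2.2]  # preset magnification sequence

def build_target_image_sequence(target_object_image):
    """Enlarge the target object image once per magnification."""
    return [cv2.resize(target_object_image, None, fx=m, fy=m,
                       interpolation=cv2.INTER_LINEAR)
            for m in MAGNIFICATIONS]
```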
Step 404, determining a target volume amplitude of the target pronunciation information, and selecting a corresponding target image from the target image sequence based on the target volume amplitude for display.
As can be seen from the above description, the target image sequence contains the target images obtained by enlarging the target object image according to each magnification in the preset magnification sequence. Based on this, the execution subject can establish a correspondence between each magnification in the magnification sequence and a volume amplitude of the pronunciation information. The execution subject can then determine the target volume amplitude of the target pronunciation information and select the target image corresponding to the target volume amplitude from the target image sequence for display. Thus, the correspondence between the target pronunciation information and the target object image is established through the volume amplitude. When the target volume amplitude is small, the displayed target image is small; when the target volume amplitude is large, the displayed target image is also large. That is, the size of the target image differs for different volume amplitudes, as does the visual experience of the user viewing the target image. In this way, the correspondence between the target pronunciation information and the target image can be displayed visually, improving the accuracy and effectiveness of information acquisition when the user views the video to be processed.
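The selection in step 404 could map the target volume amplitude onto the index of the target image sequence, for example linearly as in the sketch below; the amplitude range and the linear mapping are assumptions.

```python
# A sketch only: the amplitude range and the linear mapping from amplitude to
# magnification index are assumptions, not part of the disclosure.
import numpy as np

def select_target_image(target_image_sequence, target_volume_amplitude,
                        min_amplitude=0.0, max_amplitude=1.0):
    """Pick the target image whose magnification corresponds to the amplitude."""
    ratio = np.clip((target_volume_amplitude - min_amplitude)
                    / (max_amplitude - min_amplitude), 0.0, 1.0)
    index = int(round(ratio * (len(target_image_sequence) - 1)))
    return target_image_sequence[index]
```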
In some optional implementations of some embodiments, selecting a corresponding target image from the target image sequence for display based on the target volume amplitude may include the following steps:
First, setting an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold.
When the target volume amplitude is greater than the set volume threshold, the target volume amplitude can be considered to have exceeded the normal volume amplitude, producing a stronger auditory impression on the user. At this time, the execution subject may set an image axis and a rotation angle for the target image. The image axis may be a line through the target image. The rotation angle may be any angle; typically, the rotation angle may take a value between 10° and 30°. The rotation angle may be set based on the image axis. For example, the image axis may be one side of the rotation angle or the angular bisector of the rotation angle, depending on actual needs.
Second, generating a first image and a second image based on the target image.
The execution subject may generate the first image and the second image from the target image. Wherein the first image and the second image may be the same as the target image. The first image axis included in the first image and the second image axis included in the second image may correspond to the image axis of the target image, respectively.
Third, setting the first image axis of the first image to coincide with one side of the rotation angle, and setting the second image axis of the second image to coincide with the other side of the rotation angle.
The execution subject may make the first image axis and the second image axis coincide with the two sides of the rotation angle, respectively. In this way, the positional relationship and the angular relationship of the first image and the second image can be determined. Further, the first image and the second image may be arranged so that, when the two sides of the rotation angle coincide, the first image and the second image also coincide.
Fourth, alternately displaying the first image and the second image according to a set frequency.
The execution subject may set a frequency at which the first image and the second image are alternately displayed. Thus, when the target volume amplitude is greater than the set volume threshold, the target image can be made to show a dynamic effect. For example, the target pronunciation information is a joke, and the target image is an ear image. When the target volume amplitude of the target pronunciation information is greater than the set volume threshold, the ear images are alternately displayed to produce a vibration effect. This deepens the user's experience and improves the accuracy and effectiveness of the user's information acquisition.
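Taken together, the four steps above could be sketched as follows; the rotation is applied about an assumed axis through the image center, and the frequency and angle values are illustrative.

```python
# A sketch only: the image axis is taken through the image center, and the
# rotation angle and alternation frequency are illustrative values.
import cv2

ROTATION_ANGLE = 20.0   # degrees between the two sides of the rotation angle
SET_FREQUENCY = 8.0     # alternations per second

def rotated_copy(target_image, angle):
    height, width = target_image.shape[:2]
    matrix = cv2.getRotationMatrix2D((width / 2, height / 2), angle, 1.0)
    return cv2.warpAffine(target_image, matrix, (width, height))

def image_to_show(target_image, playback_time):
    """Return the first or the second image depending on the playback time."""
    first_image = target_image                                  # axis on one side of the angle
    second_image = rotated_copy(target_image, ROTATION_ANGLE)   # axis on the other side
    phase = int(playback_time * SET_FREQUENCY) % 2
    return first_image if phase == 0 else second_image
```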
In some optional implementations of some embodiments, the displaying, in the video to be processed, a target image corresponding to the target pronunciation information in response to the target pronunciation information in the audio information being triggered may include the steps of:
First, a target object real-time image corresponding to the target pronunciation information is acquired.
In practice, the object images in the video to be processed may change over time. In order to present the object image in real time, the execution subject may acquire the target object real-time image corresponding to the target pronunciation information. The execution subject may acquire the target object real-time image at set time intervals. That is, the target object real-time image may be an image of the target object within a set time period after the moment when the target pronunciation information is triggered. The real-time target image can then be understood as an image obtained by amplifying the target object real-time image in real time within the set time period. In contrast, the target image in fig. 3 is a relatively still image.
Second, a real-time target image of the target object real-time image is dynamically generated according to the target volume amplitude of the target pronunciation information.
After the target object real-time image is obtained, the execution subject can dynamically generate the real-time target image of the target object real-time image according to the target volume amplitude of the target pronunciation information. Therefore, the user can acquire the real-time target image, and timeliness and effectiveness of acquiring information by the user are improved.
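One possible reading of this real-time variant is sketched below: for every frame inside the set time period after the trigger time, the current volume amplitude drives the magnification of the live object region. The callables `get_frame`, `get_object_region`, and `get_volume_amplitude` are hypothetical placeholders for the corresponding operations.

```python
# A sketch only: `get_frame`, `get_object_region`, and `get_volume_amplitude`
# are hypothetical callables standing in for frame access, object cropping, and
# amplitude measurement; the mapping "louder -> larger" is an assumption.
import cv2

def realtime_target_images(get_frame, get_object_region, get_volume_amplitude,
                           trigger_time, set_period=2.0, fps=25):
    """Generate one enlarged target image per frame within the set period."""
    images = []
    for i in range(int(set_period * fps)):
        t = trigger_time + i / fps
        region = get_object_region(get_frame(t))   # target object real-time image
        scale = 1.0 + get_volume_amplitude(t)      # current amplitude drives the size
        images.append(cv2.resize(region, None, fx=scale, fy=scale))
    return images
```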
In some optional implementations of some embodiments, the displaying, in the video to be processed, a target image corresponding to the target pronunciation information in response to the target pronunciation information in the audio information being triggered may include the steps of:
First, determining a connection region between the target object image corresponding to the target image and the image to be processed.
The target image is an enlarged view of the target object image. The execution subject may determine a connection region between the target object image corresponding to the target image and the image to be processed. For example, if the target image is an ear image, the connection region may be an image region of a set size between the ear image and a head image in the image to be processed.
Second, carrying out transition processing on the connection region based on the image to be processed and the target image.
The execution subject may perform transition processing on the connection region based on the image to be processed and the target image. The transition processing includes at least one of the following: color transition, line transition, etc. In this way, the target image and the image to be processed are combined more naturally, the display effect of the image to be processed is improved, the viewing experience of the user is deepened, and the accuracy and effectiveness of the user's information acquisition are improved.
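The transition processing could, for instance, be approximated by a simple gradient blend across the connection region, as in the sketch below; this linear color blend stands in for the color/line transition and is not the only possible implementation.

```python
# A sketch only: a left-to-right linear blend between two equally sized color
# crops of the connection region, one from the image to be processed and one
# from the target image, stands in for the color/line transition.
import numpy as np

def blend_connection_region(base_region, target_region):
    """Blend two color regions of identical shape with a horizontal gradient."""
    height, width = base_region.shape[:2]
    alpha = np.linspace(0.0, 1.0, width).reshape(1, width, 1)  # 0 = base, 1 = target
    blended = (1.0 - alpha) * base_region + alpha * target_region
    return blended.astype(base_region.dtype)
```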
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides some embodiments of an apparatus for displaying images, which apparatus embodiments correspond to those shown in fig. 2, and which apparatus is particularly applicable in various electronic devices.
As shown in fig. 5, an apparatus 500 for displaying an image of some embodiments includes: an information extraction unit 501, a relationship establishment unit 502, a target object setting unit 503, and an image display unit 504. The information extraction unit 501 is configured to extract audio information and an image to be processed from a video to be processed, the audio information including at least one pronunciation information, and the image to be processed including an object image corresponding to pronunciation information in the at least one pronunciation information; the relationship establishing unit 502 is configured to establish a correspondence between the target pronunciation information included in the audio information and the target object image included in the image to be processed; the target object setting unit 503 is configured to set a target image corresponding to each target object image included in the image to be processed, the target image being an image obtained by performing image processing on the target object image; and the image display unit 504 is configured to display, in response to the target pronunciation information in the audio information being triggered, a target image corresponding to the target pronunciation information in the video to be processed.
In an alternative implementation of some embodiments, the information extraction unit 501 may include: an object recognition subunit (not shown) and an object image acquisition subunit (not shown). The object recognition subunit is configured to recognize at least one object content in the image to be processed, the object content including at least one of the following: facial images, mouth images, ear images; the object image acquisition subunit is configured to set, for each object content in the at least one object content, a contour line for the object content to obtain an object image.
In an alternative implementation of some embodiments, the video to be processed includes a timestamp, and the relationship establishing unit 502 may include: a trigger time determination subunit (not shown), a marking subunit (not shown), and a relationship establishment subunit (not shown). The trigger time determining subunit is configured to determine, for each pronunciation information contained in the audio information, the trigger time of the pronunciation information based on the timestamp; the marking subunit is configured to mark the pronunciation information as target pronunciation information and mark the object image as a target object image in response to the object image being deformed within a set period of time after the trigger time; and the relationship establishing subunit is configured to establish a correspondence between the target pronunciation information and the target object image.
In an alternative implementation of some embodiments, the target object setting unit 503 may include: a target object setting subunit (not shown in the figure) configured to amplify the target object image according to each magnification in a preset magnification sequence, so as to obtain a target image sequence, wherein the magnifications in the magnification sequence are increased in order.
In an alternative implementation manner of some embodiments, each magnification in the magnification sequence corresponds to a volume amplitude of the pronunciation information, and the image display unit 504 may include: a first image display subunit (not shown in the figure) configured to determine a target volume amplitude of the target pronunciation information, and select a corresponding target image from the target image sequence for display based on the target volume amplitude.
In an alternative implementation of some embodiments, the first image display subunit may include: a setting module (not shown in the figure), an image generating module (not shown in the figure), a position setting module (not shown in the figure), and a display module (not shown in the figure). The setting module is configured to set an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold; the image generation module is configured to generate a first image and a second image based on the target image, the first image and the second image being identical to the target image, and a first image axis included in the first image and a second image axis included in the second image corresponding to the image axis of the target image, respectively; the position setting module is configured to set the first image axis of the first image to coincide with one side of the rotation angle, and the second image axis of the second image to coincide with the other side of the rotation angle; and the display module is configured to alternately display the first image or the second image at a set frequency.
In an alternative implementation of some embodiments, the image display unit 504 may include: an image acquisition subunit (not shown) and a second image display subunit (not shown). The image acquisition subunit is configured to acquire a target object real-time image corresponding to the target pronunciation information, wherein the target object real-time image is an object image in a set time period after the moment when the target pronunciation information is triggered; and the second image display subunit is configured to dynamically generate a real-time target image of the target object real-time image according to the target volume amplitude of the target pronunciation information.
In an alternative implementation of some embodiments, the image display unit 504 may include: a connection region determining subunit (not shown in the figure) and a transition processing subunit (not shown in the figure). The connection region determining subunit is configured to determine a connection region between the target object image corresponding to the target image and the image to be processed; the transition processing subunit is configured to perform transition processing on the connection region based on the image to be processed and the target image, the transition processing including at least one of: color transition, line transition.
It will be appreciated that the elements described in the apparatus 500 correspond to the various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting benefits described above with respect to the method are equally applicable to the apparatus 500 and the units contained therein, and are not described in detail herein.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 6 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 609, or from storage device 608, or from ROM 602. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
It should be noted that, in some embodiments of the present disclosure, the computer readable medium may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extracting audio information and an image to be processed from a video to be processed, wherein the audio information comprises at least one pronunciation information, and the image to be processed comprises an object image corresponding to the pronunciation information in the at least one pronunciation information; establishing a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; setting a target image corresponding to each target object image contained in the image to be processed, wherein the target image is an image obtained by performing image processing on the target object image; and responding to the triggering of the target pronunciation information in the audio information, and displaying a target image corresponding to the target pronunciation information in the video to be processed.
Computer program code for carrying out operations for some embodiments of the present disclosure may be written in one or more programming languages, or combinations thereof, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, for example, described as: a processor includes an information extraction unit, a relationship establishment unit, a target object setting unit, and an image display unit. The names of these units do not constitute limitations on the unit itself in some cases, and for example, the image display unit may also be described as "a unit for displaying a target image".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
According to one or more embodiments of the present disclosure, there is provided a method of displaying an image, including: extracting audio information and an image to be processed from a video to be processed, wherein the audio information comprises at least one pronunciation information, and the image to be processed comprises an object image corresponding to the pronunciation information in the at least one pronunciation information; establishing a corresponding relation between target pronunciation information contained in the audio information and a target object image contained in the image to be processed; setting a target image corresponding to each target object image contained in the image to be processed, wherein the target image is an image obtained by performing image processing on the target object image; and responding to the triggering of the target pronunciation information in the audio information, and displaying a target image corresponding to the target pronunciation information in the video to be processed.
According to one or more embodiments of the present disclosure, the extracting audio information and the image to be processed from the video to be processed includes: identifying at least one object content in the image to be processed, wherein the object content comprises at least one of the following items: facial images, mouth images, ear images; and setting an outline for the object content in the at least one object content to obtain an object image.
According to one or more embodiments of the present disclosure, the video to be processed includes a timestamp, and the establishing a correspondence between the target pronunciation information included in the audio information and the target object image included in the image to be processed includes: determining, for each pronunciation information included in the audio information, a trigger time of the pronunciation information based on the timestamp; in response to the object image being deformed within a set time period after the trigger time, marking the pronunciation information as target pronunciation information and marking the object image as a target object image; and establishing a correspondence between the target pronunciation information and the target object image.
According to one or more embodiments of the present disclosure, the setting a target image corresponding to each target object image included in the image to be processed includes: amplifying the target object image according to each amplification factor in a preset amplification factor sequence to obtain a target image sequence, wherein the amplification factors in the amplification factor sequence are increased in sequence.
According to one or more embodiments of the present disclosure, each magnification in the magnification sequence corresponds to a volume amplitude of the pronunciation information, and the displaying, in the video to be processed, a target image corresponding to the target pronunciation information in response to the target pronunciation information in the audio information being triggered includes: determining a target volume amplitude of the target pronunciation information, and selecting a corresponding target image from the target image sequence for display based on the target volume amplitude.
According to one or more embodiments of the present disclosure, the selecting, for display, a corresponding target image from the target image sequence based on the target volume amplitude includes: setting an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold; generating a first image and a second image based on the target image, wherein the first image and the second image are identical to the target image, and a first image axis contained in the first image and a second image axis contained in the second image respectively correspond to the image axis of the target image; setting the first image axis of the first image to coincide with one side of the rotation angle, and setting the second image axis of the second image to coincide with the other side of the rotation angle; and alternately displaying the first image or the second image according to a set frequency.
According to one or more embodiments of the present disclosure, the displaying, in the video to be processed, a target image corresponding to target pronunciation information in response to the target pronunciation information in the audio information being triggered includes: acquiring a target object real-time image corresponding to the target pronunciation information, wherein the target object real-time image is an object image in a set time period after the moment when the target pronunciation information is triggered; and dynamically generating a real-time target image of the target object real-time image according to the target volume amplitude of the target pronunciation information.
According to one or more embodiments of the present disclosure, the displaying, in the video to be processed, a target image corresponding to target pronunciation information in response to the target pronunciation information in the audio information being triggered includes: determining a connection area between the target object image corresponding to the target image and the image to be processed; performing transition processing on the connection region based on the image to be processed and the target image, wherein the transition processing comprises at least one of the following steps: color transition, line transition.
According to one or more embodiments of the present disclosure, there is provided an apparatus for displaying an image, including: an information extraction unit configured to extract, from a video to be processed, audio information and an image to be processed, the audio information including at least one pronunciation information, and the image to be processed including an object image corresponding to pronunciation information in the at least one pronunciation information; a relationship establishing unit configured to establish a correspondence between the target pronunciation information included in the audio information and the target object image included in the image to be processed; a target object setting unit configured to set a target image corresponding to each target object image included in the image to be processed, the target image being an image obtained by performing image processing on the target object image; and an image display unit configured to display a target image corresponding to the target pronunciation information in the video to be processed in response to the target pronunciation information in the audio information being triggered.
According to one or more embodiments of the present disclosure, the above information extraction unit includes: an object recognition subunit configured to recognize at least one object content in the image to be processed, the object content including at least one of: facial images, mouth images, ear images; and an object image acquisition subunit configured to set, for each object content in the at least one object content, a contour line for the object content to obtain an object image.
According to one or more embodiments of the present disclosure, the video to be processed includes a time stamp, and the relationship establishing unit includes: a trigger time determining subunit configured to determine, for each pronunciation information included in the audio information, a trigger time of the pronunciation information based on the time stamp; a marking subunit configured to mark the pronunciation information as target pronunciation information and mark the object image as a target object image in response to the object image being deformed within a set period of time after the trigger time; and a relationship establishing subunit configured to establish a correspondence between the target pronunciation information and the target object image.
According to one or more embodiments of the present disclosure, the above-described target object setting unit includes: and the target object setting subunit is configured to amplify the target object image according to each amplification factor in a preset amplification factor sequence to obtain a target image sequence, wherein the amplification factors in the amplification factor sequence are increased in sequence.
According to one or more embodiments of the present disclosure, each of the amplification factors in the amplification factor sequence corresponds to a volume amplitude of the pronunciation information, and the image display unit includes: and the first image display subunit is configured to determine a target volume amplitude value of the target pronunciation information, and select a corresponding target image from the target image sequence to display based on the target volume amplitude value.
According to one or more embodiments of the present disclosure, the first image display subunit includes: a setting module configured to set an image axis and a rotation angle for the target image in response to the target volume amplitude being greater than a set volume threshold; an image generation module configured to generate a first image and a second image based on the target image, the first image and the second image being identical to the target image, a first image axis included in the first image and a second image axis included in the second image corresponding to the image axis of the target image, respectively; a position setting module configured to set the first image axis of the first image to coincide with one side of the rotation angle, and the second image axis of the second image to coincide with the other side of the rotation angle; and a display module configured to alternately display the first image or the second image at a set frequency.
According to one or more embodiments of the present disclosure, the above-described image display unit includes: an image acquisition subunit configured to acquire a real-time image of a target object corresponding to the target pronunciation information, where the real-time image of the target object is an object image within a set time period after a time when the target pronunciation information is triggered; and the second image display subunit is configured to dynamically generate a real-time target image of the target object real-time image according to the target volume amplitude of the target pronunciation information.
According to one or more embodiments of the present disclosure, the above-described image display unit includes: a connection region determining subunit configured to determine a connection region between the target object image corresponding to the target image and the image to be processed; and a transition processing subunit configured to perform transition processing on the connection region based on the image to be processed and the target image, the transition processing including at least one of: color transition, line transition.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.