Disclosure of Invention
The embodiments of the invention provide a video content comparison method and device based on perception and aberration, aiming to solve the problem that conventional image comparison methods cannot meet the requirements of low computational complexity and high accuracy at the same time.
In one aspect, an embodiment of the present invention provides a method for comparing video content based on perception and aberration, where the method includes: acquiring a source video image sequence and a target video image sequence; extracting perceptual hash features and aberration features of the source video image sequence and the target video image sequence; calculating an alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence; aligning the source video image sequence and the target video image sequence according to the alignment offset; judging whether the aligned source video image sequence and the aligned target video image sequence are similarly matched according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence; if the aligned source video image sequence and the aligned target video image sequence are similarly matched, judging whether the similarly matched source video image sequence and target video image sequence are accurately matched according to the aberration features of the source video image sequence and the aberration features of the target video image sequence; and outputting a comparison result generated based on the judgment result.
With reference to the aspect, in a first possible implementation manner, the extracting the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence includes: extracting the gray value of the source video image sequence and the gray value of the target video image sequence to generate a source video grayscale image sequence and a target video grayscale image sequence; scaling the source video grayscale image sequence and the target video grayscale image sequence; performing a discrete cosine transform on the scaled source video grayscale image sequence and the scaled target video grayscale image sequence; respectively taking the coefficients of the upper-left corner blocks of the source video grayscale image sequence and the target video grayscale image sequence after the discrete cosine transform; calculating a coefficient mean of the block coefficients; judging the magnitude relation between each coefficient value in the block coefficients and the coefficient mean, marking the coefficient as 1 if the coefficient value is greater than or equal to the coefficient mean, and marking the coefficient as 0 if the coefficient value is less than the coefficient mean; and generating the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence according to the marking result.
With reference to the aspect, in a second possible implementation manner, the extracting the aberration feature of the source video image sequence and the aberration feature of the target video image sequence includes: extracting the gray value of the source video image sequence and the gray value of the target video image sequence to generate a source video grayscale image sequence and a target video grayscale image sequence; scaling the source video grayscale image sequence and the target video grayscale image sequence; dividing the scaled source video grayscale image sequence and the scaled target video grayscale image sequence into N pixel blocks; calculating the block pixel mean of all pixel points in each of the N pixel blocks; calculating the aberration between the gray value of each pixel point and the block pixel mean, marking the pixel point as 1 if the aberration is greater than or equal to 0, and marking the pixel point as 0 if the aberration is less than 0; and generating N block aberration features of the source video image sequence and N block aberration features of the target video image sequence according to the marking result.
With reference to the second possible implementation manner, in a third possible implementation manner, the judging whether the similarly matched source video image sequence and target video image sequence are accurately matched according to the aberration feature of the source video image sequence and the aberration feature of the target video image sequence includes: generating a matching result between the pixel blocks of the source video image sequence and the pixel blocks of the target video image sequence according to the N block aberration features of the source video image sequence and the N block aberration features of the target video image sequence; calculating the ratio of matched pixel blocks to total pixel blocks according to the matching result; calculating the average matching block percentage of the source video image sequence and the target video image sequence according to the ratio; and judging whether the similarly matched source video image sequence and target video image sequence are accurately matched according to the magnitude relation between the average matching block percentage and a preset matching block percentage.
With reference to the third possible implementation manner, in a fourth possible implementation manner, the judging whether the similarly matched source video image sequence and target video image sequence are accurately matched according to the magnitude relation between the average matching block percentage and the preset matching block percentage includes: if the average matching block percentage is less than the preset matching block percentage, the similarly matched source video image sequence and target video image sequence are not accurately matched; and if the average matching block percentage is greater than or equal to the preset matching block percentage, the similarly matched source video image sequence and target video image sequence are accurately matched.
With reference to the aspect, in a fifth possible implementation manner, the calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence includes: selecting a plurality of preselected alignment offsets; aligning the source video image sequence and the target video image sequence according to each preselected alignment offset; calculating the average Hamming distance between the aligned source video image sequence and the aligned target video image sequence according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence; and selecting the preselected alignment offset with the minimum average Hamming distance as the alignment offset.
With reference to the aspect, in a sixth possible implementation manner, the judging, according to the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence, whether the aligned source video image sequence and the aligned target video image sequence are similarly matched includes: calculating the average Hamming distance between the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence; and judging whether the aligned source video image sequence and the aligned target video image sequence are similarly matched according to the magnitude relation between the average Hamming distance and a preset threshold.
With reference to the sixth possible implementation manner, in a seventh possible implementation manner, the judging whether the aligned source video image sequence and the aligned target video image sequence are similarly matched according to the magnitude relation between the average Hamming distance and the preset threshold includes: if the average Hamming distance is greater than the preset threshold, the aligned source video image sequence and the aligned target video image sequence are not similarly matched; and if the average Hamming distance is less than or equal to the preset threshold, the aligned source video image sequence and the aligned target video image sequence are similarly matched.
With reference to the aspect, in an eighth possible implementation manner, calculating alignment offsets of the source video image sequence and the target video image sequence according to the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence further includes: selecting the source video image sequence and the target video image sequence within a preset time length; and calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence within a preset time length.
In a second aspect of the embodiments of the present disclosure, there is provided a video content comparison apparatus based on perception and aberration, including:
the image sequence acquisition unit is used for acquiring a source video image sequence and a target video image sequence;
the extraction unit is used for extracting the perceptual hash characteristics and the aberration characteristics of the source video image sequence and the target video image sequence;
the computing unit is used for computing the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence;
an alignment unit, configured to align the source video image sequence and the target video image sequence according to the alignment offset;
the first judgment unit is used for judging whether the aligned source video image sequence and the aligned target video image sequence are similar to each other or not according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence;
a second judging unit, configured to, if the aligned source video image sequence and the aligned target video image sequence are similar, judge whether the source video image sequence and the target video image sequence that are similar to each other are accurately matched according to an aberration feature of the source video image sequence and an aberration feature of the target video image sequence;
and the result output unit is used for outputting a comparison result generated based on the judgment result.
In a third aspect of the embodiments of the present disclosure, a terminal is provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a source video image sequence and a target video image sequence;
extracting perceptual hash characteristics and aberration characteristics of the source video image sequence and the target video image sequence;
calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence;
aligning the source video image sequence and the target video image sequence according to the alignment offset;
judging whether the aligned source video image sequence and the aligned target video image sequence are similar to each other or not according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence;
if the aligned source video image sequence and the aligned target video image sequence are matched similarly, judging whether the source video image sequence and the target video image sequence which are matched similarly are matched accurately or not according to the aberration characteristics of the source video image sequence and the aberration characteristics of the target video image sequence;
and outputting a comparison result generated based on the judgment result.
It can be seen from the above embodiments that a source video image sequence and a target video image sequence are obtained; extracting perceptual hash characteristics and aberration characteristics of the source video image sequence and the target video image sequence; calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence; aligning the source video image sequence and the target video image sequence according to the alignment offset; judging whether the aligned source video image sequence and the aligned target video image sequence are similar to each other or not according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence; if the aligned source video image sequence and the aligned target video image sequence are matched similarly, judging whether the source video image sequence and the target video image sequence which are matched similarly are matched accurately or not according to the aberration characteristics of the source video image sequence and the aberration characteristics of the target video image sequence; and outputting a comparison result generated based on the judgment result. Compared with the prior art, the embodiment can simultaneously meet the requirements of low computation complexity and high accuracy of video content comparison by combining the perceptual hashing and the aberration algorithm and adopting a step matching judgment method for the whole image.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, which is a flow chart of a video content comparison method based on perception and aberration according to an embodiment of the present invention, the method includes the following steps:
Step 101: acquire a source video image sequence and a target video image sequence.
Video generally refers to various techniques for capturing, recording, processing, storing, transmitting, and reproducing a series of still images as electrical signals. When successive images change at more than 24 frames per second, the human eye, by the persistence-of-vision principle, cannot distinguish an individual still picture; the sequence instead appears as a smooth, continuous visual effect, and such continuous pictures are called a video. In the embodiment of the present application, a source video image sequence refers to an initial sample video segment before transmission of a broadcast television signal, a target video image sequence refers to a target video segment after transmission of the broadcast television signal, and both the source video image sequence and the target video image sequence include a plurality of video frames. By comparing each single video frame in the source video image sequence with the corresponding single video frame in the target video image sequence, the embodiment judges whether the target video image sequence obtained by transmitting the source video image sequence suffers content errors or quality damage, that is, distortion. The source video image sequence may be obtained by capturing and decoding the source video, and the target video image sequence by capturing and decoding the target video.
Step 102: extract the perceptual hash features and the aberration features of the source video image sequence and the target video image sequence.
In this embodiment, specifically, referring to fig. 2, the extracting the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence includes the following steps:
Step 201: extract the gray value of the source video image sequence and the gray value of the target video image sequence to generate a source video grayscale image sequence and a target video grayscale image sequence.
Converting a color image to grayscale discards the color information of the image and expresses its brightness information as gray levels. Each pixel of a color image occupies 3 bytes; after the conversion each pixel occupies one byte, whose gray value is the brightness of the corresponding color image pixel. The gray values of the source video image sequence and the gray values of the target video image sequence can be extracted with an image processing tool, generating the grayscale image frames of the source video image sequence and the grayscale image frames of the target video image sequence.
Step 202: scale the source video grayscale image sequence and the target video grayscale image sequence. Before extracting the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence, all video frames in both sequences may be scaled, in order to facilitate the comparison and reduce the calculation amount of the subsequent comparison process. A grayscale image contains regions of large brightness change, such as object edges, called high-frequency regions, and regions of small brightness change, called low-frequency regions. Scaling a grayscale image down is a process of discarding high-frequency information; here, the image frames of the source video grayscale image sequence and of the target video grayscale image sequence are scaled to a 32 × 32 resolution.
Step 203: perform a discrete cosine transform on the scaled source video grayscale image sequence and the scaled target video grayscale image sequence. After the discrete cosine transform of an image, the energy of the transform coefficients is concentrated mainly in the upper-left corner, and most of the remaining coefficients are close to zero; the transform thus concentrates the image information and makes the transformed image convenient for subsequent processing.
Step 204: take the coefficients of the upper-left corner block of the source video grayscale image sequence and of the target video grayscale image sequence after the discrete cosine transform. The discrete cosine transform processes the two-dimensional pixel array, that is, it performs a space-frequency transform of the image; the image information after the transform is concentrated in the upper-left corner, so the 8 × 8 block of coefficients at the upper-left corner of the transformed image may be taken, with the coefficient at coordinate [0, 0] (the DC coefficient) set to 0.
Step 205: calculate the coefficient mean of the block coefficients. For example, the coefficient mean of all the coefficients in the 8 × 8 block of coefficients is calculated and denoted as A.
Step 206: judge the magnitude relation between each coefficient value in the block coefficients and the coefficient mean: if the coefficient value X of a coefficient is greater than or equal to the coefficient mean A, mark the coefficient as 1; if X is less than A, mark the coefficient as 0.
Step 207: generate the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence according to the marking result. Specifically, the marks are arranged in the order of the 8 × 8 block coefficients to form a 64-bit value, and this 64-bit value is used as the perceptual hash feature of the corresponding frame of the source video image sequence or of the target video image sequence.
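As an illustrative aid (not part of the claimed embodiment), steps 201 to 207 can be sketched in Python, assuming OpenCV and SciPy are available; the function name and the BGR frame format are assumptions:

```python
import cv2
import numpy as np
from scipy.fft import dctn

def perceptual_hash(frame_bgr):
    # Step 201: extract the gray values of the color frame.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Step 202: scale the grayscale frame to 32 x 32, discarding high-frequency detail.
    small = cv2.resize(gray, (32, 32), interpolation=cv2.INTER_AREA)
    # Step 203: 2-D discrete cosine transform.
    coeffs = dctn(small.astype(np.float64), norm='ortho')
    # Step 204: take the 8 x 8 coefficient block at the upper-left corner and
    # zero the coefficient at [0, 0].
    block = coeffs[:8, :8].copy()
    block[0, 0] = 0.0
    # Step 205: coefficient mean A of the block coefficients.
    a = block.mean()
    # Steps 206-207: mark each coefficient 1 if >= A, else 0, and pack the 64
    # marks in block order into a 64-bit perceptual hash value.
    bits = (block >= a).flatten()
    return sum(int(b) << i for i, b in enumerate(bits))
```

Hashes built this way can be compared by Hamming distance, as used in the alignment and similarity steps described later.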
In this embodiment, specifically, referring to fig. 3, the step of extracting the aberration feature of the source video image sequence and the aberration feature of the target video image sequence includes the following steps:
Step 301: extract the gray value of the source video image sequence and the gray value of the target video image sequence to generate a source video grayscale image sequence and a target video grayscale image sequence.
Step 302: scale the source video grayscale image sequence and the target video grayscale image sequence. Both the image frames of the source video image sequence and the image frames of the target video image sequence may be scaled to a 176 × 144 resolution.
Step 303: divide the scaled source video grayscale image sequence and the scaled target video grayscale image sequence into N pixel blocks. Specifically, the source video grayscale image sequence and the target video grayscale image sequence may be divided into 8 × 8 pixel blocks, so that each image frame of the source video grayscale image sequence and each image frame of the target video grayscale image sequence has (176/8) × (144/8) = 396 blocks.
Step 304: calculate the block pixel mean of all pixel points in each of the N pixel blocks. The block pixel means of all pixel points in the 396 pixel blocks can be calculated, yielding 396 block pixel means A1, A2, …, A396.
Step 305: calculate the aberration between the gray value of each pixel point and the block pixel mean; if the aberration is greater than or equal to 0, mark the pixel point as 1, and if the aberration is less than 0, mark the pixel point as 0.
Step 306: generate the N block aberration features of the source video image sequence and the N block aberration features of the target video image sequence according to the marking result. Finally, each pixel block yields a 64-bit value, so that 396 block aberration features are obtained for each frame.
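Under the same assumptions (OpenCV available; names illustrative), steps 301 to 306 can be sketched as:

```python
import cv2
import numpy as np

def block_aberration_features(frame_bgr):
    # Steps 301-302: grayscale conversion and scaling to 176 x 144.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (176, 144),
                       interpolation=cv2.INTER_AREA).astype(np.float64)
    features = []
    # Step 303: divide the frame into 8 x 8 pixel blocks,
    # (176/8) x (144/8) = 396 blocks in total.
    for y in range(0, 144, 8):
        for x in range(0, 176, 8):
            block = small[y:y + 8, x:x + 8]
            # Step 304: block pixel mean A_k over the 64 pixels of the block.
            a_k = block.mean()
            # Steps 305-306: mark a pixel 1 if (gray value - A_k) >= 0, else 0,
            # yielding one 64-bit aberration feature per block.
            bits = (block >= a_k).flatten()
            features.append(sum(int(b) << i for i, b in enumerate(bits)))
    return features  # 396 features of 64 bits each
```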
The perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence are obtained according to steps 201 to 207, and the aberration feature of the source video image sequence and the aberration feature of the target video image sequence are obtained according to steps 301 to 306.
Step 103: calculate the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence.
Step 104: align the source video image sequence and the target video image sequence according to the alignment offset.
Fig. 4 shows an example of aligning the source video image sequence and the target video image sequence according to the alignment offset. First, a plurality of preselected alignment offsets O_1, O_2, …, O_m are selected, and the source video image sequence and the target video image sequence are respectively aligned according to each preselected alignment offset. After the alignment, the average Hamming distance between the aligned source video image sequence and the aligned target video image sequence is calculated according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence. For an offset O, the average Hamming distance may be expressed as:

D(O) = (1/N) × Σ_{i=1..N} DH(H_i, H'_{i+O})

where H_i and H'_{i+O} are the perceptual hash features of the i-th frame of the source video image sequence and of the (i+O)-th frame of the target video image sequence, respectively; DH is the Hamming distance between the perceptual hash features of an aligned frame pair, that is, the number of differing bits between the two 64-bit features; and N is the number of aligned image frames in the alignment window. In this embodiment, the average Hamming distance may also be calculated over an alignment window of a certain time length: for example, the video frames within a 5-second window may be intercepted, and if the window contains M frames of images, then N = M − O. This finally yields the average Hamming distances D(O_1), D(O_2), …, D(O_m) of the perceptual hash features under the preselected alignment offsets. The offset with the minimum average Hamming distance and an aligned frame count N ≥ 10 is taken as the video synchronization alignment offset, denoted O_o. The source video image sequence and the target video image sequence are then aligned according to the obtained offset O_o.
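A minimal sketch of this offset search, assuming the 64-bit perceptual hashes of the windowed source and target frames are already available as Python integers and that the candidate offsets are non-negative (all names are illustrative):

```python
def hamming64(a, b):
    # Hamming distance: number of differing bits between two 64-bit features.
    return bin(a ^ b).count('1')

def find_alignment_offset(source_hashes, target_hashes, candidate_offsets,
                          min_aligned_frames=10):
    best_offset, best_distance = None, float('inf')
    for o in candidate_offsets:
        # Number of aligned frame pairs under offset o
        # (N = M - o for an M-frame window).
        n = min(len(source_hashes), len(target_hashes) - o)
        if n < min_aligned_frames:  # require at least 10 aligned frames
            continue
        avg = sum(hamming64(source_hashes[i], target_hashes[i + o])
                  for i in range(n)) / n
        if avg < best_distance:
            best_offset, best_distance = o, avg
    return best_offset, best_distance
```

Restricting source_hashes and target_hashes to the frames of a preset time window implements the variant described next.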
Calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash feature of the source video image sequence and the perceptual hash feature of the target video image sequence may further include: selecting the source video image sequence and the target video image sequence within a preset time length; and calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence within the preset time length. By narrowing the comparison range in this way, an accurate alignment offset can be obtained while the calculation amount is reduced.
Step 105: judge whether the aligned source video image sequence and the aligned target video image sequence are similarly matched according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence. Specifically, judging whether the aligned source video image sequence and the aligned target video image sequence are similarly matched according to the perceptual hash features includes the following steps, as shown in fig. 5:
step 501: and calculating the average Hamming distance of the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence.
Step 502: judge whether the aligned source video image sequence and the aligned target video image sequence are similarly matched according to the magnitude relation between the average Hamming distance and a preset threshold. For example, a preset threshold T1 is selected; if the calculated average Hamming distance of the perceptual hash features is greater than the preset threshold T1, it is judged that the source video image sequence and the target video image sequence in the window are not similarly matched. If the calculated average Hamming distance is less than or equal to the preset threshold T1, it is judged that the source video image sequence and the target video image sequence in the window are similarly matched, and the accurate matching judgment is then further performed according to the aberration features.
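A hedged sketch of this similarity decision, reusing the hamming64 helper from the offset-search sketch above (the threshold T1 and the pair list are assumed inputs):

```python
def is_similar_match(aligned_hash_pairs, t1):
    # aligned_hash_pairs: list of (source_hash, target_hash) tuples for the
    # aligned frames within the window.
    avg = (sum(hamming64(hs, ht) for hs, ht in aligned_hash_pairs)
           / len(aligned_hash_pairs))
    # Similar match when the average Hamming distance is <= the threshold T1.
    return avg <= t1
```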
Step 106: if the aligned source video image sequence and the aligned target video image sequence are similarly matched, judge whether the similarly matched source video image sequence and target video image sequence are accurately matched according to the aberration features of the source video image sequence and the aberration features of the target video image sequence. That is, if the aligned sequences are judged to be similarly matched according to the perceptual hash features, whether they are accurately matched is further judged according to the aberration features. Specifically, judging whether the similarly matched source video image sequence and target video image sequence are accurately matched according to the aberration features includes the following steps, as shown in fig. 6:
step 601: and generating a matching result between the pixel blocks of the source video image sequence and the pixel blocks of the target video image sequence after similar matching according to the N block aberration characteristics of the source video image sequence and the N block aberration characteristics of the target video image sequence. For example, for the 64-bit aberration feature of each 8 × 8 pixel block, a hamming distance (the number of different bits in 2 groups of 64 bits) is calculated, and if the hamming distance is greater than a second preset threshold T2, it can be determined that the corresponding block contents do not match, and if the hamming distance is less than or equal to a second preset threshold T2, it can be determined that the corresponding block contents match.
Step 602: calculate the ratio of matched pixel blocks to the total number of pixel blocks according to the matching results of the N pixel blocks. From the number Y of matched pixel blocks obtained in step 601, the percentage of content-matched pixel blocks in the N pixel blocks of each frame image can be calculated. For example, if the number Y of matched pixel blocks is 360 and the total number of blocks N is 396, the percentage of matched pixel blocks is 360/396 ≈ 91%.
Step 603: calculate the average matching block percentage of the source video image sequence and the target video image sequence according to the ratio. The average matching block percentage Y1 is the mean of the matching block percentages of all the frame images.
Step 604: judge whether the similarly matched source video image sequence and target video image sequence are accurately matched according to the magnitude relation between the average matching block percentage and the preset matching block percentage. A matching block percentage threshold T3 is preset; if Y1 is less than the threshold T3, it is judged that the similarly matched source video image sequence and target video image sequence are not accurately matched, and if Y1 is greater than or equal to the threshold T3, it is judged that they are accurately matched.
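Steps 601 to 604 may likewise be sketched as follows, again reusing hamming64; T2 and T3 stand for the second preset threshold and the preset matching block percentage, and all names are assumptions for illustration:

```python
def is_exact_match(source_block_feats, target_block_feats, t2, t3):
    # source_block_feats / target_block_feats: per frame, the list of 396
    # 64-bit block aberration features.
    frame_percentages = []
    for src_blocks, tgt_blocks in zip(source_block_feats, target_block_feats):
        # Step 601: a block pair matches when its Hamming distance is <= T2.
        matched = sum(1 for s, t in zip(src_blocks, tgt_blocks)
                      if hamming64(s, t) <= t2)
        # Step 602: ratio of matched blocks to the total blocks of the frame.
        frame_percentages.append(matched / len(src_blocks))
    # Step 603: average matching block percentage Y1 over all frames.
    y1 = sum(frame_percentages) / len(frame_percentages)
    # Step 604: exact match when Y1 >= the preset percentage T3.
    return y1 >= t3
```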
It can be seen from the above embodiments that a source video image sequence and a target video image sequence are obtained; extracting perceptual hash characteristics and aberration characteristics of the source video image sequence and the target video image sequence; calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence; aligning the source video image sequence and the target video image sequence according to the alignment offset; judging whether the aligned source video image sequence and the aligned target video image sequence are similar to each other or not according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence; if the aligned source video image sequence and the aligned target video image sequence are matched similarly, judging whether the source video image sequence and the target video image sequence which are matched similarly are matched accurately or not according to the aberration characteristics of the source video image sequence and the aberration characteristics of the target video image sequence; and outputting a comparison result generated based on the judgment result. Compared with the prior art, the embodiment can simultaneously meet the requirements of low computation complexity and high accuracy of video content comparison by combining the perceptual hashing and the aberration algorithm and adopting a step matching judgment method for the whole image.
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
In addition, as an implementation of the foregoing embodiments, the present disclosure further provides a video content comparison apparatus based on perception and aberration, where the apparatus is located in a terminal, as shown in fig. 7, and the apparatus includes: an image sequence acquisition unit 10, an extraction unit 20, a calculation unit 30, an alignment unit 40, a first judgment unit 50, a second judgment unit 60, and a result output unit 70, wherein:
the image sequence acquisition unit 10 is configured to acquire a source video image sequence and a target video image sequence;
the extraction unit 20 is configured to extract perceptual hash features and aberration features of the source video image sequence and the target video image sequence;
the calculating unit 30 is configured to calculate alignment offsets of the source video image sequence and the target video image sequence according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence;
the alignment unit 40 is configured to align the sequence of source video images and the sequence of target video images according to the alignment offset;
the first judging unit 50 is configured to judge whether the aligned source video image sequence and the target video image sequence are similar to each other according to the perceptual hash features of the source video image sequence and the perceptual hash features of the target video image sequence;
the second determining unit 60 is configured to determine whether the source video image sequence and the target video image sequence after similar matching are exactly matched according to the aberration characteristics of the source video image sequence and the aberration characteristics of the target video image sequence if the source video image sequence and the target video image sequence after alignment are similar;
the result output unit 70 is configured to output the comparison result generated based on the determination result.
The video content comparison device based on perception and aberration provided by the embodiment of the disclosure obtains a source video image sequence and a target video image sequence; extracting perceptual hash characteristics and aberration characteristics of the source video image sequence and the target video image sequence; calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence; aligning the source video image sequence and the target video image sequence according to the alignment offset; judging whether the aligned source video image sequence and the aligned target video image sequence are similar to each other or not according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence; if the aligned source video image sequence and the aligned target video image sequence are matched similarly, judging whether the source video image sequence and the target video image sequence which are matched similarly are matched accurately or not according to the aberration characteristics of the source video image sequence and the aberration characteristics of the target video image sequence; and outputting a comparison result generated based on the judgment result. The embodiment of the disclosure combines perceptual hashing and an aberration algorithm, and adopts a step matching judgment method, so that the requirements of low computation complexity and high accuracy of video content comparison can be met simultaneously.
Fig. 8 is a block diagram illustrating a terminal 800 according to an example embodiment. For example, the terminal 800 can be a messaging device, a tablet device, a computer, and the like. Referring to fig. 8, the terminal 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input-output interface 812, and a communication component 816.
The processing component 802 generally controls the overall operation of the terminal 800, such as operations associated with display, data communication, and recording. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the terminal 800. Power components 806 provide power to the various components of terminal 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for terminal 800. The multimedia component 808 includes a screen providing an output interface between the terminal 800 and the user. The audio component 810 is configured to output and/or input audio signals. The input-output interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
Communication component 816 is configured to facilitate communications between terminal 800 and other devices in a wired or wireless manner. The terminal 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the terminal 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium is also provided, in which instructions, when executed by a processor of the terminal 800, enable the terminal 800 to perform the above video content comparison method based on perception and aberration, the method comprising:
acquiring a source video image sequence and a target video image sequence;
extracting perceptual hash characteristics and aberration characteristics of the source video image sequence and the target video image sequence;
calculating the alignment offset of the source video image sequence and the target video image sequence according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence;
aligning the source video image sequence and the target video image sequence according to the alignment offset;
judging whether the aligned source video image sequence and the aligned target video image sequence are similar to each other or not according to the perceptual hash characteristics of the source video image sequence and the perceptual hash characteristics of the target video image sequence;
if the aligned source video image sequence and the aligned target video image sequence are matched similarly, judging whether the source video image sequence and the target video image sequence which are matched similarly are matched accurately or not according to the aberration characteristics of the source video image sequence and the aberration characteristics of the target video image sequence;
and outputting a comparison result generated based on the judgment result.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.