
CN109257584B - User watching viewpoint sequence prediction method for 360-degree video transmission - Google Patents


Info

Publication number
CN109257584B
CN109257584B
Authority
CN
China
Prior art keywords
viewpoint
sequence
user
future
positions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810886661.7A
Other languages
Chinese (zh)
Other versions
CN109257584A (en)
Inventor
邹君妮
杨琴
刘昕
李成林
熊红凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University filed Critical Shanghai Jiao Tong University
Priority to CN201810886661.7A priority Critical patent/CN109257584B/en
Publication of CN109257584A publication Critical patent/CN109257584A/en
Application granted granted Critical
Publication of CN109257584B publication Critical patent/CN109257584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for predicting the sequence of viewpoints a user will watch during 360-degree video transmission, comprising the following steps: taking the user's viewpoint positions at past moments as the input of a viewpoint sequence prediction model and predicting, through the viewpoint sequence prediction model, the viewpoint positions at multiple future moments, which constitute a first viewpoint sequence; taking the video content as the input of a viewpoint tracking model and predicting, through the viewpoint tracking model, the viewpoint positions at multiple future moments, which constitute a second viewpoint sequence; and combining the first viewpoint sequence and the second viewpoint sequence to determine the user's future viewing viewpoint sequence. The prediction method has good practicability and extensibility, and the length of the predicted viewpoint sequence can be changed according to the speed of the user's head movement.

Description

Prediction method of a user viewing viewpoint sequence for 360-degree video transmission

Technical Field

The present invention relates to the technical field of video communication, and in particular to a method for predicting the sequence of viewpoints a user watches during 360-degree video transmission.

Background Art

360-degree video is an important application of virtual reality technology. Compared with traditional video, 360-degree video uses omnidirectional cameras to capture scenes in every direction of the real world and stitches these scenes into panoramic images. When watching a 360-degree video, the user can freely turn the head to adjust the viewing angle and obtain an immersive experience. However, 360-degree video has an ultra-high resolution, and transmitting a complete 360-degree video consumes more than six times the bandwidth of traditional video. Under limited network bandwidth, especially on mobile networks, transmitting the complete 360-degree video is difficult.

Limited by the field of view of the head-mounted display, the user can only watch a portion of the 360-degree video at each moment. Therefore, selecting the video region the user is interested in for transmission, according to the user's head movement, uses bandwidth more effectively. From acquiring the user's demand information and feeding it back to the server until the user receives the video content, a round-trip time (RTT) between the user and the server elapses. The user's head may have moved during this period, so that the received content is no longer the part the user is interested in. To avoid the transmission lag caused by the RTT, the user's viewpoint needs to be predicted.

A search of the prior art shows that a common approach to user viewpoint prediction is to infer future viewpoint positions from past viewpoint positions. Y. Bao et al. published a paper entitled "Shooting a moving target: Motion-prediction-based transmission for 360-degree videos" at the IEEE International Conference on Big Data, which proposed three regression models: a simple model that directly uses the current viewpoint position as the future viewpoint position, and linear regression and feed-forward neural network models that regress the user's viewpoint position over time and then predict the future viewpoint position. However, factors such as the user's occupation, age, gender and preferences affect which regions of a 360-degree video the user is interested in, and the relationship between future and past viewpoint positions is nonlinear with long-term dependencies; the three prediction models proposed in that paper can only predict a single viewpoint position and cannot predict viewpoint positions at multiple future moments.

The search also found that A. D. Aladagli et al. published a paper entitled "Predicting head trajectories in 360 virtual reality videos" at the International Conference on 3D Immersion (2018, pp. 1-6), which considered the influence of video content on the user's viewpoint position and predicted the salient regions of the video with a saliency algorithm in order to predict the user's viewpoint position. However, that paper did not consider the influence of past viewpoint positions on the viewing viewpoint.

Summary of the Invention

In view of the defects in the prior art, the purpose of the present invention is to provide a method for predicting a user's viewing viewpoint sequence for 360-degree video transmission.

The present invention provides a method for predicting a user's viewing viewpoint sequence for 360-degree video transmission, comprising:

taking the user's viewpoint positions at past moments as the input of a viewpoint sequence prediction model, and predicting the viewpoint positions at multiple future moments through the viewpoint sequence prediction model, the viewpoint positions at the multiple future moments constituting a first viewpoint sequence;

taking the video content as the input of a viewpoint tracking model, and predicting the viewpoint positions at multiple future moments through the viewpoint tracking model, the viewpoint positions at the multiple future moments constituting a second viewpoint sequence;

combining the first viewpoint sequence and the second viewpoint sequence to determine the user's future viewing viewpoint sequence.

Optionally, before taking the user's viewpoint positions at past moments as the input of the viewpoint sequence prediction model and predicting the viewpoint positions at multiple future moments through the viewpoint sequence prediction model, the method further comprises:

constructing the viewpoint sequence prediction model based on a recurrent neural network; wherein the viewpoint sequence prediction model is used to encode the input viewpoint positions and feed them into the recurrent neural network, compute the values of the hidden units and the output units, learn the long-term dependencies between the user's viewing viewpoints at different moments, and output the viewpoint positions at multiple future moments; the viewpoint positions comprise the unit-circle projections of the pitch angle, yaw angle and roll angle, and the viewpoint positions vary between -1 and 1; the hyperbolic tangent function is used as the activation function of the output unit, and this activation function limits the output range of the viewpoint positions.

Optionally, taking the user's viewpoint positions at past moments as the input of the viewpoint sequence prediction model and predicting the viewpoint positions at multiple future moments through the viewpoint sequence prediction model comprises:

taking the viewpoint position of the user at the current moment as the input of the first iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint position of the first iteration;

cyclically taking the predicted viewpoint position of the previous iteration as the input of the next iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint positions at multiple future moments.

Optionally, the length of the first viewpoint sequence is related to the speed of the user's head movement while viewing: the slower the user's head moves, the longer the corresponding first viewpoint sequence; the faster the user's head moves, the shorter the corresponding first viewpoint sequence.
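As a purely illustrative sketch of this idea (not part of the patented method), the following Python function maps a measured head speed to a prediction window length; the speed thresholds and window lengths are assumptions chosen only for the example.

```python
def prediction_window(head_speed_deg_per_s, slow=30.0, fast=90.0, t_w_long=10, t_w_short=3):
    """Choose a longer prediction window t_w for slow head motion and a shorter one for fast motion.

    The speed thresholds (30 and 90 deg/s) and the window lengths (10 and 3 steps)
    are illustrative assumptions, not values taken from the patent.
    """
    if head_speed_deg_per_s <= slow:
        return t_w_long
    if head_speed_deg_per_s >= fast:
        return t_w_short
    frac = (head_speed_deg_per_s - slow) / (fast - slow)   # interpolate in between
    return int(round(t_w_long + frac * (t_w_short - t_w_long)))
```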

Optionally, before taking the video content as the input of the viewpoint tracking model and predicting the viewpoint positions at multiple future moments through the viewpoint tracking model, the method further comprises:

constructing the viewpoint tracking model according to a correlation filter algorithm for target tracking, wherein the correlation filter algorithm means setting a correlation filter that produces its maximum response value over the video region at the viewpoint position.

Optionally, taking the video content as the input of the viewpoint tracking model and predicting the viewpoint positions at multiple future moments through the viewpoint tracking model comprises:

projecting the spherical image of a 360-degree video frame at a future moment into a planar image using the equirectangular projection;

determining a bounding box in the planar image through the viewpoint tracking model, the region within the bounding box being the viewpoint region, and determining the corresponding viewpoint position according to the viewpoint region.

Optionally, combining the first viewpoint sequence and the second viewpoint sequence to determine the user's future viewing viewpoint sequence comprises:

setting different weight values w1 and w2 for the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence respectively, the weights w1 and w2 satisfying w1 + w2 = 1; wherein the weight values w1 and w2 are set so as to minimise the error between the predicted future viewing viewpoint positions and the user's actual viewing viewpoint positions;

calculating the user's future viewing viewpoint sequence according to the weight values w1 and w2, the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence; the calculation formula is as follows:

v̂_{t+1:t+t_w} = w1 ⊙ v̂¹_{t+1:t+t_w} + w2 ⊙ v̂²_{t+1:t+t_w}

where v̂_{t+1:t+t_w} is the user's future viewing viewpoint positions from time t+1 to time t+t_w, w1 is the weight value of the first viewpoint sequence, v̂¹_{t+1:t+t_w} is the viewpoint positions in the first viewpoint sequence from time t+1 to time t+t_w, w2 is the weight value of the second viewpoint sequence, v̂²_{t+1:t+t_w} is the viewpoint positions in the second viewpoint sequence from time t+1 to time t+t_w, ⊙ denotes element-wise multiplication, t is the current time, and t_w is the prediction time window.

Optionally, as the prediction time increases, the weight w2 of the second viewpoint sequence predicted by the viewpoint tracking model decreases gradually.

Compared with the prior art, the present invention has the following beneficial effects:

The method for predicting a user's viewing viewpoint sequence for 360-degree video transmission provided by the present invention uses a recurrent neural network to learn the long-term dependencies between the user's viewing viewpoints at different moments and predicts the viewpoint positions at multiple future moments based on the user's past viewpoint positions; it also considers the influence of the video content on the viewing viewpoint and predicts the future viewpoint sequence based on the video content; finally, it combines the influence of the recurrent neural network and of the video content on the viewing viewpoint to obtain the user's future viewing viewpoint sequence. The length of the predicted viewpoint sequence can be changed according to the speed of the user's head movement, so the method has good practicability and extensibility.

Brief Description of the Drawings

Other features, objects and advantages of the present invention will become more apparent by reading the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a system block diagram of a method for predicting a user's viewing viewpoint sequence for 360-degree video transmission provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the viewpoint region provided by an embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the present invention in any form. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention; these all fall within the protection scope of the present invention.

FIG. 1 is a system block diagram of a method for predicting a user's viewing viewpoint sequence for 360-degree video transmission provided by an embodiment of the present invention. As shown in FIG. 1, the system comprises a viewpoint prediction module based on a recurrent neural network, a viewpoint tracking module based on a correlation filter, and a fusion module. The recurrent neural network viewpoint prediction module learns the long-term dependencies between the user's viewing viewpoints at different moments and predicts the viewpoint positions at multiple future moments based on the user's past viewpoint positions. The correlation filter viewpoint tracking module considers the influence of the video content on the viewing viewpoint, explores the relationship between the video content and the viewpoint sequence, and predicts the future viewpoint sequence based on the video content. The fusion module combines the prediction results of the recurrent neural network viewpoint prediction module and the correlation filter viewpoint tracking module; the two modules complement each other and improve the prediction accuracy of the model.

In this embodiment, a recurrent neural network learns the long-term dependencies between the user's viewing viewpoints at different moments, and the viewpoint positions at multiple future moments are predicted based on the user's past viewpoint positions; the influence of the video content on the viewing viewpoint is also considered, a viewpoint tracking module based on a correlation filter is proposed to explore the relationship between the video content and the viewpoint sequence, and the future viewpoint sequence is predicted based on the video content; finally, the fusion module combines the prediction results of the recurrent neural network viewpoint prediction module and the correlation filter viewpoint tracking module, so that the two modules complement each other and the prediction accuracy of the model is improved. The viewpoint sequence prediction structure proposed by the present invention can change the length of the predicted viewpoint sequence according to the speed of the user's head movement, has good practicability and extensibility, and lays a solid foundation for the efficient transmission of 360-degree video.

Specifically, in this embodiment, predicting the viewpoint position means predicting the unit-circle projections of the pitch angle (θ), yaw angle (φ) and roll angle (ψ), which correspond to the rotation of the user's head about the X, Y and Z axes. FIG. 2 is a schematic diagram of the viewpoint region provided by an embodiment of the present invention; referring to FIG. 2, the three angles of the initial position of the user's head are all defined as 0 degrees, and each angle varies from -180° to 180°. For a user watching the video with a head-mounted display, these three angles determine a unique viewpoint position. Experiments show that when the user turns the head, the yaw angle φ changes most markedly relative to the other two angles and is therefore the hardest to predict.

This example focuses on the prediction of the yaw angle φ; the proposed system structure can be extended directly to the prediction of the other two angles. By the angle definition, -180° and 179° differ by 1° rather than 359°. To avoid this wrap-around problem, the predicted angle is first transformed, and V_t = g(φ_t) = [sin φ_t, cos φ_t] is used as the input; before the prediction result is output, the inverse transform g⁻¹ is applied to the predicted V_t to recover the angle. Here V_t is the output vector obtained by applying the g transform to the yaw angle φ_t at time t, sin φ_t is the sine of the yaw angle at time t, cos φ_t is the cosine of the yaw angle at time t, and g(φ_t) denotes the transform of the yaw angle at time t.

In this embodiment, the recurrent neural network viewpoint prediction module takes the viewpoint position at the current moment (the transformed yaw angle V_t) as input and predicts the yaw angles at multiple future moments v̂_{t+1}, ..., v̂_{t+t_w}, where t_w is the prediction time window, v̂_{t+1} is the value of the yaw angle at time t+1, and v̂_{t+t_w} is the value of the yaw angle at time t+t_w. If the user's head moves slowly, a larger prediction time window t_w can be chosen; otherwise the prediction time window should be set to a smaller value. During training, for each time step i (with i ranging from t to t+t_w-1), V_i is encoded into a 128-dimensional vector x_i. Then x_i is fed into the recurrent neural network, and the hidden unit h_i and the output unit o_i are computed. For each time step from t to t+t_w-1, the following update equations are applied:

x_i = σ1(W_xv V_i + b_x)    (1)

h_i = σ2(W_hx x_i + W_hh h_{i-1} + b_h)    (2)

y_i = W_oh h_i + b_o    (3)

v̂_{i+1} = g⁻¹(y_i)    (4)

where W_xv is the weight matrix that encodes the yaw angle into the 128-dimensional vector x_i, W_hx is the weight matrix connecting the input unit x_i to the hidden unit h_i, W_hh is the weight matrix connecting the hidden unit h_{i-1} at time i-1 to the hidden unit h_i at time i, W_oh is the weight matrix connecting the hidden unit h_i to the output unit o_i, b_x is the bias vector of the encoding step, b_h is the bias vector for computing the hidden unit h_i, and b_o is the bias vector for computing the output unit o_i. During testing, the viewpoint position at the current moment is used as the input of the first iteration; for the other time steps, the prediction result of the previous iteration is used as the input of the next iteration. σ1 and σ2 are activation functions, where σ1 is the rectified linear (ReLU) function and σ2 is the hyperbolic tangent function; v̂_{i+1} is the predicted value of the user's viewpoint position at time i+1, g(φ_i) is the transformed value of the yaw angle φ_i at time i, h_{i-1} is the hidden unit at time i-1, and g⁻¹(y_i) is the inverse g transform of the output y_i.
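A minimal NumPy sketch of update equations (1)-(4) is given below. The weight shapes, the way the previous prediction is re-encoded for the next step, and the use of a plain (non-gated) recurrent cell are assumptions made only for illustration.

```python
import numpy as np

def relu(x):                      # sigma_1: rectified linear function
    return np.maximum(x, 0.0)

def rnn_rollout(yaw_deg, params, t_w):
    """Autoregressively predict t_w future yaw angles from the current yaw angle.

    params holds numpy weight matrices W_xv, W_hx, W_hh, W_oh and bias vectors b_x, b_h, b_o.
    """
    V = np.array([np.sin(np.deg2rad(yaw_deg)), np.cos(np.deg2rad(yaw_deg))])   # V_t = g(yaw)
    h = np.zeros(params["W_hh"].shape[0])                                      # hidden state (e.g. 256-d)
    predictions = []
    for _ in range(t_w):
        x = relu(params["W_xv"] @ V + params["b_x"])                           # (1) encode to 128-d
        h = np.tanh(params["W_hx"] @ x + params["W_hh"] @ h + params["b_h"])   # (2) sigma_2 = tanh
        y = params["W_oh"] @ h + params["b_o"]                                 # (3) output unit
        yaw_pred = np.rad2deg(np.arctan2(y[0], y[1]))                          # (4) inverse g transform
        predictions.append(yaw_pred)
        V = np.array([np.sin(np.deg2rad(yaw_pred)),
                      np.cos(np.deg2rad(yaw_pred))])                           # previous prediction feeds the next step
    return np.array(predictions)
```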

In this embodiment, the correlation filter viewpoint tracking module, following a target-tracking correlation filter algorithm, designs a correlation filter that yields its maximum response over the region where the viewpoint is located. It takes the future 360-degree video frames F_{t+1}, ..., F_{t+t_w} as input and predicts the viewpoint positions based on the video content, where F_{t+1} is the 360-degree video frame at time t+1 and F_{t+t_w} is the 360-degree video frame at time t+t_w. Since target-tracking correlation filter algorithms are mainly used to track concrete objects in a video, the viewpoint tracked in this embodiment is more abstract than a concrete object. Therefore, the spherical image of each 360-degree video frame is first projected into a planar image using the equirectangular projection, and the region corresponding to the viewpoint is re-located on the planar image. In the projected planar image, content near the poles is stretched horizontally, so the region corresponding to the viewpoint is no longer rectangular; a bounding box is therefore set around the viewpoint to redefine the size and shape of the viewpoint region. In this way, the bounding box of the viewpoint can be predicted from the video content, and thus the viewpoint positions v̂²_{t+1}, ..., v̂²_{t+t_w} are predicted.
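The sketch below illustrates only the coordinate handling described above: placing a small bounding box around the viewpoint on the 1800×900 equirectangular frame and converting a tracked box centre back to angles. The correlation filter tracker itself is left as a pluggable component with an assumed init/update interface, and the linear pixel-angle mapping is an assumption for illustration.

```python
FRAME_W, FRAME_H = 1800, 900      # equirectangular frame size used in this embodiment
BOX = 10                          # bounding-box side length in pixels

def viewpoint_to_pixel(yaw_deg, pitch_deg):
    """Locate the bounding-box centre of a viewpoint on the equirectangular frame."""
    u = (yaw_deg + 180.0) / 360.0 * FRAME_W
    v = (90.0 - pitch_deg) / 180.0 * FRAME_H
    return u, v

def pixel_to_viewpoint(u, v):
    """Map a tracked bounding-box centre back to yaw/pitch angles in degrees."""
    return u / FRAME_W * 360.0 - 180.0, 90.0 - v / FRAME_H * 180.0

def track_viewpoints(tracker, frames, yaw0, pitch0):
    """Predict the second viewpoint sequence from the video content alone.

    tracker is any correlation-filter tracker exposing init(frame, box) and
    update(frame) -> (x, y, w, h); this interface is an assumption.
    frames are the future equirectangular frames F_{t+1} ... F_{t+t_w}.
    """
    u, v = viewpoint_to_pixel(yaw0, pitch0)
    tracker.init(frames[0], (u - BOX / 2, v - BOX / 2, BOX, BOX))
    sequence = []
    for frame in frames:
        x, y, w, h = tracker.update(frame)                   # box with maximum filter response
        sequence.append(pixel_to_viewpoint(x + w / 2, y + h / 2))
    return sequence
```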

In this embodiment, the prediction results of the recurrent neural network viewpoint prediction module and the correlation filter viewpoint tracking module are combined with different weights to obtain the final prediction, i.e.

v̂_{t+1:t+t_w} = w1 ⊙ v̂¹_{t+1:t+t_w} + w2 ⊙ v̂²_{t+1:t+t_w}

where v̂_{t+1:t+t_w} is the final prediction, v̂¹_{t+1:t+t_w} and v̂²_{t+1:t+t_w} are the predictions of the recurrent neural network viewpoint prediction module and the correlation filter viewpoint tracking module respectively, ⊙ denotes element-wise multiplication, and the weights w1 and w2 satisfy w1 + w2 = 1; the weight values that minimise the error of the final predicted viewpoint positions are adopted. In the correlation filter viewpoint tracking module the filter cannot be updated, so the gap between the estimated and true viewpoints grows as errors accumulate; for a large prediction window, the weight of the correlation filter prediction is therefore decreased gradually. The viewpoint sequence prediction module based on the recurrent neural network and the viewpoint tracking module based on the correlation filter thus complement each other, improving the prediction accuracy.
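A minimal sketch of the weighted fusion and of a simple grid search for the weight is shown below; the scalar weights broadcast over the window and the grid-search granularity are assumptions, since the description only requires the weights to minimise the prediction error.

```python
import numpy as np

def fuse(seq_rnn, seq_cf, w1):
    """Element-wise weighted combination of the two predicted viewpoint sequences."""
    w2 = 1.0 - w1                                            # weights satisfy w1 + w2 = 1
    return w1 * np.asarray(seq_rnn) + w2 * np.asarray(seq_cf)

def select_weight(seq_rnn, seq_cf, actual, candidates=np.linspace(0.0, 1.0, 101)):
    """Pick the w1 that minimises the error between fused and actual viewpoint positions."""
    errors = [np.mean(np.abs(fuse(seq_rnn, seq_cf, w) - np.asarray(actual))) for w in candidates]
    return float(candidates[int(np.argmin(errors))])
```

Because the combination is element-wise, w1 and w2 can equally be vectors over the prediction window, which is one way the weight of the correlation filter prediction can be reduced for later time steps.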

The key parameters in this embodiment are set as follows. The experimental data come from the paper "Shooting a moving target: Motion-prediction-based transmission for 360-degree videos" published by Y. Bao et al. at the IEEE International Conference on Big Data; the dataset recorded the head movement of 153 volunteers watching 16 360-degree videos, some volunteers watched only part of the videos, and 985 viewing samples were collected in total. In the data preprocessing of this embodiment, each viewing sample is sampled 10 times per second, 289 motion records are kept per viewing sample, and 285,665 motion records are obtained in total. 80% of the motion data are used as the training set and 20% as the test set. For the recurrent neural network module, the hidden unit size is set to 256, the Adam (adaptive moment estimation) optimisation method is used, and the momentum and weight decay are set to 0.8 and 0.999 respectively. The batch size is 128, and the model is trained for 500 epochs. The learning rate decays linearly from 0.001 to 0.0001 during the first 250 epochs. For the correlation filter viewpoint tracking module, the image is resized to 1800×900 and the bounding box size is set to 10×10. For the fusion module, different values are assigned to w1 and w2, and the values that minimise the error of the final predicted viewpoint positions are chosen as the final weights.
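A hedged PyTorch sketch of this training configuration follows. Reading the stated "momentum and weight decay" of 0.8 and 0.999 as Adam's beta coefficients is an assumption, and the torch.nn.RNN module stands in as a placeholder for the full predictor (angle encoder, recurrent cell and output layer).

```python
import torch

HIDDEN_SIZE = 256
BATCH_SIZE = 128
EPOCHS = 500

# Placeholder for the viewpoint sequence predictor described in the embodiment.
model = torch.nn.RNN(input_size=128, hidden_size=HIDDEN_SIZE, batch_first=True)

# Interpreting "momentum and weight decay of 0.8 and 0.999" as Adam's betas (assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.8, 0.999))

def lr_factor(epoch):
    """Linear decay from 1e-3 to 1e-4 over the first 250 epochs, then held constant."""
    return 1.0 - 0.9 * epoch / 250.0 if epoch < 250 else 0.1

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```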

To meet the need to improve bandwidth utilisation in 360-degree video transmission, the present invention proposes a viewpoint sequence prediction system based on the user's past viewpoint positions and the 360-degree video content. The viewpoint sequence prediction structure proposed by the present invention can predict the user's viewpoint positions at multiple future moments and can change the length of the predicted viewpoint sequence according to the speed of the user's head movement; it has good practicability and extensibility and lays a solid foundation for the efficient transmission of 360-degree video.

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the above specific embodiments, and those skilled in the art can make various changes or modifications within the scope of the claims without affecting the essential content of the present invention. Where there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.

Claims (8)

1. A method for predicting a user viewing viewpoint sequence for 360-degree video transmission, characterized by comprising:

taking the viewpoint positions of the user at past moments as the input of a viewpoint sequence prediction model, and predicting the viewpoint positions at multiple future moments through the viewpoint sequence prediction model, the viewpoint positions at the multiple future moments predicted by the viewpoint sequence prediction model constituting a first viewpoint sequence; the viewpoint sequence prediction model being constructed based on a recurrent neural network and used to encode the input viewpoint positions and feed them into the recurrent neural network, compute the values of the hidden units and output units, learn the long-term dependencies between the user's viewing viewpoints at different moments, and output the viewpoint positions at multiple future moments; the viewpoint positions comprising the unit-circle projections of the pitch angle, yaw angle and roll angle, the viewpoint positions varying between -1 and 1; the hyperbolic tangent function being used as the activation function of the output unit, the activation function limiting the output range of the viewpoint positions;

taking video content as the input of a viewpoint tracking model, and predicting the viewpoint positions at multiple future moments through the viewpoint tracking model, the viewpoint positions at the multiple future moments predicted by the viewpoint tracking model constituting a second viewpoint sequence; the viewpoint tracking model being constructed according to a correlation filter algorithm for target tracking, wherein the correlation filter algorithm means setting a correlation filter that forms a maximum response value over the video region at the viewpoint position;

combining the first viewpoint sequence and the second viewpoint sequence to determine the user's future viewing viewpoint sequence.

2. The method for predicting a user viewing viewpoint sequence for 360-degree video transmission according to claim 1, characterized in that before taking the viewpoint positions of the user at past moments as the input of the viewpoint sequence prediction model and predicting the viewpoint positions at multiple future moments through the viewpoint sequence prediction model, the method further comprises: constructing the viewpoint sequence prediction model based on a recurrent neural network.

3. The method for predicting a user viewing viewpoint sequence for 360-degree video transmission according to claim 2, characterized in that taking the viewpoint positions of the user at past moments as the input of the viewpoint sequence prediction model and predicting the viewpoint positions at multiple future moments through the viewpoint sequence prediction model comprises:

taking the viewpoint position of the user at the current moment as the input of the first iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint position of the first iteration;

cyclically taking the predicted viewpoint position of the previous iteration as the input of the next iteration of the viewpoint sequence prediction model to obtain the predicted viewpoint positions at multiple future moments.

4. The method for predicting a user viewing viewpoint sequence for 360-degree video transmission according to claim 1, characterized in that the length of the first viewpoint sequence is related to the speed of the user's head movement while viewing: the slower the user's head moves, the longer the corresponding first viewpoint sequence; the faster the user's head moves, the shorter the corresponding first viewpoint sequence.

5. The method for predicting a user viewing viewpoint sequence for 360-degree video transmission according to claim 1, characterized in that before taking video content as the input of the viewpoint tracking model and predicting the viewpoint positions at multiple future moments through the viewpoint tracking model, the method further comprises: constructing the viewpoint tracking model according to the correlation filter algorithm for target tracking.

6. The method for predicting a user viewing viewpoint sequence for 360-degree video transmission according to claim 5, characterized in that taking video content as the input of the viewpoint tracking model and predicting the viewpoint positions at multiple future moments through the viewpoint tracking model comprises:

projecting the spherical image of a 360-degree video frame at a future moment into a planar image using the equirectangular projection;

determining a bounding box in the planar image through the viewpoint tracking model, the region within the bounding box being the viewpoint region, and determining the corresponding viewpoint position according to the viewpoint region.

7. The method for predicting a user viewing viewpoint sequence for 360-degree video transmission according to any one of claims 1-6, characterized in that combining the first viewpoint sequence and the second viewpoint sequence to determine the user's future viewing viewpoint sequence comprises:

setting different weight values w1 and w2 for the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence respectively, the weights w1 and w2 satisfying w1 + w2 = 1, wherein the weight values w1 and w2 are set so as to minimise the error between the predicted future viewing viewpoint positions and the user's actual viewing viewpoint positions;

calculating the user's future viewing viewpoint sequence according to the weight values w1 and w2, the viewpoint positions in the first viewpoint sequence and the viewpoint positions in the second viewpoint sequence; the calculation formula is as follows:

v̂_{t+1:t+t_w} = w1 ⊙ v̂¹_{t+1:t+t_w} + w2 ⊙ v̂²_{t+1:t+t_w}

where v̂_{t+1:t+t_w} is the user's future viewing viewpoint positions from time t+1 to time t+t_w, w1 is the weight value of the first viewpoint sequence, v̂¹_{t+1:t+t_w} is the viewpoint positions in the first viewpoint sequence from time t+1 to time t+t_w, w2 is the weight value of the second viewpoint sequence, v̂²_{t+1:t+t_w} is the viewpoint positions in the second viewpoint sequence from time t+1 to time t+t_w, ⊙ denotes element-wise multiplication, t is the current time, and t_w is the prediction time window.

8. The method for predicting a user viewing viewpoint sequence for 360-degree video transmission according to claim 7, characterized in that as the prediction time increases, the weight w2 of the second viewpoint sequence predicted by the viewpoint tracking model decreases gradually.
CN201810886661.7A 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission Active CN109257584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810886661.7A CN109257584B (en) 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810886661.7A CN109257584B (en) 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission

Publications (2)

Publication Number Publication Date
CN109257584A CN109257584A (en) 2019-01-22
CN109257584B true CN109257584B (en) 2020-03-10

Family

ID=65048730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810886661.7A Active CN109257584B (en) 2018-08-06 2018-08-06 User watching viewpoint sequence prediction method for 360-degree video transmission

Country Status (1)

Country Link
CN (1) CN109257584B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862019B (en) * 2019-02-20 2021-10-22 联想(北京)有限公司 Data processing method, device and system
CN110248212B (en) * 2019-05-27 2020-06-02 上海交通大学 Multi-user 360-degree video stream server-side bit rate adaptive transmission method and system
CN110166850B (en) * 2019-05-30 2020-11-06 上海交通大学 Method and system for predicting panoramic video watching position by multiple CNN networks
CN110248178B (en) * 2019-06-18 2021-11-23 深圳大学 Viewport prediction method and system using object tracking and historical track panoramic video
CN114040184B (en) * 2021-11-26 2024-07-16 京东方科技集团股份有限公司 Image display method, system, storage medium and computer program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104768018A (en) * 2015-02-04 2015-07-08 浙江工商大学 A Fast Viewpoint Prediction Method Based on Depth Map
CN106612426A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Method and device for transmitting multi-view video
CN107274472A (en) * 2017-06-16 2017-10-20 福州瑞芯微电子股份有限公司 A kind of method and apparatus of raising VR play frame rate
CN107422844A (en) * 2017-03-27 2017-12-01 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN107533230A (en) * 2015-03-06 2018-01-02 索尼互动娱乐股份有限公司 Head mounted display tracing system
CN107770561A (en) * 2017-10-30 2018-03-06 河海大学 A kind of multiresolution virtual reality device screen content encryption algorithm using eye-tracking data
CN108134941A (en) * 2016-12-01 2018-06-08 联发科技股份有限公司 Adaptive video decoding method and apparatus thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10432988B2 (en) * 2016-04-15 2019-10-01 Ati Technologies Ulc Low latency wireless virtual reality systems and methods
US9681096B1 (en) * 2016-07-18 2017-06-13 Apple Inc. Light field capture

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104768018A (en) * 2015-02-04 2015-07-08 浙江工商大学 A Fast Viewpoint Prediction Method Based on Depth Map
CN107533230A (en) * 2015-03-06 2018-01-02 索尼互动娱乐股份有限公司 Head mounted display tracing system
CN106612426A (en) * 2015-10-26 2017-05-03 华为技术有限公司 Method and device for transmitting multi-view video
CN108134941A (en) * 2016-12-01 2018-06-08 联发科技股份有限公司 Adaptive video decoding method and apparatus thereof
CN107422844A (en) * 2017-03-27 2017-12-01 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN107274472A (en) * 2017-06-16 2017-10-20 福州瑞芯微电子股份有限公司 A kind of method and apparatus of raising VR play frame rate
CN107770561A (en) * 2017-10-30 2018-03-06 河海大学 A kind of multiresolution virtual reality device screen content encryption algorithm using eye-tracking data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Xiaochuan, Liang Xiaohui; "Viewpoint-predicting-based Remote Rendering on Mobile Devices using Multiple"; 2015 International Conference on Virtual Reality and Visualization; 2016-05-12; full text *

Also Published As

Publication number Publication date
CN109257584A (en) 2019-01-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant