Received August 5, 2021, accepted September 2, 2021, date of publication September 15, 2021,
date of current version September 28, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3113162
An Open Framework for Analyzing and
Modeling XR Network Traffic
MATTIA LECCI , (Graduate Student Member, IEEE),
MATTEO DRAGO , (Graduate Student Member, IEEE),
ANDREA ZANELLA , (Senior Member, IEEE),
AND MICHELE ZORZI , (Fellow, IEEE)
Department of Information Engineering, University of Padua, 35131 Padova, Italy
Corresponding author: Mattia Lecci (leccimat@dei.unipd.it)
This work was supported in part by the National Institute of Standards and Technology (NIST) under Award 60NANB19D122 and Award
60NANB21D127. The work of Mattia Lecci was supported by Fondazione CaRiPaRo under Grant ‘‘Dottorati di Ricerca 2018.’’
ABSTRACT Thanks to recent advancements in technology, Augmented Reality (AR) and Virtual
Reality (VR) applications are gaining a lot of momentum, and they will surely become increasingly popular
in the next decade. These new applications, however, also require a step forward in terms of models to
simulate and analyze this type of traffic source in modern communication networks, in order to guarantee
state-of-the-art performance and Quality of Experience (QoE) to the users. Recognizing this need, in this
work we present a novel open-source traffic model, which researchers can use as a starting point both for
improvements of the model itself and for the design of optimized algorithms for the transmission of these
peculiar data flows. Along with the mathematical model and the code, we also share with the community
the traces that we gathered for our study, collected from freely available applications such as Minecraft VR,
Google Earth VR, and Virus Popper. Finally, we propose a roadmap for the construction of an end-to-end
framework that fills this gap in the current state of the art.
INDEX TERMS Traffic modeling, traffic analysis, network simulations, virtual reality applications.
I. INTRODUCTION
After several years of innovations, technology is finally
ready for applications such as Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR) to go mainstream (in the following, we will use the term eXtended
Reality (XR) as a general expression encompassing all these
distinct interaction modes). According to some estimates,
by 2025 there will be over 200 million people using XR for
immersive gaming experiences and 95 million enjoying live
events in this novel way [2]. This immediately translates into
increasing sales of devices and headsets dedicated to experiencing this new type of content, with an estimated shipment of
these devices in the order of tens of millions in the coming
decade, generating billions in revenue for all the fields in
which this technology will be deployed [3]–[5].
Although it all started from the entertainment and video
gaming arenas, where players could immerse themselves in a virtual
3D world, now we can see XR applied in various fields,
The associate editor coordinating the review of this manuscript and
approving it for publication was Xiaogang Jin.
such as building or landscape design, real estate, marketing,
and healthcare, opening up the possibility of learning new
concepts and training employees for difficult situations in a
completely different way [3], [6]–[10]. Automotive companies, for example, are using VR to cut the time that leads to
the physical model of a new product from weeks to days [5].
In the general retail market, instead, VR can give
customers realistic experiences with products, allowing them
to easily consider different options and configurations [3] and
thus increasing sales and decreasing product returns.
The peculiarity of this new class of content, besides the
wide range of use cases, is that the end user does not passively
receive the information, but acts on it, possibly affecting the
future behavior of the application itself. Hence, the traffic
flow to and from the content provider is highly dependent on
the interaction with the virtual environment in which the user
is immersed.
In this paper, we will focus on examples related to the video
gaming world (even though equivalent conclusions can be
drawn for different XR applications), where the user interacts
with the application using a keyboard or joypad, and the
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 9, 2021
M. Lecci et al.: Open Framework for Analyzing and Modeling XR Network Traffic
results of such actions are immediately seen on the PC or
TV screen. Through Head Mounted Devices (HMDs), when
playing video games supporting VR, users can also react
by moving their heads, causing the application to stream
distinct portions of the environment depending on where the
head is pointing [11].
Even though gaming software traditionally ran on devices
that had to satisfy several hardware constraints to
generate high-quality images, the paradigm is now shifting
towards a cloud approach [12], [13]. This can be extended
to all other use cases besides gaming, and for this reason,
we can refer to this new paradigm as Cloud XR. By moving
the computing and graphical processing units into the cloud,
less powerful devices can be used to fully exploit this new
technology. This would benefit not only in terms of the actual
cost of an HMD (which still plays a huge role in promoting
the adoption by new users) but also in the final Quality
of Experience (QoE). Having all the computing resources
self-contained in the device would mean not only a higher
weight and volume, but also concerns in terms of heat and
battery life [4], [9].
This shift towards cloud infrastructures requires the optimization of current communication systems, to fully support
distributed XR services. To this end, we need accurate models
of the applications that generate these data flows and, to the
best of our knowledge, no previous work has addressed this
problem so far.
In this work we try to fill this gap by proposing a traffic
model that emulates an XR application, while also sketching
a roadmap to guide researchers in the development of more
precise models, using ours as a baseline.
To better understand which steps most influence
XR performance, it is useful to describe a common end-to-end XR architecture [2], [9]. First, we can start from the
collection and processing of sensory and tracking information, delegated to an ad hoc device. Then, this information
is sent to an XR server to compose the viewport, i.e., what
is actually shown to the user. This process includes 2D/3D
media encoding and the generation of additional metadata
(including the scene description). The device’s presentation
engine at the client side, after receiving and decoding the
information stream, generates the images to display. These
images are derived from the decoded signals, the rendering
metadata, and other information, if applicable. Finally, video
and audio tracks associated with the current pose are generated by synchronizing and spatially aligning the rendered
media. These steps need to be accomplished with minimal
delay to guarantee adequate QoE.
In fact, the motion-to-photon latency, i.e., the time from an
action (e.g., a head movement) to the update of what is shown
on the display, must be below 20 ms to avoid the so-called
cybersickness, associated with disorientation and dizziness [2],
[14]–[18]. Based on this physiological constraint, several
industry players set the network latency requirements in the range of 5–10 ms [2], [4], [9], [14], [19]. Also
in terms of the gaming QoE, it has been demonstrated that for
first-person shooters, racing games, and team soccer matches,
application latency directly impacts the results of competitive e-sports and, if not properly addressed, may lead players to
abandon the game [12]. This translates into stringent constraints both in Downlink (DL) and in Uplink (UL), considering that not only must the content be streamed as soon as it
is required, but also the user movements need to be promptly
notified to the server. For this reason, the software that collects each movement input must consider all 6 Degrees of
Freedom (6DoF), tracking both translations and rotations
in the three perpendicular axes (based on the VR device,
some may consider only rotational motion, i.e., 3 Degrees of
Freedom (3DoF)). To take immersive mobile experiences to
the next level, many improvements will be required in head,
body, and even gaze tracking [7].
It is also important to distinguish between processing
latency, associated with computation and rendering, and network latency. Rendering complex gaming images can be quite
demanding, and the delay introduced by these operations can
be larger than that caused by network services, which further
motivates the need to offload these functions to proper cloud
infrastructures [12].
Besides delay-related issues, an additional problem is the bursty nature of XR traffic, meaning that the
throughput measured over short time windows could be much
higher than its long-term average value [9], which can be the
case for an application that periodically generates collections
of packets to refresh the viewport. Another aspect impacting
the throughput is that, in order for the technology to be as
close as possible to human vision, we will need a higher
spatial and temporal resolution of the content presented
to the user than currently possible (i.e., 3D 360° video at
8192 × 4096 resolution, with display refresh rates of 90 Hz and beyond)
[7], [9], [14].
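To put these targets in perspective, a back-of-the-envelope calculation (our own sketch, assuming an uncompressed 24 bits per pixel format; actual pixel formats and chroma subsampling vary) shows why such resolutions are infeasible without aggressive compression:

```python
def raw_bitrate_gbps(width, height, bits_per_pixel, refresh_hz):
    """Raw (uncompressed) video bitrate in Gbit/s."""
    return width * height * bits_per_pixel * refresh_hz / 1e9

# 8192 x 4096 pixels at 24 bpp and 90 Hz: about 72.5 Gbit/s before
# compression, orders of magnitude above any realistic wireless link.
print(raw_bitrate_gbps(8192, 4096, 24, 90))  # 72.47757312
```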
The core technology that is expected to guarantee the
satisfaction of all these requirements, by paving the way for
an optimized distribution of processing capabilities, is 5G.
Many players have already invested in 5G for the rise of
XR, for operations at both sub-6 GHz and mmWave [7], [8],
[20], [21].
Nonetheless, even though standardization bodies have also
devoted some effort to drafting technical
reports [14], [19], at present researchers are limited
by the lack of precise traffic models representing the stream
to/from an XR server. Having these models would allow the
research community to design telecommunication solutions
that could reduce the delay contribution related to the network, while also considering all the processing steps. For this
reason, we propose a generative model for XR traffic sources,
obtained from real application traces, and we also delineate
a roadmap of the necessary steps to further improve it with
additional features able to cope with the aforementioned
problems, i.e., motion-to-photon latency, burstiness, and capacity.
While in Sec. II we summarize the current state of the
art in the XR arena, Sec. III is devoted to the description of
the acquisition setup that we used, both from the hardware
and from the software point of view, to collect about 70 GB
of data, for a total of more than 4 hours of traced traffic
time across different VR applications. We will also describe
each of these applications and illustrate how we analyzed
the dataset. The model obtained from this analysis will be
presented in Sec. IV, and its end-to-end validation, along with
some example use cases, are discussed in Sec. V. Finally,
in Sec. VI we propose a roadmap to extend our baseline model
with additional increasingly complex features, and Sec. VII
concludes the paper.
II. STATE OF THE ART
A seminal conceptual model that describes the human and
technical elements creating the participatory environments of
virtual reality systems was proposed in [22], dating back to
1994. This demonstrates that the interest in the definition of
common models for the study of this framework started even
before the technology was ready, or even invented.
Despite the research interest in this field, to the best of
our knowledge little work has been done on the creation of
generative traffic models in XR contexts, while the focus
was put on different aspects of the technology. In particular,
a huge effort has been devoted to the creation and validation of practical systems that use immersive technology to
interact with the world in different ways. An example can be
found in [23], which describes a system for the interactive
analysis of large datasets with time-dependent data, realized
on a multi-processor parallel machine in order to guarantee a smooth user experience. Instead, in [24] the authors
developed a proof-of-concept system, combining Oculus Rift
HMD and the Phantom Premium 1.5 High Force haptic
device with the goal of demonstrating the feasibility of combining HMD and haptics in one system. Also, XR solutions
have been tested for purposes of architectural design [25] and
for providing virtual performance instructions and feedback
on users that want to play a real piano [26].
From a more technical perspective, a complete overview
of the latest developments on immersive and 360° video
streaming can be found in [11], where the author aims at
providing a complete overview on four of the most important
challenges in this field, namely: omnidirectional video coding and compression, subjective and objective QoE and the
factors that can affect it, saliency measurement and Field-of-View (FoV) prediction, and adaptive streaming of immersive
360° videos. As stressed in [11], finding a proper way to
measure the user’s QoE may be difficult. This is especially
important with respect to the design of telecommunication
infrastructures able to optimize the experience of the user, and
to guarantee constant and stable service quality.
For this reason, a lot of effort has been devoted to creating
network solutions for the maximization of the quality of the
delivered content. In [27], for example, the authors proposed
a scheme for uplink delivery of tile-based VR video over
cellular networks. In particular, they formulate resource allocation as a frequency- and time-dependent
NP-hard problem, and propose three distinct
algorithms to solve it. Instead, in [28] the authors consider a
QoE-driven transmission of VR 360° content in a multi-user
massive MIMO wireless network. Specifically, in this scenario multiple users in the cell are requesting the same
content, and the goal is to optimize the reception of such
information through a stable scheme for the transmission of
the viewport tiles. In this work, they also try to allocate the
power in order to guarantee a consistent delivery rate for each
stream.
The impact of latency on the overall experience of the
user has been mentioned in Sec. I, along with the importance
of tracking the movements of the user in applications with
strict delay requirements. The authors of [29] used a real
VR head-tracking dataset to maximize the quality of the
delivered video chunk under low-latency constraints. In that
case, a deep recurrent neural network was designed for the
prediction of the users’ FoV (making it possible to cluster those with
overlapping FoV) while information on the future content
and the users’ locations was used as input of a proactive
physical-layer multicast transmission scheme.
A key solution to the latency problem would be to rely
on the capabilities of 5G and edge cloud, exploiting what
has been referred to as Cloud XR in Sec. I. Indeed, in [30]
the authors demonstrated that 5G and edge cloud are necessary to sustain the requirements of applications such as
VR gaming.
All these solutions, however, lack a model capable of
generating data flows that can easily be associated with a
real XR application. The approach of [27] consisted in using
240 frames of each 8K 360° uncompressed video sequence
available from [31]. In that case, the authors applied the
HEVC Kvazaar encoding procedure, setting the frame rate
to 25 Frames Per Second (FPS) and the Group of Pictures
(GoP) size to 8, and using a constant tiling scheme, ideal for
the purpose of their work. Despite the high level of detail
implemented in such a model, the use of a trace-based flow
is limiting per se, considering also the limited portion of the
video that they selected. The offline encoding strategy
is another drawback, which in our framework has been overcome by integrating the rendering server into the processing
pipeline.
Also in [28], the simulation setup from the point of view
of the VR architecture was defined in order to highlight the
features of the algorithms proposed by the authors, and the
nature of the traffic flow (e.g., average frame size, inter- and
intra-frame correlation, inter-frame interval, etc.) was not
taken into account.
Regarding the problem of tracking the movements of the
users, in [29] the authors fed the recurrent neural network
with the 3DoF traces from [32], tracking the pose of 50 different users watching a catalog of 10 HD 360° videos from
YouTube (60 seconds long, 4K resolution, 30 FPS, FoV of
100° × 100°). Having a generative model that creates such a
dataset based on statistical studies on a collection of different
traces would have greatly aided the training of the neural
network used in [29]. Also, finding a dataset that represents
well the problem that we want to solve is usually not feasible,
and this may further limit the research outcomes.
As a consequence, our goal is to provide the community with a tool for the automatic generation of such traces.
A preliminary version of this work was proposed in [1], and
here we extend it with the acquisition of longer and more
heterogeneous traces, which now include realistic interaction
with several VR applications. This extension also allowed a
more detailed and thorough validation of the model. Besides
making both the model and the traces public, we also propose
a possible roadmap for making the framework as complete
and detailed as possible, highlighting the most important
contributions that would benefit researchers aiming at the
design of ad hoc network protocol optimizations for this new
type of traffic source.
III. VR TRAFFIC: ACQUISITION AND ANALYSIS
In this section, we describe our basic traffic modeling work.
Specifically, in Sec. III-A we describe our acquisition setup
and the VR applications that we acquired, then in Sec. III-B
we analyze the raw traffic traces, and the different streams
composing them, both in terms of content and in terms of
statistics.
A. ACQUISITION SETUP
For the rendering server, we used a desktop PC equipped with
an Intel Core i7 processor, 32 GB of RAM, and an NVIDIA
GeForce RTX 2080 Ti graphics card. For the headset, instead,
we used an iPhone XS enclosed in a VR cardboard, which
allows a realistic interaction with the applications. The two
nodes were connected via Wi-Fi to improve the user’s freedom of movement, at the cost of a slightly less stable channel
and of possible interference from other surrounding devices.
VR applications were thus run on the rendering server and
streamed to the headset using the application RiftCat 2.0 (on
the server), and VRidge 2.7.7 (on the phone).1 This setup
allows the user to play VR games on the SteamVR platform
for up to 10 minutes continuously, enough to
obtain traffic traces to be analyzed (note that this limit is given
by the free version of VRidge, and is absent in the premium
version). Many settings can be tuned in this application, such
as the display resolution, the frame rate (either 30 or 60 FPS),
the target data rate (i.e., the data rate the application will try
to consistently stream to the client, which can be set from 1 to
50 Mbps), the video encoder (NVIDIA NVENC was used),
and the video compression standard (H.264 was chosen),
among other advanced settings.
As opposed to [1], we acquired traces while realistically
interacting with available VR applications using mouse, keyboard, and head movements. Our setup only allowed us to
interact with 3DoF, i.e., the user was seated and only head
rotations were sensed by the headset. In any case, in order
to increase the realism of the collected traces, the user was
not required to limit the type of movements, but could freely
1 riftcat.com/vridge
interact with the application of interest. To simplify the analysis of the traffic stream, audio was not activated.
For this purpose, we selected three popular VR applications targeting different types of interactions. Specifically:
• Minecraft: an extremely popular game, with the Vivecraft plugin enabling room-scale or seated VR experiences. The user can explore the virtual environment by
walking or swimming, and interact with the virtual world
by cutting trees, digging holes, crafting tools, etc.
• Virus Popper: during this fast-paced educational game,
many cartoony-looking viruses swarm a virtual room,
and the user has to attack them with cleaning tools for
survival.
• Google Earth VR – Tour: the VR version of Google
Earth, allowing a user to explore the world with satellite
imagery, 3D terrain of the entire globe, and 3D buildings
in hundreds of cities around the world. The SteamVR
application also enables tours, teleporting the user all
around the world every few seconds.
• Google Earth VR – Cities: in this case, a more interactive
experience is provided, allowing the user to fully explore
cities or landmarks for as long as they want.
Please note that Google Earth VR was used in two different
ways, thus allowing us to analyze two different versions of
the same application.
To capture streamed packets, we ran Wireshark, a popular
open-source packet analyzer, on the rendering server. The
traffic analysis was performed at 30 and 60 FPS for target data
rates of {10, 20, 30, 40, 50} Mbps and for all 4 applications
with a resolution of 1920 × 1080, for a total of over 70 GB
of PCAP traces and 4 hours of analyzed VR traffic. Our
dataset containing the processed VR traffic traces can be
found within our software and can be easily reused, as later
described in Sec. V.
B. TRAFFIC ANALYSIS
As described in [1], we were able to partially reverse engineer
both the DL and UL streams and, thanks to the help of
RiftCat’s developers, we are now able to reliably process the
raw traffic traces. We found that UDP sockets over IPv4 are
used and both UL and DL streams contain several types of
packets. Specifically, the UL stream contains packets such
as synchronization, video frame reception information, and
frequent small head-tracking information packets, whereas
the DL stream contains synchronization, acknowledgment,
and video frame packet bursts.
To improve the stream quality, the RiftCat team developed
a custom version of the ENet protocol,2 a relatively thin,
simple and robust network communication layer on top of
UDP, which offers reliable, in-order packet delivery.
In Fig. 1 we show a visual representation of a slice of
bidirectional VR streaming. The plot shows the main data
streams in both DL and UL, giving an idea of how this
transmission works.
2 Available: https://github.com/nxrighthere/ENet-CSharp
FIGURE 1. Portion of traffic trace from Virus Popper (50 Mbps, 30 FPS). For this trace, 130–140 individual fragments make up a video frame burst.
Most of the traffic is concentrated in DL and is made up of
packet bursts encoding video frames. Video frame fragments
were consistently found to be 1320 B long in all acquired
traces, with a data size (the UDP payload) of 1278 B. The
last packet of the burst also has the same size as the others,
suggesting that padding has been used in order to simplify the
protocol, although this biases the frame size distribution to be
discrete.
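The fragmentation and padding scheme inferred above can be sketched as follows (the helper names and the example frame size are ours, not VRidge's):

```python
import math

FRAGMENT_PAYLOAD = 1278  # UDP payload of each video frame fragment, in bytes

def fragments_for_frame(frame_bytes):
    """Number of fragments needed to carry an encoded video frame."""
    return math.ceil(frame_bytes / FRAGMENT_PAYLOAD)

def padded_frame_bytes(frame_bytes):
    """Total bytes on the wire once the last fragment is padded."""
    return fragments_for_frame(frame_bytes) * FRAGMENT_PAYLOAD

# A 170000 B encoded frame needs 134 fragments (171252 B after padding),
# within the 130-140 fragments per burst observed in Fig. 1.
print(fragments_for_frame(170000), padded_frame_bytes(170000))  # 134 171252
```

Padding the last fragment to the full 1278 B is what quantizes the observed frame sizes to multiples of the fragment payload.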
The second most noticeable traffic stream is the UL head
tracking information, which the headset acquires and sends
to the rendering server to update the viewport to be rendered.
The head tracking payload was identified to be either 192 B
or 97 B long, sometimes changing over the course of a single
traffic trace, although the reason why different packet sizes
were found is unclear.
Finally, smaller packets in both UL and DL, with payloads
of 21 B and 10 B, respectively, were identified to contain
feedback on the reception of video frames, which is probably
used in the streaming protocol to decide whether or not to
retransmit some frames.
By reverse-engineering the bits composing the UDP payload of video frames, it was possible to identify a recurring
set of bits suggesting a 31 B APP-layer header and allowing
us to identify some key fields, such as (i) the frame sequence
number, (ii) the number of fragments composing the frame,
(iii) the fragment sequence number, (iv) the total frame size,
and (v) a checksum. This information allowed us to reliably
process and aggregate video frames.
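Such header fields make frame aggregation straightforward. The sketch below illustrates the idea; the actual bit layout of the 31 B APP-layer header is not public, so the field order and sizes here are illustrative assumptions (the checksum is omitted):

```python
import struct
from collections import defaultdict

# Hypothetical packing of the identified fields: frame sequence number,
# number of fragments, fragment sequence number, total frame size.
# This is NOT RiftCat's real layout, only an illustration.
FIELDS = struct.Struct(">IHHI")
HEADER_LEN = 31  # reverse-engineered APP-layer header length, in bytes

def make_fragment(frame_seq, n_frag, frag_seq, frame_size, data):
    """Build a demo fragment payload with a zero-padded header."""
    header = FIELDS.pack(frame_seq, n_frag, frag_seq, frame_size)
    return header.ljust(HEADER_LEN, b"\x00") + data

def aggregate_frames(payloads):
    """Group fragment payloads by frame sequence number and reassemble."""
    frames = defaultdict(dict)
    for p in payloads:
        frame_seq, _, frag_seq, _ = FIELDS.unpack_from(p)
        frames[frame_seq][frag_seq] = p[HEADER_LEN:]
    return {seq: b"".join(f[i] for i in sorted(f)) for seq, f in frames.items()}

# Fragments may arrive out of order; reassembly sorts by fragment sequence.
frags = [make_fragment(7, 2, 1, 6, b"wor"), make_fragment(7, 2, 0, 6, b"hel")]
print(aggregate_frames(frags))  # {7: b'helwor'}
```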
Given the settings of the streaming application (i.e., frame
rate and target data rate), it is clear that a Constant Bit
Rate (CBR) video encoding is performed in the background.
In Fig. 2a we show the performance of the video encoder,
almost always exceeding the target rate (though by only
5–10%). A simple explanation of this behavior might be the
underestimation of header sizes in the computations of the
CBR encoder, such as the header of the custom ENet protocol.
Notably, both frame rates behave similarly across all four
applications, with stable performance.
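A quick sanity check supports this hypothesis: if the CBR encoder budgets only the video payload, the per-fragment protocol overhead alone falls in the observed range. This is our own calculation from the packet sizes of Sec. III-B, not a confirmed explanation:

```python
FRAGMENT_WIRE = 1320  # captured fragment length, in bytes
UDP_PAYLOAD = 1278    # fragment payload, in bytes
APP_HEADER = 31       # reverse-engineered APP-layer header, in bytes

# Video bytes actually carried per fragment, excluding the APP header.
video_bytes = UDP_PAYLOAD - APP_HEADER  # 1247 B

# If the encoder counts only video bytes toward the target rate, the
# on-the-wire stream exceeds it by the per-fragment overhead ratio.
overhead = FRAGMENT_WIRE / video_bytes - 1
print(f"{overhead:.1%}")  # 5.9%
```

A ~5.9% excess is consistent with the 5–10% gap between measured and target rates reported above.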
Figs. 2b and 2c show the low overhead due to non-video
DL and UL transmissions (including head tracking), respectively. Specifically, non-video DL traffic only accounts for
3–5 kbps, while UL traffic amounts to about 135–150 kbps, with
60 FPS traces consistently showing higher rates with respect
to 30 FPS ones, probably due to the doubled amount of
feedback. Only two out of our forty traces show different
rates, possibly due to some imperfection in the streaming.
In any case, these traffic flows are much lower than the target
rates and appear constant, irrespective of the data rate or
the application. This consideration led us to ignore them, focusing only on modeling the DL video
traffic.
Denoting by R the target data rate and by F the application
frame rate, the average video frame size is expected to be
close to the ideal value S = R/F, as shown in Fig. 2d. Note
that the x-axis reports the measured data rate rather than the
target data rate, i.e., the average data rate estimated from the
acquired traces, which differs slightly from the target rate,
as shown in Fig. 2a.
Furthermore, Fig. 2e shows that the average Inter-Frame
Inter-arrival (IFI) time perfectly matches the expected 1/F,
equal to approximately 33.3 ms for 30 FPS traces and 16.7 ms for 60 FPS
traces.
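The two expected values can be computed directly from the application settings (a trivial sketch of ours; units as in the text):

```python
def expected_frame_stats(target_rate_mbps, fps):
    """Ideal mean frame size S = R/F (bytes) and mean IFI 1/F (ms)."""
    frame_bytes = target_rate_mbps * 1e6 / fps / 8
    ifi_ms = 1e3 / fps
    return frame_bytes, ifi_ms

# 50 Mbps at 30 FPS: ~208 kB frames roughly every 33.3 ms;
# 30 Mbps at 60 FPS: 62500 B frames roughly every 16.7 ms.
print(expected_frame_stats(50, 30))
print(expected_frame_stats(30, 60))
```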
Moving to the analysis of the Probability Density Functions (PDFs), it is important to know that in a collection of packets associated with a video source, we can
usually distinguish Intra-coded frames (I-frames) (sometimes
called keyframes), Predictive-coded frames (P-frames), and
Bipredictive-coded frames (B-frames). While I-frames are
compressed similarly to simple static pictures, P-frames
exploit the temporal correlation of successive frames to
reduce the compressed frame size. B-frames, instead, can
exploit the information from both previous and subsequent
frames, further improving the compression efficiency at the
cost of non-real-time transmission. All the details associated
with these compression techniques are regulated by standards
like H.264 [33].
FIGURE 2. Results from acquired VR traffic traces.
FIGURE 3. Video frame distributions for Virus Popper (30 Mbps, 60 FPS).
Interestingly, Fig. 3a shows that the frame size distribution
is unimodal rather than multimodal, as would be expected
considering the different compression levels of the I, P, and
B frames generated by a typical H.264 encoder. As confirmed by the RiftCat team, the reason for such a smooth
frame size distribution is that the encoder makes use of the
H.264 Periodic Intra Refresh compression scheme where the
reference image used to predict (and compress) the frames in
a GoP, rather than being the first image as in H.264, is instead
obtained from consecutive vertical slices taken from all the
frames in the GoP. This results in a reduced variance of the
frame size, making the encoded video stream almost CBR.
As already mentioned, VRidge simplifies the transmission
by discretizing some units. Fig. 3a shows a clear staircase
Cumulative Distribution Function (CDF) for the video frame
size, suggesting that video frames have been padded as the
FIGURE 4. Video frame fit qualities for the Google Earth VR – Cities application. Fit quality is measured using the KS test (lower is better).
Box plots show median (red), 1st and 3rd quartiles (box), minimum and maximum (whiskers) of the KS test with a given distribution,
while markers show the exact values for the different traces.
underlying distribution is indeed discrete, with a distance
between consecutive stairs of 1278 B, i.e., the UDP payload
of packets containing fragments of video frames.
Similarly, the IFI time also appears to be discretized
with millisecond precision around the mean 1/F, as seen
in Fig. 3b, although the CDF is smoothed by noise due to, e.g., variable rendering
and encoding times, wireless channel conditions, transmission queue state, and transmission times.
IV. TRAFFIC MODEL
Following the analysis of Sec. III-B, in this section we will
describe the proposed model for VR traffic based on the
collected VR traffic traces.
The analysis presented in the previous section reveals that
both packet sizes and IFI times appear to be discrete in the
collected data traces. However, such granularity is likely due
to specific design choices of the communication protocols
used by the considered applications, rather than being a native
characteristic of the XR services. Therefore, we believe it is
more suitable to use continuous random variables to model
the size of the data blocks generated by the XR application
and the time between them. By doing so, we free our model
from the specific constraints of this streaming application,
with no loss of generality (as the discrete case can always
be obtained from the continuous one), and in fact making
it easier to accommodate other (non-discrete) cases in our
framework if needed.
A. DISTRIBUTION FITTING
Given the extremely large number of samples per trace
(200–600 s at 30 or 60 FPS), common goodness-of-fit statistical
tests yield poor performance due to the discretized distributions. Intuitively, while the PDF of discrete and continuous
distributions takes completely different forms, the CDF of a
discretized distribution is simply a staircased version of the
related continuous distribution. In that case, the goodness of
fit can be tested by comparing the CDFs, for example using
the Kolmogorov-Smirnov (KS) test [34], defined as:
KS = sup_x |F_e(x) − F_t(x)|, (1)

where sup_x is the supremum of the set of distances, F_e(x) is
the empirical CDF of the acquired data, and F_t(x) is the CDF
of the target distribution. The KS test will thus be used to
score the quality of fit, where values closer to zero indicate a
better parameter estimation.
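Eq. (1) can be computed directly from the sorted samples. The minimal implementation below (our own sketch, not the paper's code) checks the empirical CDF on both sides of each step, where the supremum over a staircase function is attained:

```python
import numpy as np

def ks_statistic(samples, target_cdf):
    """KS = sup_x |F_e(x) - F_t(x)| between a sample and a target CDF."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    ft = target_cdf(x)
    # The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted
    # sample, so the supremum is attained on one side of some jump.
    d_plus = np.max(np.arange(1, n + 1) / n - ft)
    d_minus = np.max(ft - np.arange(0, n) / n)
    return max(d_plus, d_minus)

# Four uniform samples against the Uniform(0, 1) CDF: KS ~ 0.15.
print(ks_statistic([0.1, 0.4, 0.6, 0.9], lambda x: x))
```

In practice, `scipy.stats.kstest` computes the same statistic (plus a p-value) and is what we use below.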
To fit and evaluate the best probability distributions for
our data, we used the popular SciPy library [35]. We tested
15 of the most common continuous univariate distributions
available in the scipy.stats package, evaluating their
performance on both frame size and IFI on our traffic traces.
Note that the SciPy library performs a maximum likelihood estimation of the parameters of the distribution, including location and scale, and applies them to all continuous
distributions by transforming the random variable X into
(X − loc)/scale. Given the excellent agreement between
expected values and computed averages (see Figs. 2d and 2e)
and considering the proposed generative model (described in
Sec. IV-B), we fixed the location parameter to the expected
value (i.e., R/F for the frame size and 1/F for the IFI), fitting
only the scale and the remaining parameters. A selection of
distributions is shown in Fig. 4.
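The fitting step described above can be sketched with SciPy. The trace below is a synthetic stand-in (a hypothetical 30 Mbps, 60 FPS acquisition), not one of the real traces; the key point is SciPy's `floc` keyword, which pins the location parameter to the expected value R/F during the maximum likelihood fit, as done in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for a frame-size trace: hypothetical 30 Mbps
# target rate at 60 FPS, so the expected frame size is R/F bits.
R, F = 30e6, 60
trace = stats.logistic.rvs(loc=R / F, scale=0.04 * R / F,
                           size=20_000, random_state=rng)

# Maximum likelihood fit with the location fixed to R/F (`floc`
# pins `loc`), so only the scale parameter is estimated.
loc, scale = stats.logistic.fit(trace, floc=R / F)

# Score the fit with the Kolmogorov-Smirnov statistic of Eq. (1):
# values closer to zero indicate a better parameter estimation.
ks = stats.kstest(trace, stats.logistic(loc=loc, scale=scale).cdf).statistic
```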
We found that the Student’s t and Logistic distributions,
closely followed by the Laplace, Gaussian, and Cauchy distributions, were the best fitting ones in almost all traces
for both frame size and IFI. Fig. 5 shows how similar the
fitted distributions actually are.

FIGURE 5. Comparison of the three best fitting distributions for Virus Popper (30 Mbps, 60 FPS). The KS test is also shown, where lower values indicate a better fit.

Although the Student's t distribution performs slightly better
than the Logistic one in a slight majority of the collected
traces, the Logistic distribution was the best choice in our
case. In fact, the third
parameter of the Student’s t distribution is only able to
yield minuscule improvements over the Logistic distribution,
which only needs two parameters. Furthermore, if custom
simulators need to manually implement the desired random
stream, the Student’s t distribution is very hard to reproduce [36], while the Logistic distribution requires a simple
transformation. This is the case when common libraries for
random number generation cannot be used, such as in our
implementation described in Sec. V.
As a reference, we use SciPy’s definition of a logistic
distribution, with PDF in its standardized form as follows:
f(x) = e^{−x} / (1 + e^{−x})^2 ,    (2)
To shift or scale the distribution, the location and scale
parameters are used as previously described.
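When common random number libraries are not available (as in our ns-3 implementation), a Logistic variate can be obtained from a single uniform draw, since the Logistic CDF inverts in closed form. A minimal sketch of this simple transformation, with hypothetical parameter values:

```python
import math
import random

def logistic_variate(loc, scale, u=None):
    """Inverse-transform sampling of the Logistic distribution.

    The CDF F(x) = 1 / (1 + e^{-(x - loc)/scale}) inverts in closed
    form, so a uniform draw u in (0, 1) maps to the sample
    x = loc + scale * ln(u / (1 - u)).
    """
    if u is None:
        u = random.random()
    return loc + scale * math.log(u / (1.0 - u))

# Example with hypothetical values: mean frame size R/F for a
# 30 Mbps, 60 FPS stream, and an illustrative scale parameter.
R, F = 30e6, 60
frame_size = logistic_variate(R / F, 0.05 * R / F)
```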
B. GENERATIVE MODEL
Now that we have characterized and fitted the statistical distributions
of the 40 acquired traces, we want to define a generative
model that allows a user to synthesize XR traffic at
will, be it for analysis or simulation purposes. As already
discussed in previous sections, in this paper we propose a
simple generative model that only attempts to capture the
statistical distributions of the video frame size and Inter-Frame
Inter-arrivals (IFIs), leaving higher-order statistical descriptions
for future work.
We define the dispersion as the ratio of the scale over the
location parameter, attempting to find a common value for
both frame rates, since absolute values are likely to differ by
a constant factor (see Figs. 2d and 2e). While data aggregation
is doable for frame sizes (as shown, for example, in Fig. 6a),
data for IFI did not allow us to do so. As shown in Fig. 6b,
in fact, data for 30 and 60 FPS behaves differently, making it
impossible for us to create a single model for this parameter.
This implies that our model will only be able to generalize
over data rates, whereas 30 and 60 FPS are the only supported
frame rates: modeling and testing different values would
require new data for the corresponding frame rate.
After carefully studying the acquired traffic traces, we propose to generalize the scale parameters for both video frame
size and IFI time with a power law, namely:
y = a x^b .    (3)
Furthermore, as Fig. 6b suggests, the 60 FPS IFI fits for
all applications resulted in |b| < 10^{-4}, suggesting a constant
behavior, irrespective of the data rate. In that case, we thus
assumed a constant fit (a corner case of the power law with b = 0)
by computing the average value across all tested target data
rates.
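The power-law generalization of Eq. (3) can be fitted, for instance, by linear least squares in log-log space; a sketch follows, where the dispersion values are made up for illustration (the real fitted values are those in Table 1).

```python
import math

def fit_power_law(x, y):
    """Fit y = a * x**b by linear least squares on (log x, log y)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(x)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / \
        sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b

# Hypothetical dispersion values at the five target data rates [Mbps].
rates = [10, 20, 30, 40, 50]
dispersions = [0.20, 0.16, 0.14, 0.13, 0.12]
a, b = fit_power_law(rates, dispersions)

# If |b| is negligible (as for the 60 FPS IFI fits), fall back to a
# constant model: the average dispersion across all target rates.
if abs(b) < 1e-4:
    a, b = sum(dispersions) / len(dispersions), 0.0
```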
As can clearly be observed from the collected data, the proposed
model has been extracted from acquisitions between
about 10 and 50 Mbps; using it beyond these limits is thus
not advisable, since we have no data to validate the quality
of the synthetic traces outside this range.
We let different applications have separate models, obtaining a data set of 10 traces per application (half at 30 FPS,
half at 60 FPS). The parameters for all applications can be
found in Table 1 and the generative algorithm is summarized
in Algorithm 1.
TABLE 1. Parameters of the proposed generative model. Each
VR application is characterized by five parameters: two for the frame size
dispersion D_FS = α x^β, one for the 60 FPS IFI dispersion D_IFI = γ, and two
for the 30 FPS IFI dispersion D_IFI = δ x^ε.
FIGURE 6. Generalization models for the Google Earth VR – Cities application. Individual points show the scale parameter of the Logistic model fitted on the acquired data, while the dashed red lines attempt to generalize the model to intermediate target data rates.

Algorithm 1 Generative Model for XR Traffic
Require: AppName, FrameRate, DataRate
1: FsAvg = DataRate / FrameRate
2: IfiAvg = 1 / FrameRate
3: α, β, γ, δ, ε = GetParameters(AppName)  {see Table 1}
4: FsDispersion = α · DataRate^β
5: FsScale = FsDispersion · FsAvg
6: if FrameRate == 60 then
7:     IfiDispersion = γ
8: else if FrameRate == 30 then
9:     IfiDispersion = δ · DataRate^ε
10: else
11:     Error: only 30 and 60 FPS supported
12: end if
13: IfiScale = IfiDispersion · IfiAvg

V. SIMULATION RESULTS
To further test the validity of the proposed model, we implemented
it on top of Network Simulator 3 (ns-3), a popular
open-source full-stack simulation software, and made it publicly
available together with the processed VR traffic traces
in CSV format.3 Further details on the implementation of this
traffic model on ns-3 can be found in [1].
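As a sketch, Algorithm 1 can be implemented in a few lines of Python. The parameter tuple below is a placeholder (the real per-application values are those listed in Table 1), and the Logistic sampler uses the inverse-transform trick mentioned earlier.

```python
import math
import random

# Placeholder parameters (alpha, beta, gamma, delta, epsilon) for one
# application; the actual fitted values are listed in Table 1.
PARAMS = {"GoogleEarthVrCities": (0.10, -0.30, 0.05, 0.20, -0.40)}

def generative_model(app_name, frame_rate, data_rate):
    """Algorithm 1: Logistic (location, scale) pairs for the frame
    size [Mbit] and the inter-frame interval (IFI) [s]."""
    alpha, beta, gamma, delta, epsilon = PARAMS[app_name]
    fs_avg = data_rate / frame_rate        # frame-size location, R/F
    ifi_avg = 1.0 / frame_rate             # IFI location, 1/F
    fs_scale = (alpha * data_rate ** beta) * fs_avg
    if frame_rate == 60:
        ifi_dispersion = gamma
    elif frame_rate == 30:
        ifi_dispersion = delta * data_rate ** epsilon
    else:
        raise ValueError("only 30 and 60 FPS supported")
    return (fs_avg, fs_scale), (ifi_avg, ifi_dispersion * ifi_avg)

def logistic_variate(loc, scale):
    # Inverse-transform sampling of the Logistic distribution.
    u = random.random()
    return loc + scale * math.log(u / (1.0 - u))

# Draw one synthetic frame: size [Mbit] and inter-frame interval [s]
# for a hypothetical 30 Mbps, 60 FPS Google Earth VR - Cities stream.
(fs_loc, fs_scale), (ifi_loc, ifi_scale) = generative_model(
    "GoogleEarthVrCities", 60, 30.0)
frame_size_mbit = logistic_variate(fs_loc, fs_scale)
ifi_seconds = logistic_variate(ifi_loc, ifi_scale)
```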
To test our model, we set up simulation campaigns where
multiple users equipped with HMDs communicate with a
central Wi-Fi Access Point (AP), using a wireless connection based on the IEEE 802.11ac standard. The central AP
also acts as rendering server, generating one VR stream for
each receiving Station (STA) of the scenario. Transmissions
randomly start within the first second of the simulation,
so that different streams do not all start at the same time.
We show results for traffic streams imported directly from
the acquired traces as well as for our model. Since a single
trace is available for each parameter combination (i.e., application, frame rate, data rate), for a fixed parameter combination the traffic flows will all come from the same trace,
although different 60 s windows are sampled to further decouple different users. Instead, simulations running our proposed
model have been run twice: once with the target data rate
submitted to VRidge when acquiring the corresponding trace,
and once with the empirical data rate measured directly from
the acquired traces (the two rates differ slightly, as can be seen
from Fig. 2a). This information can also be found directly in
the metadata of the acquired traces, made available in CSV
format together with our model.

3 Available in the ns-3 app store: https://apps.nsnam.org/app/bursty-app/,
Release v1.0.0
A. MODEL VALIDATION
Exhaustive simulation campaigns have been run for all four
applications and five data rates at both 30 and 60 FPS, each
repeated 10 times to obtain solid average statistics. Confidence intervals are not shown as they are extremely tight.
Additional simulation parameters are shown in Table 2.
TABLE 2. List of simulation parameters.
In the following section, plots will show burst-level rather
than fragment-level metrics, which, in the case of a video stream,
are much more informative and give a more realistic perspective
on the quality perceived by the user. In fact, in this
case we are more interested in the performance of full
video frames rather than of single packets: all packets
from a burst have to be collected before the HMD is
able to process and show the frame to the user.
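The burst-level view can be sketched as follows (the function name and the timestamps are illustrative, not taken from the simulator):

```python
def frame_delay(tx_time, fragment_rx_times):
    """Burst-level delay: a video frame can only be decoded and shown
    once ALL of its fragments have been received, so the frame delay
    is set by the last-arriving fragment, not by the average packet."""
    return max(fragment_rx_times) - tx_time

# Hypothetical example: a frame sent at t = 0.000 s, fragmented into
# four packets received at slightly different times [s].
delay = frame_delay(0.000, [0.0021, 0.0024, 0.0019, 0.0028])
# The burst-level delay is 2.8 ms even though the mean fragment
# delay is lower; this is the metric plotted in the figures below.
```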
To validate our proposed model, we simulate a scenario as
similar as possible to our acquisition setup, where a rendering
server transmits the VR stream to a single user. Note that the
Wi-Fi connection can sustain hundreds of megabits per
second, so a single user transmitting up to 50 Mbps
largely underutilizes the channel, allowing us to obtain
results that are unbiased with respect to the limits of the channel
capacity. We simulated all 40 combinations of parameters
(4 applications, 5 data rates, 2 frame rates), although we only
show results for the 10 related to the Google Earth VR – Cities
application in Fig. 7.
FIGURE 7. Simulation results for a single user streaming the Google Earth VR – Cities application over a Wi-Fi link. The statistics refer to
fully received frames rather than to single fragments.
In Fig. 7a we show the average throughput obtained by
the 3 simulation campaigns in the 10 parameter sets. Clearly,
both 30 and 60 FPS runs obtain similar results, since this
metric disregards the frame rate. In fact, both models targeting the nominal rate (shown with dots and circular markers)
are perfectly superimposed on the main diagonal. Simulations using the original traffic traces, instead, tend to have
a slightly higher throughput (solid line with cross markers),
as was expected by looking at Fig. 2a. Since data rate, frame
size, and, consequently, latency are correlated, we matched our
model's data rate with the empirical one, as shown by the
dashed line with square markers. As the flexibility of our
model allows us to choose an arbitrary target rate, we can see
a perfect match in the computed average throughput.
In Fig. 7b, instead, we show the average frame delay
measured from the Application (APP) layer of the AP to
the APP layer of the STA. Processing, encoding/decoding
and other technical delays must be added to obtain the full
motion-to-photon latency, and thus the network delay should
remain below 5–10 ms, as mentioned in Sec. I. The most
noticeable difference with respect to the previous figure is
that the two frame rates are clearly separated. This is because
our reference application, described in Sec. III-B, allows us to
choose a target data rate, trying to maintain a CBR transmission during the whole duration of the stream. This translates
into frame sizes which depend directly on the frame rate,
following the formula S = R/F as described in Sec. III-B.
Since the channel capacity for these simulations is kept constant, doubling the frame rate halves the frame sizes which,
in turn, halves the average video frame delay. As expected,
the model using the target rate slightly underestimates the
average frame delay, which depends on the real application
throughput, always slightly lower than the one empirically
computed from the traffic traces. Instead, similarly to the
average throughput, setting the model to the trace’s empirical
rate yields an almost perfect match with the VR traces we
acquired. Finally, notice that the average frame delay always
remains below 3 ms, well below the bound suggested by
industry experts [2], [4], [9], [14], [19].
To complete this analysis, in Fig. 7c we report the 95th percentile delay performance of our simulations. This metric is
important as it gives an idea of the worst-case performance of
the network. In fact, only ensuring average performance is not
enough to obtain a smooth and appreciable user experience,
since frequent stutters in the streamed video might easily
ruin the interactivity of the application and even disorient
the user. Ensuring that the 95th percentile of the delay is
within acceptable bounds allows for a more fluid and overall
better experience. In the case under analysis, it can be easily
seen that both models using target and empirical rate slightly
underestimate the frame delay of the acquired traces. It is
likely that the fitted Logistic distribution is not able to fully
grasp the minute details of the traffic trace, making our model
unable to match the real traffic.
Note that, while these results are bound to the specifications
of the network under analysis (e.g., MCS, channel
width, guard interval duration, fragment size, presence of
RTS/CTS, Wi-Fi standard, mobility, environment), the framework
that we proposed is general. This suggests that it can
be used to study a variety of more or less complex scenarios
and network architectures with different sets of parameters,
assessing how they affect the end-to-end performance.
To conclude, it appears that our model is indeed able to reliably predict average statistics, while it could still be improved
to better mimic slightly more advanced and specific features.
These refinements will be pursued in our future work.
B. USE CASE EXAMPLE
Finally, we propose a simple example use case for our
VR traffic generator. We consider a VR arena setting, where
multiple users are attached directly to a single AP streaming
wirelessly. We assume that each user requests a 50 Mbps
stream and observe how many STAs can be supported by an
arena with an analogous setup.
As expected, we notice again from Fig. 8 that our model
needs to be calibrated against the empirical rate of the
acquired trace to yield reliable results. In fact, from Fig. 8a
we can see that the average throughput of the calibrated model
perfectly matches the throughput of the traffic trace up to at
least 8 users, where the network is able to support more than
400 Mbps.
In Fig. 8b it is possible to see an unstable network condition, when 8 users are trying to stream simultaneously.
FIGURE 8. Simulation results for multiple users streaming the Google Earth VR – Cities application over a Wi-Fi link. The statistics refer to
fully received frames rather than to single fragments.
It appears that the slightly higher throughput required by the
trace and the empirical rate model with respect to the target
rate model is enough to push the network to its limit, resulting
in a sudden increase of the average frame delay, at both
30 and 60 FPS. Focusing on the 30 FPS simulations, the plot
shows that up to 6 users can be supported within the 5 ms
bound, while 7 users slightly exceed this limit, and finally
8 users make the network unstable and are thus pushed over
the 10 ms limit for both the trace and the model using the
empirical rate. It is important to notice that the more unstable
the network, the worse the prediction accuracy of our model.
This is probably due to the simplifications that we introduced,
such as the Logistic distribution and the uncorrelated samples
for both the IFI time and frame size stochastic processes.
Similarly, at 60 FPS, up to 7 users can be supported, but an
additional user makes the system highly unstable, with
poor prediction performance from our model.
Finally, in Fig. 8c we show the results for the 95th percentile of the delay. Similarly to the average delay, this metric
also shows the instability of the network for 8 users with much
worse performance. Focusing on 30 FPS, the system is able
to keep the delay below the 5 ms bound only when no more
than 2 users are present, whereas up to 5–6 users can be served
if a 10 ms delay is still deemed acceptable. Instead, at 60 FPS
up to 6 users can be served while keeping the network delay
within 5 ms, while the 10 ms limit is only surpassed when the
network becomes unstable with 8 users.
These counterintuitive conclusions come from the fact that
the application fixes a data rate, not a quality of experience.
This means that doubling the frame rate results in halving
the frame size, thus reducing the perceived image quality of
the streamed application, which turns into an almost halved
delay. Fixing a constant bit rate thus results in higher frame
rates yielding lower latencies, at the cost of a lower image
quality.
In general, there is good accordance between the results
predicted by the calibrated model and the traffic traces, while
the uncalibrated model often shows overly optimistic results.
When the traffic in the network increases too much and the
network becomes unstable, the three simulations diverge significantly, making our synthetic traces less reliable, although
this is a corner case that might be of lesser interest.
VI. XR TRAFFIC MODELING ROADMAP
Starting from the model described in the previous sections,
in the following we propose an end-to-end framework to
evaluate network solutions, tailored for XR applications. The
goal is to list and detail the tasks required for the construction
of such a framework, in order to encourage researchers in this
field to advance with their work the state of the art, using our
baseline as a valid starting point.
While Sec. VI-A is devoted to highlighting our contributions,
in Secs. VI-B to VI-E we set down each additional task,
describing how it can lead to the optimization of network
protocols.
A. EXPLOITING FIRST-ORDER STATISTICS
The model proposed in this paper, despite its basic functionalities, represents a solid foundation on top of which future
works can iterate to develop more sophisticated strategies.
In particular, we designed an open-source, highly customizable setup (described in Sec. III) to acquire traffic traces by
sniffing the packets traveling on the local network where the
experiments were conducted.
At this stage, packets are generated following first-order
statistics, sampling the frame size and inter-frame interval from the
distributions fitted on the collected data (see Sec. III-B). As a
consequence, with this model we can emulate the creation
of application frames that replicate the strategy implemented
by the rendering server used in our experiments. While this
model is already useful for some applications, it lends itself to
several interesting extensions, which capture other important
features of the statistics of XR traffic. As an example, in the
rest of this section we discuss the importance of studying the
correlation between different packets and of understanding
how the movements of the user impact the generated traffic
as two key areas of future improvement for our model.
B. INTRODUCING TEMPORAL CORRELATION
More advanced studies can be carried out to improve the
model with additional features. One important aspect to elaborate on is the correlation among subsequent frames, or even
within a specific group of frames.
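As an illustrative (and so far unvalidated) direction, such temporal correlation could be introduced by replacing the i.i.d. frame-size draws with, e.g., a first-order autoregressive process that preserves a given mean and variance while correlating consecutive frames; this is a hypothetical extension, not part of the model fitted in this paper.

```python
import math
import random

def correlated_frame_sizes(n, mean, std, rho):
    """Sketch of an AR(1) frame-size process: each frame deviates
    from the mean by rho times the previous deviation plus Gaussian
    noise, so the marginal variance stays at std**2 while consecutive
    frames have correlation coefficient rho."""
    sizes = []
    dev = 0.0
    noise_std = std * math.sqrt(1.0 - rho ** 2)
    for _ in range(n):
        dev = rho * dev + random.gauss(0.0, noise_std)
        sizes.append(mean + dev)
    return sizes

# Hypothetical values: 0.5 Mbit mean frame size, 10% std, strong
# frame-to-frame correlation.
sizes = correlated_frame_sizes(1000, mean=0.5e6, std=0.05e6, rho=0.8)
```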
As mentioned in Sec. III-B, when compressing a
video stream both intra-frame and inter-frame compression
techniques could be exploited, and this influences not only
the structure of the packets (since the type of compression
greatly affects the frame size), but also the strategy to inject
them into the network. It is also possible that some manufacturers use advanced coding techniques such as Periodic
Intra Refresh, as was explained in Sec. III-B for the streaming
application used for our analysis, or more advanced standards
such as H.265 [37] using different compression techniques.
In that case, the importance of temporal correlation might
decrease, although further analysis should be carried out to
ensure this.
It should be clear, by now, that the availability of a model
capable of generalizing how such frame sequences are created, independently of the technical setup, is important, and
the fact that each manufacturer may use its own policy represents an additional challenge. In addition, having a model that
integrates and generalizes the temporal correlation among
frames would allow researchers to elaborate strategies to
guarantee a certain level of latency and throughput, for example by giving different priorities and scheduling options to
different types of packets.
For applications with constant delay requirements and high
values of FPS, a solution could be to buffer (at the device side
or at the rendering server) specific packets associated with
keyframes, in order to improve the encoding process. This
would require stable network performance and an application
capable of communicating directly with the network, e.g.,
exploiting cross-layer solutions, to be aware of any change of
the link quality that would trigger specific countermeasures
or improvements, if applicable.
C. INTRODUCING HEAD TRACKING
A further improvement of the model should exploit the information on movement tracking, in particular related to the
head, for all 6DoF. In this case, sniffing the packets traveling
through the network might not be enough, and we thus need
to gather information from different sensors (e.g., gyroscope,
accelerometer and compass), that could be integrated into the
device used to interact with the virtual world.
With respect to VRidge, the software that we used to make
our phone act as a VR headset and our PC as a rendering server, the developers provide an API for this purpose.4
By connecting to the head tracking endpoint, the software
provides positional, rotational, or combined data, and even
the possibility of modifying phone tracking data in real time
before it is used for the rendering step.
This is important because, by aligning the motion trace
with the traffic generated by the application, it can be determined whether there is a correlation between a certain movement of the user and a corresponding drop in the reception
of packets, or other network-related events. For example,
knowing the direction of the physical movement of the user
might help mmWave wireless systems (such as 802.11ad/ay)
keep beam alignment between the AP and the user device,
4 https://github.com/RiftCat/vridge-api
thus limiting the risk of abrupt connection interruptions if the
line of sight is lost.
It is to be highlighted that this approach could benefit every
communication infrastructure that can be used to deliver XR
content, as user tracking data can be exploited at different
layers of the protocol stack.
D. FULL TRAFFIC EMULATOR
The last step to further increase the fidelity (but also the
complexity) of the traffic model is to fully characterize and
emulate all the different information sub-flows and how they
interact with each other. For example, as shown in Fig. 1 and
explained in Sec. III-B, the VR stream comprises both DL and
UL messages containing information such as video frames,
head tracking information, and feedback.
A full-blown emulator would send all this information to
and from the user, reacting accordingly whenever a packet is
lost or corrupted, or when communication delays are present.
This level of detail requires a much more in-depth analysis of the transmission protocol of a real XR application,
understanding all the consequences of erratic and unexpected
behaviors of the network.
Such a precise model would be extremely useful when
running large simulation campaigns as it would give the most
accurate and reliable results. However, the amount of work
required to analyze and reproduce a realistic behavior would
be extremely high.
E. QoE-CENTRIC XR
As highlighted thoroughly in the previous paragraphs,
the final goal of all these approaches is to guarantee high-level
performance to the final user. In particular, in the XR domain,
we tend to measure the performance in terms of overall satisfaction of the customers, referred to as QoE, and, to the best
of our knowledge, there is no standardized way to evaluate
these metrics.
In our case, besides the quality of the displayed image,
the latency of the communication between the HMD and the
rendering server can also make a difference (especially if the latter
is in the cloud), considering that cybersickness has a huge
impact on the user experience. For this reason, researchers
should be encouraged to design algorithms that guarantee
stable and constant performance, taking into account that the
traffic in the network varies depending on the application and
user activity.
Moreover, since in a common scenario we have different users, there may be a need to support different traffic categories at the same time in the same network. This
requires a system able to fairly distribute resources among
the flows, where learning algorithms could be implemented
to orchestrate every operation, either from a network or from
an application perspective.
Given a certain condition of the user, or other available
information, the algorithm could predict the QoE trend and
act accordingly in case of an anticipated performance drop.
At this point on the roadmap, the network design should
focus on the user, trying to guarantee a stable experience
also when Variable Bit Rate (VBR) flows are considered.
In fact, in a CBR flow (much easier to handle from a network
point of view), the perceived image quality can be affected
in case of a scene with a large amount of action and details.
In this case, it may be difficult to fit everything at a fixed rate
and, as a consequence, the user experiences a downgrade in
terms of quality. This further highlights the need for novel
solutions, able to tackle these problems by trading off system
complexity and QoE.
VII. CONCLUSION
In this paper we described the current state of the art regarding the telecommunication aspects needed to support high-quality XR streaming, mainly focusing on the challenges
involved in obtaining faithful traffic models that the community
can use to test protocols and optimize networks.
We then proceeded to acquire over 4 hours of VR traffic,
study in detail this type of traffic, and propose a model to
generate synthetic traffic traces, while also making freely
available to the community both our implementation and the
VR dataset.
Finally, we showed some results on the predictive power
of our model, while also acknowledging its weak points.
Furthermore, we provided an example use case where multiple users coexist in the same network, naively sharing radio
resources up to the point of network collapse. Further work could better study
effective scheduling strategies for XR traffic streams, possibly coexisting with other applications in the same network
while also ensuring robustness in case of fluctuating channel
quality. Also, the model could be tested and validated for
higher values of FPS, by collecting and analyzing additional
traces at 90 FPS or higher. All the tasks that we think are
necessary to build a complete framework for traffic generation have been listed in Sec. VI and represent possible future
directions for this work.
With this contribution, we hope to pave the way for the
research community to start working towards the optimization and support of this specific type of traffic, given the
extreme interest from the main standard bodies and the most
prominent telecommunication industries.
ACKNOWLEDGMENT
The authors would like to thank the RiftCat Team for patiently
answering all the disclosable technical questions they asked,
allowing them to improve their work. The identification
of any commercial product or trade name does not imply
endorsement or recommendation by NIST, nor is it intended
to imply that the materials or equipment identified are
necessarily the best available for the purpose. A preliminary version of this paper proposing a simpler traffic model with fewer traffic traces was presented at the
ACM Workshop on ns-3 (WNS3), in June 2021 [DOI:
10.1145/3460797.3460807].
REFERENCES
[1] M. Lecci, A. Zanella, and M. Zorzi, ‘‘An ns-3 implementation of a bursty
traffic framework for virtual reality sources,’’ presented at the ACM Workshop ns-3 (WNS3), Jun. 2021, pp. 1–9, doi: 10.1145/3460797.3460807.
[2] Huawei Technologies. (2016). Empowering Consumer-Focused Immersive VR and AR Experiences With Mobile Broadband. [Online].
Available: https://www.huawei.com/en/industry-insights/outlook/mobilebroadband/insights-reports/vr-and-ar
[3] Oculus Business. (Sep. 2020). Virtual Reality—Set to Enter the Business
Mainstream. [Online]. Available: https://go.facebookinc.com/securitywhitepaper.html
[4] Huawei Technologies. (2021). AR Insight and Application Practice.
[Online]. Available: https://carrier.huawei.com/~/media/CNBGV2/downl
oad/bws2021/ar-insight-and-application-practice-white-paper-en.pdf
[5] PriceWaterhouse Coopers. (2019). Seeing is Believing–How
Virtual Reality and Augmented Reality are Transforming Business
and the Economy. [Online]. Available: https://www.pwc.com/
gx/en/technology/publications/assets/how-virtual-reality-and-augmentedreality.pdf
[6] ZTE. (2019). 5G Cloud XR Application. [Online]. Available:
https://www.mobile360series.com/wp-content/uploads/2019/09/zte-white
-paper.pdf
[7] Qualcomm Technologies. (Nov. 2020). The Mobile Future of eXtended
Reality (XR). [Online]. Available: https://www.qualcomm.com/media/
documents/files/the-mobile-future-of-extended-reality-xr.pdf
[8] Ericsson. (Apr. 2020). How 5G and Edge Computing Can Enhance
Virtual Reality. [Online]. Available: https://www.ericsson.com/en/blog/20
20/4/how-5g-and-edge-computing-can-enhance-virtual-reality
[9] 5G Americas. (Nov. 2019). 5G Services Innovation. [Online]. Available:
https://www.5gamericas.org/wp-content/uploads/2019/11/5G-ServicesInnovation-FINAL-1.pdf
[10] Deloitte. (Apr. 2018). Real Learning in a Virtual World. [Online].
Available: https://www2.deloitte.com/us/en/insights/industry/technology/
how-vr-training-learning-can-improve-outcomes.html
[11] F. Chiariotti, ‘‘A survey on 360-degree video: Coding, quality of experience
and streaming,’’ Comput. Commun., vol. 177, pp. 133–155, Sep. 2021.
[12] Nokia. (2020). Cloud gaming and 5G–Realizing the opportunity. [Online].
Available: https://onestore.nokia.com/asset/207843
[13] Huawei. (2020). Preparing for a Cloud AR/VR Future. [Online].
Available:
https://www-file.huawei.com/-/media/corporate/
pdf/x-lab/cloud_vr_ar_white_paper_en.pdf?la=en
[14] Extended Reality (XR) in 5G, document (TR) 26.928, 3GPP, Dec. 2020.
[15] L. J. Hettinger and G. E. Riccio, ‘‘Visually induced motion sickness in
virtual environments,’’ Presence, Teleoperators Virtual Environ., vol. 1,
no. 3, pp. 306–310, Jan. 1992.
[16] E. L. Groen and J. E. Bos, ‘‘Simulator sickness depends on frequency of
the simulator motion mismatch: An observation,’’ Presence, Teleoperators
Virtual Environ., vol. 17, no. 6, pp. 584–593, 2008.
[17] S. von Mammen, A. Knote, and S. Edenhofer, ‘‘Cyber sick but still having
fun,’’ in Proc. 22nd ACM Conf. Virtual Reality Softw. Technol., Nov. 2016,
pp. 325–326.
[18] H. G. Kim, W. J. Baddar, H.-T. Lim, H. Jeong, and Y. M. Ro, ‘‘Measurement of exceptional motion in VR video contents for VR sickness
assessment using deep convolutional autoencoder,’’ in Proc. ACM Symp.
Virtual Reality Softw. Technol. (VRST), Gothenburg, Sweden, Nov. 2017,
pp. 1–7.
[19] Requirements for Mobile Edge Computing Enabled Content Delivery Networks, document F.743.10, ITU, Nov. 2019.
[20] Orange. (Jul. 2020). XR: 5G Extends the Boundaries of Reality.
[Online]. Available: https://hellofuture.orange.com/en/xr-5g-extends-theboundaries-of-reality/
[21] Samsung Research. (Jul. 2020). Samsung 6G Vision. [Online]. Available:
https://news.samsung.com/global/samsungs-6g-white-paper-lays-out-thecompanys-vision-for-the-next-generation-of-communications-technology
[22] J. N. Latta and D. J. Oberg, ‘‘A conceptual virtual reality model,’’ IEEE
Comput. Graph. Appl., vol. 14, no. 1, pp. 23–29, Jan. 1994.
[23] B. Hentschel, M. Wolter, and T. Kuhlen, ‘‘Virtual reality-based multi-view
visualization of time-dependent simulation data,’’ in Proc. IEEE Virtual
Reality Conf., Mar. 2009, pp. 253–254.
[24] E. Saad, W. R. J. Funnell, P. G. Kry, and N. M. Ventura, ‘‘A virtualreality system for interacting with three-dimensional models using a haptic
device and a head-mounted display,’’ in Proc. IEEE Life Sci. Conf. (LSC),
Oct. 2018, pp. 191–194.
[25] O. Ergun, S. Akin, I. G. Dino, and E. Surer, ‘‘Architectural design in
virtual reality and mixed reality environments: A comparative analysis,’’
in Proc. IEEE Conf. Virtual Reality 3D User Interface (VR), Mar. 2019,
pp. 914–915.
[26] R. Guo, J. Cui, W. Zhao, S. Li, and A. Hao, ‘‘Hand-by-hand mentor: An
AR based training system for piano performance,’’ in Proc. IEEE Conf.
Virtual Reality 3D User Interface Abstr. Workshops (VRW), Mar. 2021,
pp. 436–437.
[27] J. Yang, J. Luo, D. Meng, and J.-N. Hwang, ‘‘QoE-driven resource allocation optimized for uplink delivery of delay-sensitive VR video over cellular
network,’’ IEEE Access, vol. 7, pp. 60672–60683, 2019.
[28] L. Teng, G. Zhai, Y. Wu, X. Min, W. Zhang, Z. Ding, and C. Xiao, ‘‘QoE
driven VR 360° video massive MIMO transmission,’’ IEEE Trans. Wireless
Commun., early access, Jul. 9, 2021, doi: 10.1109/TWC.2021.3093305.
[29] C. Perfecto, M. S. Elbamby, J. D. Ser, and M. Bennis, ‘‘Taming the latency
in multi-user VR 360°: A QoE-aware deep learning-aided multicast framework,’’ IEEE Trans. Commun., vol. 68, no. 4, pp. 2491–2508, Apr. 2020.
[30] B. Krogfoss, J. Duran, P. Perez, and J. Bouwen, ‘‘Quantifying the value
of 5G and edge cloud on QoE for AR/VR,’’ in Proc. 12th Int. Conf. Qual.
Multimedia Exper. (QoMEX), Athlone, Ireland, May 2020, pp. 1–4.
[31] X. Liu, Y. Huang, L. Song, R. Xie, and X. Yang, ‘‘The SJTU UHD
360-degree immersive video sequence dataset,’’ in Proc. Int. Conf. Virtual
Reality Visualizat. (ICVRV), Oct. 2017, pp. 400–401.
[32] W.-C. Lo, C.-L. Fan, J. Lee, C.-Y. Huang, K.-T. Chen, and C.-H. Hsu,
‘‘360° video viewing dataset in head-mounted virtual reality,’’ in Proc. 8th
ACM Multimedia Syst. Conf., Taipei, Taiwan, Jun. 2017, pp. 211–216.
[33] Advanced Video Coding for Generic Audiovisual Services, document H.264, ITU-T, Aug. 2021.
[34] I. M. Chakravarti, R. G. Laha, and J. Roy, Handbook of Methods of Applied
Statistics, vol. 1. Hoboken, NJ, USA: Wiley, 1967.
[35] P. Virtanen et al., ‘‘SciPy 1.0: Fundamental algorithms for scientific computing in Python,’’ Nature Methods, vol. 17, pp. 261–272,
Feb. 2020.
[36] W. T. Shaw, ‘‘Sampling Student’s T distribution – use of the inverse cumulative distribution function,’’ J. Comput. Finance, vol. 9, no. 4, pp. 37–73,
2006.
[37] High Efficiency Video Coding, document H.265, ITU-T, Aug. 2021.
MATTIA LECCI (Graduate Student Member,
IEEE) received the B.Sc. degree (Hons.) in information engineering and the M.Sc. degree (Hons.)
in telecommunication engineering from the University of Padua, Italy, in 2016 and 2018, respectively, where he is currently pursuing the Ph.D.
degree in information engineering.
He was a Guest Researcher with the National
Institute of Standards and Technology (NIST),
in 2018. His main research interests include channel modeling for the mmWave frequency band, MAC scheduling for WiGig
technologies, applied machine learning for communications, virtual reality
traffic modeling, and open-source software development.
MATTEO DRAGO (Graduate Student Member,
IEEE) received the B.Sc. and M.Sc. degrees in
telecommunication engineering from the University of Padua, Italy, in 2016 and 2019, respectively,
where he is currently pursuing the Ph.D. degree.
He visited Nokia Bell Labs, Dublin, in 2018, working on QoS provisioning in 60 GHz networks.
His current work includes the study of protocols
and solutions for vehicular networks operating at
millimeter-wave frequencies, the modeling and optimization
of network solutions to support next-generation cloud applications, and
open-source software development.
ANDREA ZANELLA (Senior Member, IEEE)
received the Laurea degree in computer engineering from the University of Padua, Italy, in 1998,
and the Ph.D. degree, in 2001. In 2000, he
spent nine months with Prof. Mario Gerla’s
Research Team at the University of California,
Los Angeles (UCLA). He is currently a Full Professor with the Department of Information Engineering (DEI), University of Padua. He is one of
the coordinators of the SIGnals and NETworking (SIGNET) Research Laboratory. His research interests include the fields
of protocol design, optimization, and performance evaluation of wired and
wireless networks. He has been serving as a Technical Area Editor for
the IEEE INTERNET OF THINGS JOURNAL and an Associate Editor for the
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, the IEEE
COMMUNICATIONS SURVEYS AND TUTORIALS, and Digital Communications and
Networks.
MICHELE ZORZI (Fellow, IEEE) received the
Laurea and Ph.D. degrees in electrical engineering
from the University of Padua, Italy, in 1990 and
1994, respectively.
From 1992 to 1993, he was on leave at the
University of California at San Diego (UCSD).
In 1993, he joined the Dipartimento di Elettronica
e Informazione, Politecnico di Milano, Italy. After
spending three years with the Center for Wireless
Communications, UCSD, in 1998, he joined the
School of Engineering, University of Ferrara, Italy, where he became a Professor, in 2000. Since November 2003, he has been a Faculty Member with
the Department of Information Engineering, University of Padua. His current
research interests include performance evaluation in mobile communications
systems, the Internet of Things, cognitive communications and networking,
5G mmWave cellular systems, vehicular networks, and underwater communications and networks.
Dr. Zorzi has served the IEEE Communications Society as a Member-at-Large of the Board of Governors, from 2009 to 2011 and from 2021 to
2023, as the Director of Education, from 2014 to 2015, and as the Director
of Journals, from 2020 to 2021. He received several awards from the
IEEE Communications Society, including the Best Tutorial Paper Award,
in 2008 and 2019, the Education Award, in 2016, the Stephen O. Rice Best
Paper Award, in 2018, and the Joseph LoCicero Award for exemplary service
to publications, in 2020. He was the Editor-in-Chief of the IEEE Wireless
Communications Magazine, from 2003 to 2005, the IEEE TRANSACTIONS
ON COMMUNICATIONS, from 2008 to 2011, and the IEEE TRANSACTIONS ON
COGNITIVE COMMUNICATIONS AND NETWORKING, from 2014 to 2018.