WO2021137856A1 - Optimal format selection for video players based on predicted visual quality using machine learning - Google Patents
- Publication number
- WO2021137856A1 (PCT/US2019/069055)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- frames
- machine learning
- training
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
- H04N17/004—Diagnosis, testing or measuring for television systems or their details for digital television systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
- H04N17/002—Diagnosis, testing or measuring for television systems or their details for television cameras
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/12—Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/154—Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/40—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23418—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234309—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4 or from Quicktime to Realvideo
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234363—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the spatial resolution, e.g. for clients with a lower screen resolution
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/23439—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements for generating different versions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/24—Monitoring of processes or resources, e.g. monitoring of server load, available bandwidth, upstream requests
- H04N21/2402—Monitoring of the downstream path of the transmission network, e.g. bandwidth available
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/266—Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
- H04N21/2662—Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/44029—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display for generating different versions
Definitions
- Aspects and implementations of the disclosure relate to video processing, and more specifically, to optimal format selection for video players based on predicted visual quality.
- Content sharing platforms enable users to upload, consume, search for, approve of (“like”), dislike, and/or comment on content such as videos, images, audio clips, news stories, etc.
- Users may upload content (e.g., videos, images, audio clips, etc.) for inclusion in the platform, thereby enabling other users to consume (e.g., view, etc.) the content.
- Most content sharing platforms transcode an original source video from its native encoded format into a commonly available format. Transcoding comprises decoding the source video from the native format into an unencoded representation using a codec for the native format and then encoding the unencoded representation with a codec for the commonly available format. Transcoding can be used to reduce storage requirements, and also to reduce the bandwidth requirements for serving the video to clients.
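The decode-then-encode structure described above can be sketched as follows. The `decode`/`encode` functions here are illustrative stand-ins, not a real codec API; an actual pipeline would invoke native and target codecs (e.g., via a tool such as FFmpeg).

```python
# Schematic two-step transcode: decode from the native format into an
# unencoded representation, then re-encode into the target format.

def decode(source, native_codec):
    # Stand-in: a real decoder would produce raw (unencoded) frames.
    return {"raw_frames": source["payload"], "codec": None}

def encode(raw, target_codec):
    # Stand-in: a real encoder would compress the raw frames.
    return {"payload": raw["raw_frames"], "codec": target_codec}

def transcode(source, target_codec):
    raw = decode(source, source["codec"])
    return encode(raw, target_codec)

source = {"payload": "frame-data", "codec": "vp9"}
print(transcode(source, "h264")["codec"])  # h264
```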
- A system and methods are disclosed for training a machine learning model (e.g., a neural network, a convolutional neural network (CNN), a support vector machine (SVM), etc.) and using the trained model to process videos.
- A method includes generating training data for a machine learning model to be trained to identify quality scores for a set of transcoded versions of a new video at a set of display resolutions.
- Generating the training data may include generating a plurality of reference transcoded versions of a reference video, obtaining quality scores for frames of the plurality of reference transcoded versions of the reference video, generating a first training input comprising a set of color attributes, spatial attributes, and temporal attributes of the frames of the reference video, and generating a first target output for the first training input, wherein the first target output comprises the quality scores for the frames of the plurality of reference transcoded versions of the reference video.
- The method further includes providing the training data to train the machine learning model on (i) a set of training inputs comprising the first training input and (ii) a set of target outputs comprising the first target output.
- The quality scores may include the peak signal-to-noise ratio (PSNR) of the frames.
- The quality scores may include the video multimethod assessment fusion (VMAF) score of the frames.
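PSNR, one of the quality scores named above, has a simple closed form: 10 · log10(MAX² / MSE) between reference and distorted pixel values. A minimal pure-Python version is shown below; VMAF, by contrast, is a learned fusion of several elementary metrics and is not reproduced here.

```python
import math

def psnr(reference, distorted, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    assert len(reference) == len(distorted) and reference
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return float("inf")  # identical frames: no noise
    return 10.0 * math.log10(max_val ** 2 / mse)

print(round(psnr([0, 0, 0, 0], [10, 10, 10, 10]), 2))  # 28.13
```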
- The color attributes may include at least one of an RGB or Y value of the frames.
- The spatial attributes may include a Gabor feature filter bank.
- The temporal attributes may include an optical flow.
- The machine learning model is configured to process the new video and generate one or more outputs indicating a quality score for the set of transcoded versions of the new video at the set of display resolutions.
- The plurality of transcoded versions of the reference video may include a transcoding of the reference video at each of a plurality of different video resolutions, transcoding configurations, and display resolutions.
- Computing devices for performing the operations of the above-described methods and the various implementations described herein are disclosed.
- Computer-readable media that store instructions for performing operations associated with the above-described methods and the various implementations described herein are also disclosed.
- Figure 1 depicts an illustrative computer system architecture, in accordance with one or more aspects of the disclosure.
- Figure 2 is a block diagram of an example training set generator to create data sets for a machine learning model using a historical video data set, in accordance with one or more aspects of the disclosure.
- Figure 3 is a block diagram illustrating a system for determining video classifications, in accordance with one or more aspects of the disclosure.
- Figure 4 depicts a flow diagram of one example of a method for training a machine learning model, in accordance with one or more aspects of the disclosure.
- Figure 5 depicts a flow diagram of one example of a method for processing videos using a trained machine learning model, in accordance with one or more aspects of the disclosure.
- Figure 6 depicts a flow diagram of one example of a method for processing videos using a trained machine learning model and optimizing format selection using the output of the trained machine learning model at a server device, in accordance with one or more aspects of the disclosure.
- Figure 7 depicts a flow diagram of one example of a method for optimizing format selection using the output of a trained machine learning model at a client device, in accordance with one or more aspects of the disclosure.
- Figures 8A and 8B provide an example graphical representation of output of a trained machine learning model to optimize format selection, according to implementations of the disclosure.
- Figure 9 depicts a block diagram of an illustrative computer system operating in accordance with one or more aspects of the disclosure.
- Videos uploaded to the content sharing platforms are transcoded (decompressed and re-compressed) before serving to viewers in order to enhance the viewing experience.
- An uploaded video may have multiple transcoded variants being played at various display resolutions. Resolutions (input, transcoded, and display) can be roughly grouped to canonical industry standard resolutions, such as 360p, 480p, 720p, 1080p, 2160p (4k), and so on.
- A typical transcoding pipeline can generate multiple transcoded versions (also called video formats).
- A media viewer can adaptively select one of those video formats to serve.
- A conventional serving strategy, assuming users have enough bandwidth, is to switch to a higher resolution version of a video until reaching the highest available resolution. This is also known as an Adaptive Bit Rate (ABR) strategy.
- The assumption of such an ABR strategy is that higher resolution versions provide better visual quality.
- In some cases, however, the visual quality of the higher resolution version can be very close to the visual quality of the lower resolution version (e.g., when a 480p version of a video has a similar perceptual quality as a 720p version of the video when played back on a client device having a 480p display resolution). In such a case, serving the higher resolution version of the video wastes the user’s bandwidth without providing a discernible benefit to the user in terms of perceptual quality.
- One approach to avoid such inefficiencies due to suboptimal format selection is to attach a list of objective quality scores to each transcoded version, with the underlying assumption that each objective quality score reflects the perceptual quality when playing a particular format at a certain display resolution.
- Computing a single quality score for a particular format of a video entails decoding two video streams (i.e., a rescaled transcoded version and the original version) and extracting per-frame features in order to calculate the overall quality score. If it is assumed there are ‘N format’ valid transcoded video formats and ‘N display’ possible display resolutions, the number of scores to compute grows with their product.
- Computing all possible quality scores across the various resolutions utilizes a large amount of computational resources. Furthermore, it may be infeasible to perform such sizable computations on large-scale systems (e.g., content sharing platforms having millions of new uploads every day).
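The combinatorial cost can be illustrated with a back-of-envelope count. The format and display lists below are assumptions for illustration, not figures from the disclosure; each (transcoded format, display resolution) pair needs its own decode-rescale-compare pass, and each pass may yield several metrics.

```python
# Exhaustive scoring cost for hypothetical canonical resolutions.
formats = ["240p", "360p", "480p", "720p", "1080p", "2160p"]   # transcoded versions
displays = ["360p", "480p", "720p", "1080p", "2160p"]          # client display resolutions
metrics = ["PSNR", "VMAF"]

passes = len(formats) * len(displays)   # decode/compare passes per video
scores = passes * len(metrics)          # individual quality scores
print(passes, scores)  # 30 60
```

Even with these modest counts, 30 full decode-and-compare passes per upload is prohibitive at the scale of millions of daily uploads, which motivates predicting the scores instead.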
- Implementations involve training and using an efficient machine learning model to predict objective quality scores for videos compressed with arbitrary compression settings and played on arbitrary display resolutions.
- A set of historical videos is accessed and used to train a machine learning model.
- Each of the historical videos is used to generate input features to train the machine learning model.
- The input features include a set of attributes of the frames of a respective historical video.
- The set of attributes can include color attributes (e.g., red/green/blue (RGB) intensity values or Y intensity values of the YUV format), spatial attributes (e.g., Gabor filter), and temporal attributes (e.g., optical flow) of the frame(s) of the respective historical video.
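The three attribute families can be sketched with cheap stand-ins: mean intensity for the color attribute, gradient energy in place of a Gabor filter bank, and frame differencing in place of optical flow. These simplifications are illustrative only; a production feature extractor would use the actual filters named above.

```python
def frame_features(frame, prev_frame=None):
    """frame: 2-D list of luma (Y) values; returns (color, spatial, temporal)."""
    n = sum(len(row) for row in frame)
    color = sum(sum(row) for row in frame) / n            # mean intensity
    spatial = sum(                                        # horizontal gradient energy
        abs(row[x + 1] - row[x])
        for row in frame for x in range(len(row) - 1)
    )
    temporal = 0.0
    if prev_frame is not None:                            # mean absolute frame difference
        temporal = sum(
            abs(a - b)
            for ra, rb in zip(frame, prev_frame)
            for a, b in zip(ra, rb)
        ) / n
    return (color, spatial, temporal)

print(frame_features([[0, 10], [20, 30]], [[0, 0], [0, 0]]))  # (15.0, 20, 15.0)
```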
- The plurality of transcoded versions are then rescaled into a plurality of different potential display resolutions of client devices (referred to herein as rescaled transcoded versions).
- A quality score (e.g., a peak signal-to-noise ratio (PSNR) measurement or a video multimethod assessment fusion (VMAF) measurement) is obtained for each of the rescaled transcoded versions.
- The quality scores may then be used as training outputs (e.g., ground truth labels), which are mapped to the training input features discussed above and used to train the machine learning model.
- The machine learning model is trained to generate a predicted quality score for the video at each possible tuple of video resolution format, transcoding configuration, and display resolution.
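One supervised training example could therefore pair the reference video's frame attributes with a multi-valued target holding the measured score for every tuple. The function and tuple keys below are hypothetical, chosen only to illustrate the shape of the mapping.

```python
def make_training_example(frame_attributes, score_table):
    """frame_attributes: per-frame (color, spatial, temporal) tuples.
    score_table: {(video format, transcoding config, display resolution): score}."""
    x = [v for attrs in frame_attributes for v in attrs]   # flattened input features
    # Fix a canonical tuple order so every example's target vector lines up.
    y = [score_table[key] for key in sorted(score_table)]
    return x, y

attrs = [(0.4, 1.2, 0.0), (0.5, 1.1, 0.3)]
scores = {("480p", "h264-crf28", "480p"): 41.8,
          ("720p", "h264-crf28", "480p"): 42.1}
x, y = make_training_example(attrs, scores)
print(len(x), y)  # 6 [41.8, 42.1]
```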
- The terms “video resolution format” and “video format” may refer to the resolution of a video prior to rescaling.
- The term “display resolution” may refer to the resolution at which the video is actually displayed (e.g., by a media viewer on a client device), after rescaling.
- A new video may be identified for processing by the trained machine learning model.
- A set of attributes (e.g., color, spatial, and temporal attributes) of the new video is presented as input to the trained machine learning model, which generates one or more outputs based on the input.
- The outputs are predicted quality scores providing a predicted perceptual quality measurement of the video at each possible tuple of video resolution, transcoding configuration, and display resolution.
- The predicted quality scores may be utilized to optimize format selection at the client device. Particular aspects concerning the training and usage of the machine learning model are described in greater detail below.
- Aspects of the disclosure thus provide a mechanism by which predicted quality scores for a video at all possible combinations of video resolution, transcoding configuration, and display resolution can be identified.
- This mechanism allows automated and optimized format selection for playback of a video at a client device having a particular display resolution.
- An advantage of implementations of the disclosure is that the trained machine learning model is able to return multiple objective video quality scores (e.g., PSNR and VMAF values) for all (video format, transcoding configuration, display resolution) tuples of an input video at once. Implementations avoid the time-consuming processes of transcoding and quality metric computation for each possible transcoded version of an input video.
- The output of the trained machine learning model can be utilized to optimize format selection to maximize the user experience of video quality. Optimizing format selection may also have the advantage of reducing bandwidth requirements without noticeably reducing the video quality perceived by a user.
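A selection rule of this kind can be sketched as follows. This is a hypothetical policy, not one specified in the disclosure: among formats whose predicted quality at the client's display resolution is within a tolerance of the best achievable, serve the cheapest (lowest bitrate).

```python
def select_format(candidates, tol=1.0):
    """candidates: list of (name, bitrate_kbps, predicted_score); higher score is better."""
    best = max(score for _, _, score in candidates)
    eligible = [c for c in candidates if c[2] >= best - tol]
    return min(eligible, key=lambda c: c[1])[0]   # cheapest format of acceptable quality

# E.g., for a 480p display where 480p and 720p are predicted nearly identical:
candidates = [("480p", 1200, 41.8), ("720p", 2500, 42.1), ("1080p", 5000, 42.3)]
print(select_format(candidates))  # 480p
```

Shrinking `tol` makes the rule quality-greedy: with `tol=0.1` only the 1080p variant qualifies, reproducing conventional ABR behavior.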
- Figure 1 depicts an illustrative system architecture 100, in accordance with one implementation of the disclosure.
- The system architecture 100 includes one or more server machines 120 through 150, a content repository 110, and client machines 102A-102N connected to a network 104.
- Network 104 may be a public network (e.g., the Internet), a private network (e.g., a local area network (LAN) or wide area network (WAN)), or a combination thereof.
- The client machines 102A-102N may be personal computers (PCs), laptops, mobile phones, tablet computers, set-top boxes, televisions, video game consoles, digital assistants, or any other computing devices.
- The client machines 102A-102N may run an operating system (OS) that manages hardware and software of the client machines 102A-102N.
- The client machines 102A-102N may upload videos to the web server (e.g., upload server 125) for storage and/or processing.
- Server machines 120 through 150 may each be a rackmount server, a router computer, a personal computer, a portable digital assistant, a mobile phone, a laptop computer, a tablet computer, a camera, a video camera, a netbook, a desktop computer, a media center, or any combination of the above.
- Server machine 120 includes an upload server 125 that is capable of receiving content (e.g., videos, audio clips, images, etc.) uploaded by client machines 102A-102N (e.g., via a webpage, via an application, etc.).
- Content repository 110 is a persistent storage that is capable of storing content items as well as data structures to tag, organize, and index the media items.
- Content repository 110 may be hosted by one or more storage devices, such as main memory, magnetic or optical storage based disks, tapes or hard drives, NAS, SAN, and so forth.
- In some embodiments, content repository 110 may be a network-attached file server, while in other embodiments content repository 110 may be some other type of persistent storage such as an object-oriented database, a relational database, and so forth, that may be hosted by the server machine 120 or one or more different machines coupled to the server machine 120 via the network 104.
- The content items stored in the content repository 110 may include user-generated media items that are uploaded by client machines, as well as media items from service providers such as news organizations, publishers, libraries, and so forth.
- In some implementations, content repository 110 may be provided by a third-party service, while in some other implementations content repository 110 may be maintained by the same entity maintaining server machine 120.
- Content repository 110 and server machines 120 through 150 may be part of a content sharing platform that allows users to upload, consume, search for, approve of (“like”), dislike, and/or comment on media items.
- The content sharing platform may include multiple channels.
- A channel can be data content available from a common source or data content having a common topic, theme, or substance.
- The data content can be digital content chosen by a user, digital content made available by a user, digital content uploaded by a user, digital content chosen by a content provider, digital content chosen by a broadcaster, etc.
- A channel can be associated with an owner, who is a user that can perform actions on the channel. Different activities can be associated with the channel based on the owner’s actions, such as the owner making digital content available on the channel, the owner selecting (e.g., liking) digital content associated with another channel, the owner commenting on digital content associated with another channel, etc.
- The activities associated with the channel can be collected into an activity feed for the channel. Users, other than the owner of the channel, can subscribe to one or more channels in which they are interested.
- The concept of “subscribing” may also be referred to as “liking”, “following”, “friending”, and so on.
- Each channel may include one or more media items.
- Media items can include, and are not limited to, digital videos, digital movies, digital photos, digital music, website content, social media updates, electronic books (ebooks), electronic magazines, digital newspapers, digital audio books, electronic journals, web blogs, really simple syndication (RSS) feeds, electronic comic books, software applications, etc.
- Media items are also referred to herein as video content items.
- Media items may be consumed via media viewers 105 executing on client machines 102A-102N.
- The media viewers 105 may be applications that allow users to view content, such as images, videos (e.g., video content items), web pages, documents, etc.
- The media viewers 105 may be a web browser that can access, retrieve, present, and/or navigate content (e.g., web pages such as Hyper Text Markup Language (HTML) pages, digital media items or content items, etc.) served by a web server.
- The media viewers 105 may render, display, and/or present the content (e.g., a web page, a media viewer) to a user.
- The media viewers 105 may also display an embedded media player (e.g., a Flash® player or an HTML5 player) that is embedded in a web page (e.g., a web page that may provide information about a product sold by an online merchant).
- The media viewers 105 may be a standalone application (e.g., a mobile application) that allows users to view digital media content items (e.g., digital videos, digital images, electronic books, etc.).
- The media viewers 105 may be provided to the client devices 102A through 102N by the server 120 and/or content sharing platform.
- The media viewers 105 may be embedded media players that are embedded in web pages provided by the content sharing platform.
- The media viewers may be applications that communicate with the server 120 and/or the content sharing platform.
- Server machine 130 includes a training set generator 131 that is capable of generating training data (e.g., a set of training inputs and target outputs) to train a machine learning model.
- Some operations of training set generator 131 are described in detail below with respect to Figure 2.
- Server machine 140 includes a training engine 141 that is capable of training a machine learning model 160.
- The machine learning model 160 may refer to the model artifact that is created by the training engine 141 using the training data that includes training inputs and corresponding target outputs (correct answers for respective training inputs).
- The training engine 141 may find patterns in the training data that map the training input to the target output (the answer to be predicted), and provide the machine learning model 160 that captures these patterns.
- The machine learning model may be composed of, e.g., a single level of linear or non-linear operations (e.g., a support vector machine [SVM]), or may be a deep network, i.e., a machine learning model that is composed of multiple levels of non-linear operations.
- An example of a deep network is a neural network with one or more hidden layers, and such machine learning model may be trained by, for example, adjusting weights of a neural network in accordance with a backpropagation learning algorithm or the like.
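The weight-adjustment principle behind backpropagation can be shown on the simplest possible case: stochastic gradient descent on a single linear unit. Backpropagation applies the same error-gradient update layer by layer through a deep network; this toy example is illustrative only and is far simpler than the model described in the disclosure.

```python
def train_linear(samples, lr=0.1, epochs=200):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = (w * x + b) - y   # prediction error for this sample
            w -= lr * err * x       # gradient step on the weight
            b -= lr * err           # gradient step on the bias
    return w, b

# Data drawn from y = 2x + 1; training should recover w ≈ 2, b ≈ 1.
w, b = train_linear([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```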
- The remainder of this disclosure refers to the implementation as a neural network, even though some implementations might employ an SVM or other type of learning machine instead of, or in addition to, a neural network.
- In some implementations, the machine learning model may be a convolutional neural network, such as ResNet or EfficientNet.
- Other machine learning models may be considered in implementations of the disclosure.
- In one implementation, the training set is obtained from server machine 130.
- Server machine 150 includes a quality score engine 151 and a format analysis engine 152.
- The quality score engine 151 is capable of providing attribute data of frames of a video as input to trained machine learning model 160 and running trained machine learning model 160 on the input to obtain one or more outputs.
- Format analysis engine 152 is also capable of extracting quality score data from the output of the trained machine learning model 160 and using the quality score data to perform optimal format selection for the video.
- Format analysis engine 152 may alternatively be provided by media viewers 105 at client devices 102A-102N, based on the quality score data obtained by quality score engine 151.
- In some implementations, the functionality of server machines 120, 130, 140, and 150 may be provided by a fewer number of machines.
- For example, server machines 130 and 140 may be integrated into a single machine, while in other implementations server machines 130, 140, and 150 may be integrated into a single machine.
- In addition, server machines 130, 140, and 150 may be integrated into the content sharing platform.
- In general, functions described in one implementation as being performed by the content sharing platform, server machine 120, server machine 130, server machine 140, and/or server machine 150 can also be performed on the client devices 102A through 102N in other implementations, if appropriate.
- The functionality attributed to a particular component can be performed by different or multiple components operating together.
- The content sharing platform, server machine 120, server machine 130, server machine 140, and/or server machine 150 can also be accessed as a service provided to other systems or devices through appropriate application programming interfaces, and thus are not limited to use in websites.
- Figure 2 illustrates an example training set generator 231 (e.g., training set generator 131 of FIG. 1) creating data sets for a machine learning model (e.g., model 160 of FIG. 1) using historical video data set 240, according to certain embodiments.
- System 200 of FIG. 2 shows training set generator 231, training input features 210, and target output labels 220.
- training set generator 231 generates a data set (e.g., training set, validating set, testing set) that includes one or more training input features 210 (e.g., training input, validating input, testing input) and one or more target output labels 220 that correspond to the training input features 210.
- the data set may also include mapping data that maps the training input features 210 to the target output labels 220.
- Training input features 210 may also be referred to as “data input,” “features,” “attributes,” or “information.”
- training set generator 131 may provide the data set to the training engine 141 of FIG. 1, where the data set is used to train, validate, or test the machine learning model 160.
- training input features 210 may include one or more of historical video color attributes 242, historical video spatial attributes 244, historical video temporal attributes 246, etc.
- Target output labels 220 may include video classifications 248.
- the video classifications 248 may include or be associated with video quality score measurement algorithms.
- training set generator 231 may generate data input corresponding to the set of features (e.g., one or more historical video color attributes 242, historical video spatial attributes 244, historical video temporal attributes 246) to train, validate, or test a machine learning model for each input resolution of a set of possible input resolutions of videos.
- For example, a model may be trained for each typical resolution of a video (e.g., 360p, 480p, 720p, 1080p, etc.), as well as for non-standard, arbitrary input resolutions (for instance, 1922x1084).
- a first machine learning model may be trained, validated, and tested for input videos having 360p resolution.
- a second machine learning model may be trained, validated, and tested for input videos having 480p resolution.
- a third machine learning model may be trained, validated, and tested for input videos having 720p resolution, and so on.
- Training set generator 231 may utilize a set of historical videos 240 to train, validate, and test the machine learning model(s).
- an existing data set of the content sharing platform may be curated and utilized as historical video data set 240 specifically for the purposes of training, validating, and testing machine learning models.
- historical video data set 240 may include multiple (e.g., on the order of thousands of) videos of short duration (e.g., 20 seconds) in various input resolutions (e.g., 360p, 480p, 720p, 1080p, 2160p (4K), and so on).
- the data (e.g., videos) in the historical video data set 240 may be divided into training data and testing data. For example, the videos of historical video data set 240 may be randomly split into 80% and 20% for training and testing, respectively.
- the training set generator 231 may iterate through the following process on each of the videos in the historical video data set 240. For ease of reference, the process is described with respect to a first reference video of the historical video data set 240. It should be understood that a similar process may be performed by training set generator 231 on all videos in the historical video data set 240.
- the training set generator 231 obtains the training input features 210 of the first reference video.
- a set of video color attributes 242, video spatial attributes 244, and video temporal attributes 246 are extracted for each frame of the first reference video.
- the video color attribute 242 may refer to at least one of RGB intensity values or Y intensity values (of the Y’UV model) of the pixels of the frame.
- the video spatial attributes 244 may refer to a Gabor filter feature bank.
- a Gabor filter is a linear filter used for texture analysis, which analyzes whether there is any specific frequency content in the frame in specific directions in a localized region around a point or region of analysis.
- a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave.
- Implementations of the disclosure may also utilize other spatial features of the frame, such as block-based features (e.g., SSM, VMS, etc.).
- the video temporal attributes 246 may refer to optical flow. Optical flow refers to the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene. Optical flow may be defined as the distribution of apparent velocities of movement of brightness pattern in an image. Implementations of the disclosure may also utilize other temporal features of the frame, such as computing a difference between pixels of neighboring frames.
- the extracted set of historical video color attributes 242, historical video spatial attributes 244, and historical video temporal attributes 246 for each frame in the first reference video are then combined as a first set of training input features for the first reference video.
- Other sets of features from the frames of the input video may also be considered in implementations of the disclosure.
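As an illustrative sketch (not the claimed implementation), the three attribute families could be approximated with NumPy alone; the function names and parameter defaults below are assumptions, and the temporal attribute uses a simple neighboring-frame difference rather than full optical flow:

```python
import numpy as np

def y_intensity(frame_rgb):
    """Color attribute: per-pixel luma (Y of Y'UV) from an RGB frame of shape (H, W, 3)."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def gabor_kernel(size=9, sigma=2.0, theta=0.0, lambd=4.0):
    """Spatial attribute: one 2D Gabor kernel (a Gaussian modulated by a sinusoid).
    A full feature bank would use several orientations (theta) and wavelengths (lambd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lambd)

def temporal_diff(prev_y, curr_y):
    """Temporal attribute: mean absolute luma difference between neighboring frames
    (a cheap stand-in for optical flow, which the disclosure also permits)."""
    return float(np.mean(np.abs(curr_y - prev_y)))
```

In practice the kernel would be convolved with each frame at several orientations to form the Gabor feature bank; the single-kernel version above only shows the construction.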
- the training set generator 231 may transcode the first reference video into a plurality of valid video formats (e.g., 360p, 480p, 720p, 1080p, 2160p (4K), and so on), with a plurality of transcoding configurations (e.g., sweeping Constant Rate Factor (CRF) from 0 to 51 with an H.264 encoder).
- Other codecs and encoders could be used in other implementations of the disclosure, such as the VP9 codec.
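The CRF sweep described above can be sketched as command-line construction for an H.264 encoder. The ffmpeg invocation, the scale filter, and the file-naming scheme below are illustrative assumptions, and the commands are only built as strings rather than executed:

```python
def transcode_commands(src, formats=("360p", "480p", "720p"), crf_range=range(0, 52, 10)):
    """Build ffmpeg command lines for an H.264 CRF sweep over several resolution formats.

    Returned as strings so this sketch has no ffmpeg dependency; the height
    assumed for each named format is the usual vertical pixel count."""
    heights = {"360p": 360, "480p": 480, "720p": 720, "1080p": 1080, "2160p": 2160}
    cmds = []
    for fmt in formats:
        for crf in crf_range:
            out = f"{src.rsplit('.', 1)[0]}_{fmt}_crf{crf}.mp4"
            cmds.append(
                f"ffmpeg -i {src} -c:v libx264 -crf {crf} "
                f"-vf scale=-2:{heights[fmt]} {out}"
            )
    return cmds
```

A full sweep (CRF 0 to 51 in steps of 1, all formats) simply widens the two ranges; a VP9 encoder would substitute a different codec flag.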
- the training set generator 231 then rescales all of the transcoded versions to a plurality of display resolutions (e.g., 360p, 480p, 720p, 1080p, 2160p, and so on).
- Each rescaled transcoded version and the original version are provided to a quality analyzer as source and reference to obtain quality scores for each frame of each rescaled transcoded version and the original version of the first reference video.
- the quality score may be a PSNR measurement for each frame of each rescaled transcoded version and the original version of the first reference video.
- PSNR refers to the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation.
- PSNR may be expressed in terms of the logarithmic decibel scale.
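A minimal NumPy sketch of the per-frame PSNR computation, assuming 8-bit pixel data with a peak value of 255:

```python
import numpy as np

def psnr(reference, distorted, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a distorted frame."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: no corrupting noise
    return 10.0 * np.log10(max_val**2 / mse)
```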
- the quality score may be a VMAF measurement for each frame of each rescaled transcoded version and the original version of the first reference video. VMAF refers to a full-reference video quality metric that predicts subjective video quality based on a reference and distorted video sequence.
- the computed quality scores are used as video classifications (e.g., input ground-truth labels) 248 for the target output labels 220 for each possible video format, transcoding configuration, and display resolution of the first reference video.
- FIG. 3 is a block diagram illustrating a system 300 for determining video classifications 346, according to certain embodiments.
- the system 300 performs data partitioning (e.g., via training set generator 131 of server machine 130 of FIG. 1 and/or training set generator 231 of FIG. 2) of the historical videos 342 (e.g., historical video data set 240 of FIG. 2) to generate the training set 302, validation set 304, and testing set 306.
- the training set may be 60% of the historical videos 342
- the validation set may be 20% of the historical videos 342
- the testing set may be 20% of the historical videos 342.
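The three-way partitioning could be sketched as follows; the 60/20/20 fractions match the split described above, while the seeded shuffle is an illustrative assumption:

```python
import random

def partition(videos, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly partition historical videos into training/validation/testing sets
    (60%/20%/20% by default)."""
    shuffled = videos[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```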
- the system 300 may generate a plurality of sets of features (e.g., color attributes, spatial attributes, temporal attributes) for each of the training set, the validation set, and the testing set.
- the system 300 performs model training (e.g., via training engine 141 of FIG. 1) using the training set 302.
- the system 300 may train multiple models using multiple sets of features of the training set 302 (e.g., a first set of features of the training set 302, a second set of features of the training set 302, etc.).
- system 300 may train a machine learning model to generate a first trained machine learning model using the first set of features in the training set (e.g., for a first input resolution, such as 360p) and to generate a second trained machine learning model using the second set of features in the training set (e.g., for a second input resolution, such as 480p).
- the machine learning model may be, for example, a Convolutional Neural Network (CNN).
- the system 300 performs model validation using the validation set 304.
- the system 300 may validate each of the trained models using a corresponding set of features of the validation set 304.
- system 300 may validate the first trained machine learning model using the first set of features in the validation set (e.g., for 360p input resolution) and the second trained machine learning model using the second set of features in the validation set (e.g., for 480p input resolution).
- the system 300 may validate numerous models (e.g., models with various permutations of features, combinations of models, etc.) generated at block 312.
- the system 300 may determine an accuracy of each of the one or more trained models (e.g., via model validation) and may determine whether one or more of the trained models has an accuracy that meets a threshold accuracy.
- a loss function may be utilized that is defined based on the absolute difference between predicted quality and ground truth. Higher-order differences (e.g., squared difference) may also be utilized.
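Such an absolute-difference loss can be sketched as follows, with a squared-difference variant shown for comparison; both function names are illustrative:

```python
import numpy as np

def l1_loss(predicted, ground_truth):
    """Mean absolute difference between predicted quality and ground truth."""
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(ground_truth))))

def l2_loss(predicted, ground_truth):
    """A higher-order (squared-difference) alternative."""
    return float(np.mean((np.asarray(predicted) - np.asarray(ground_truth)) ** 2))
```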
- the system 300 performs model selection to determine which of the one or more trained models that meet the threshold accuracy has the highest accuracy (e.g., the selected model 308, based on the validating of block 314). Responsive to determining that two or more of the trained models that meet the threshold accuracy have the same accuracy, flow may return to block 312 where the system 300 performs model training using further refined training sets corresponding to further refined sets of features for determining a trained model that has the highest accuracy.
- the system 300 performs model testing using the testing set 306 to test the selected model 308.
- the system 300 may test, using the first set of features in the testing set, the first trained machine learning model to determine whether the first trained machine learning model meets a threshold accuracy (e.g., based on the first set of features of the testing set 306). Responsive to the accuracy of the selected model 308 not meeting the threshold accuracy (e.g., the selected model 308 being overly fit to the training set 302 and/or validation set 304 and not applicable to other data sets such as the testing set 306), flow continues to block 312 where the system 300 performs model training (e.g., retraining) using different training sets corresponding to different sets of features.
- the model may learn patterns in the historical videos 342 to make predictions, and in block 318, the system 300 may apply the model on the remaining data (e.g., testing set 306) to test the predictions.
- system 300 uses the trained model (e.g., selected model 308) to receive input videos 348 and corresponding video attributes 344 and extracts, from the output of the trained model, a corresponding video classification 346, representing a predicted quality score of the input video at each possible combination of the video format, transcoding configuration, and display resolution tuple.
- the trained model may be used to predict per frame quality scores for each sampled frame, then those frame scores may be aggregated to get the overall score for the input video at the particular video format, transcoding configuration, and display resolution tuple.
- the aggregation method could also be varied, e.g., from a mean to a weighted mean, etc.
- the final output is a 3D quality matrix for all possible index tuples (video format index, transcoding config index, display resolution index).
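The frame-score aggregation into the 3D quality matrix could be sketched as a (possibly weighted) mean over the frame axis; the array layout below is an assumption:

```python
import numpy as np

def aggregate_quality(frame_scores, weights=None):
    """Aggregate per-frame predicted scores into the final 3D quality matrix.

    frame_scores: array of shape (num_frames, num_formats, num_configs, num_displays),
    one predicted score per sampled frame per (format, config, display) index tuple.
    With weights=None this is a plain mean over frames; passing per-frame weights
    gives the weighted-mean variant mentioned above."""
    frame_scores = np.asarray(frame_scores, dtype=np.float64)
    if weights is None:
        return frame_scores.mean(axis=0)
    w = np.asarray(weights, dtype=np.float64)
    return np.tensordot(w, frame_scores, axes=(0, 0)) / w.sum()
```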
- FIGS. 8A and 8B show predicted VMAF scores for a video transcoded into 360p, 480p, and 720p versions, and displayed at 480p and 720p, where the various scores predicted by the trained model are very close to the corresponding ground truth scores.
- the machine learning model 160 may be further trained, validated, tested (e.g., using manually determined image classifications, etc.), or adjusted (e.g., adjusting weights associated with input data of the machine learning model 160, such as connection weights in a neural network).
- one or more of the acts 310-320 may occur in various orders and/or with other acts not presented and described herein. In some embodiments, one or more of acts 310-320 may not be performed. For example, in some embodiments, one or more of data partitioning of block 310, model validation of block 314, model selection of block 316, or model testing of block 318 may not be performed.
- the trained model may be combined with temporal sampling of the input video.
- the system may utilize a few sampled frames (rather than all frames) of the input video as input to the trained model to predict the overall quality.
- significant speed improvements may be realized with little to no loss in accuracy. These speed gains can be obtained without any re-training of the model.
- the video classifications 346 obtained from using the trained model 320 may be used to optimize format selection on the client side (e.g., at a media viewer such as media viewer 105 of FIG. 1).
- For the example shown in FIG. 8A, without the predicted quality scores, the media viewer may blindly move to the 720p version, which utilizes more bandwidth but would not improve the actual watching experience (e.g., perceptual quality as measured by, for example, PSNR or VMAF).
- When the display resolution is 720p (FIG. 8B), moving up to the 720p version improves the actual perceptual quality.
- Another advantage of implementations of the disclosure is that they allow precise quality evaluation (e.g., at the CRF level). For example, the media viewer can calculate the exact quality improvement on 480p display resolution for switching from the 480p CRF 40 version to the 720p CRF 30 version, to further optimize the format selection.
- Figure 4 depicts a flow diagram of one example of a method 400 for training a machine learning model, in accordance with one or more aspects of the disclosure.
- the method is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination thereof.
- the method is performed by computer system 100 of Figure 1, while in some other implementations, one or more blocks of Figure 4 may be performed by one or more other machines not depicted in the figures.
- one or more blocks of Figure 4 may be performed by training set generator 131 of server machine 130 of FIG. 1 and/or training set generator 231 of FIG. 2.
- Method 400 begins with generating training data for a machine learning model.
- a training set T is initialized to an empty set.
- a reference video is transcoded into a plurality of resolution formats with a plurality of transcoding configurations.
- the plurality of resolution formats comprise 360p, 480p, 720p, 1080p, 2160p (4K), and so on.
- the plurality of transcoding configurations may include sweeping CRF from 0 to 51 with an H.264 encoder. Other codecs and encoders could be used in implementations of the disclosure for the transcoding configurations, such as the VP9 codec.
- the reference video may be a video having 480p resolution, that is transcoded into each resolution format of 360p, 480p, 720p, 1080p, and 2160p, at each transcoding configuration of the H.264 encoder of CRF 0 to 51.
- the 480p reference video can be transcoded into a 720p CRF 0 version through a 720p CRF 51 version, and so on, for all resolution formats and transcoding configurations, resulting in a plurality of different transcoded versions of the reference video.
- each transcoded version of the reference video generated at block 402 is rescaled to a plurality of different display resolutions.
- the plurality of different display resolutions represent the possible display resolutions of client devices playing back the transcoded versions of the reference video.
- the plurality of different display resolutions may include 360p, 480p, 720p, 1080p, 2160p, and so on.
- the 720p CRF 30 version can be rescaled to each display resolution of 360p, 480p, 720p, 1080p, and 2160p to generate rescaled transcoded versions of the 480p reference video. This rescaling is performed for each transcoded version of the reference video generated at block 402 to generate a plurality of rescaled transcoded versions of the reference video.
- quality scores for each frame of each rescaled transcoded version of the reference video are obtained.
- the quality scores may be a PSNR score or a VMAF score determined for the frames of the rescaled transcoded versions of the reference video.
- PSNR refers to the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation.
- PSNR may be expressed in terms of the logarithmic decibel scale.
- VMAF refers to a full-reference video quality metric that predicts subjective video quality based on a reference and distorted video sequence.
- an input/output mapping is generated for each frame of the reference video.
- the input/output mapping refers to the training input mapped to particular target outputs.
- the training input includes or is based on a set of attributes of the reference video.
- the target output for the training input identifies the quality scores for the frames of the rescaled transcoded versions of the reference video.
- the training input is associated with (or mapped to) the target output.
- the set of attributes of the reference video includes color attributes (e.g., RGB intensity values or Y intensity values of each frame), spatial attributes (e.g., Gabor filter feature of each frame), and temporal attributes (e.g., optical flow of each frame).
- the set of attributes of each frame of the reference video is mapped as input to the quality scores of each frame of each rescaled transcoded version of the reference video.
- the input/output mapping generated at block 405 is added to training set T.
- Block 407 branches based on whether training set T is sufficient for training machine learning model 160. If so, execution proceeds to block 408, otherwise, execution continues back at block 402. It should be noted that in some implementations, the sufficiency of training set T may be determined based on the number of input/output mappings in the training set, while in some other implementations, the sufficiency of training set T may be determined based on one or more other criteria (e.g., a measure of diversity of the training examples, etc.) in addition to, or instead of, the number of input/output mappings.
- training set T is provided to train machine learning model 160.
- training set T is provided to training engine 141 of server machine 140 to perform the training.
- In one implementation, the machine learning model is a neural network (e.g., a CNN). Input values of a given input/output mapping (e.g., pixel values of a training image, etc.) are provided to the input nodes of the neural network, and output values of the input/output mapping are stored in the output nodes of the neural network.
- the connection weights in the neural network are then adjusted in accordance with a learning algorithm (e.g., backpropagation, etc.), and the procedure is repeated for the other input/output mappings in training set T.
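As a highly simplified stand-in for this training procedure, the following fits a single linear layer by gradient descent on the absolute-difference loss. A real implementation would use a deep network (e.g., a CNN) with full backpropagation, so everything here — the model, learning rate, and epoch count — is illustrative:

```python
import numpy as np

def train(inputs, targets, lr=0.01, epochs=200, seed=0):
    """Fit a single linear layer by gradient descent on the L1 (absolute-difference) loss."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=inputs.shape[1])  # initial connection weights
    b = 0.0
    for _ in range(epochs):
        pred = inputs @ w + b
        grad_sign = np.sign(pred - targets)              # dL1/dpred
        w -= lr * inputs.T @ grad_sign / len(targets)    # adjust connection weights
        b -= lr * grad_sign.mean()
    return w, b

def mae(inputs, targets, w, b):
    """Mean absolute error of the fitted model on the given data."""
    return float(np.mean(np.abs(inputs @ w + b - targets)))
```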
- machine learning model 160 can then be used to process videos (for example, in accordance with method 500 of Figure 5, described below).
- Figure 5 depicts a flow diagram of one example of a method 500 for processing videos using a trained machine learning model, in accordance with one or more aspects of the disclosure.
- the method is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination thereof.
- the method is performed using the server machine 150 and trained machine learning model 160 of Figure 1, while in some other implementations, one or more blocks of Figure 5 may be performed by one or more other machines not depicted in the figures.
- Method 500 may include receiving an input video (e.g., from a user device or a server such as upload server 125) and processing the input video using a trained model such as trained machine learning model 160.
- the trained model may be configured to generate, based on pixel data of frames of the input video, one or more outputs indicating a predicted quality score of the input video.
- a video may be identified for processing.
- the video includes one or more frames of an uploaded video (e.g., a video uploaded to a content sharing platform).
- a subset of frames are extracted from the frames of the video.
- one out of every 10 frames in the video may be extracted.
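This temporal sampling step can be sketched as a simple stride over the frame list; the stride of 10 mirrors the one-in-ten example above:

```python
def sample_frames(frames, stride=10):
    """Temporal sampling: keep one out of every `stride` frames of the video."""
    return frames[::stride]
```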
- a set of attributes of the frame is identified.
- the set of attributes may include color attributes (e.g., RGB intensity values or Y intensity values of each frame), spatial attributes (e.g., Gabor filter features of each frame), and temporal attributes (e.g., optical flow of each frame).
- the set of attributes is provided as input to the trained machine learning model.
- the trained machine learning model is trained using method 400 described with respect to FIG. 4.
- one or more outputs are obtained from the trained machine learning model.
- the outputs from the trained machine learning model are quality scores corresponding to each extracted frame of the video.
- the trained machine learning model generates a plurality of quality scores, one corresponding to each possible index tuple of video resolution format, transcoding configuration, and display resolution of the input video.
- the per frame quality scores for the extracted frames of the video are combined to generate an overall quality score for the video.
- An overall quality score is generated for each of the possible index tuples of video resolution format, transcoding configuration, and display resolution.
- combining the per frame quality scores includes aggregating the scores.
- the aggregation process could be varied and can range from a mean to a weighted mean, and so on.
- a final output of the trained machine learning model is generated, where the final output is a 3D quality matrix for all possible index tuples of video resolution format, transcoding configuration, and display resolution.
- Figure 6 depicts a flow diagram of one example of a method 600 for processing videos using a trained machine learning model and optimizing format selection using the output of the trained machine learning model at a server device, in accordance with one or more aspects of the disclosure.
- the method is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination thereof.
- the method is performed using the server machine 150 and trained machine learning model 160 of Figure 1, while in some other implementations, one or more blocks of Figure 6 may be performed by one or more other machines not depicted in the figures.
- Method 600 may include receiving an input video (e.g., from a user device or a server such as upload server 125) and processing the input video using a trained model such as trained machine learning model 160.
- the trained model may be configured to generate, based on pixel data of frames of the input video, one or more outputs indicating a predicted quality score of the input video.
- a video may be identified for processing.
- the video includes one or more frames of an uploaded video (e.g., a video uploaded to a content sharing platform).
- a set of attributes of the video is identified.
- the set of attributes is determined for each of a subset of frames of the video.
- the set of attributes may include color attributes (e.g., RGB intensity values or Y intensity values of each frame of the video), spatial attributes (e.g., Gabor filter features of each frame of the video), and temporal attributes (e.g., optical flow of each frame of the video).
- the set of attributes is provided as input to the trained machine learning model.
- the trained machine learning model is trained using method 400 described with respect to FIG. 4.
- one or more outputs are obtained from the trained machine learning model.
- the outputs from the trained machine learning model are quality scores corresponding to each extracted frame of the video.
- the trained machine learning model generates a plurality of quality scores, one corresponding to each possible index tuple of video resolution format, transcoding configuration, and display resolution of the video.
- the per frame quality scores for the extracted frames of the video are combined to generate an overall quality score for the video.
- An overall quality score is generated for each of the possible index tuples of video resolution format, transcoding configuration, and display resolution.
- combining the per frame quality scores includes aggregating the scores.
- the aggregation process could be varied and can range from a mean to a weighted mean, and so on.
- quality score data is extracted from the outputs obtained at block 604.
- the quality score data includes quality scores corresponding to each rescaled transcoded version of the video.
- the quality scores each reflect a perceptual quality of the video at the respective rescaled transcoded configuration of the video at a particular display resolution.
- the extracted quality score data is used to determine a resolution format to select at a media viewer of a client device.
- the format analysis engine 152 (at server machine 150 of FIG. 1 and/or at media viewer 105 of client device 102 A) may use the extracted quality score to select a resolution format and transcoding configuration at the client device having a particular display resolution.
- the format analysis engine 152 may compare the quality score to inform an optimal resolution format and transcoding configuration selection for the client device.
- Figure 7 depicts a flow diagram of one example of a method 700 for optimizing format selection using the output of a trained machine learning model at a client device, in accordance with one or more aspects of the disclosure.
- the method is performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination thereof.
- the method is performed using client 102A-102N of Figure 1, while in some other implementations, one or more blocks of Figure 7 may be performed by one or more other machines not depicted in the figures.
- Method 700 may include receiving, at block 701, an input video (e.g., at a client device 102A-102N from a server such as upload server 125) for playback at a media viewer of the client device, where the client device has a particular display resolution.
- the client device performs playback of the video at a current format selection.
- the current format selection may include a current video resolution and a current transcoding configuration.
- quality scores corresponding to the video are accessed. The quality scores are generated using a trained model, such as trained machine learning model 160 of Figure 1.
- the trained model may be configured to generate, based on pixel data of frames of the input video, one or more outputs indicating predicted quality scores of the input video at various tuples of video resolutions, transcoding configurations, and display resolution.
- the outputs may be maintained at the server device (such as server machine 150) or may be provided as metadata with the video when the video is received for playback.
- the trained machine learning model may be trained according to method 400 of Fig. 4 described above.
- decision block 704 it is determined whether the quality scores indicate a perceptual improvement at a different format selection than the current format selection (e.g., at a different video resolution and/or transcoding configuration) for the particular display resolution of the client device.
- FIGS. 8A and 8B discussed below provide an example of using quality scores of the trained machine learning model of implementations of the disclosure to optimize format selection at a client device. If the quality scores indicate a perceptual improvement at a different format selection, then method 700 proceeds to block 705, where the current format selection for playback of the video is changed to the format selection (e.g., the video resolution and/or the transcoding configuration) corresponding to the quality score that provides a perceptual quality improvement of the video at the display resolution of the client.
- Method 700 then returns to block 702 to continue playback of the video at the new format selection. If the quality scores do not indicate a perceptual quality improvement at a different format selection at decision block 704, the method 700 returns to block 702 to continue playback of the video at the current format selection.
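Decision block 704 could be sketched as follows; the 6-point VMAF threshold (roughly one just-noticeable difference, per the FIGS. 8A and 8B discussion) and the dictionary layout are illustrative assumptions:

```python
def choose_format(quality_by_format, current, jnd_threshold=6.0):
    """Switch formats only when the predicted quality at the client's display
    resolution improves by at least one just-noticeable difference
    (illustratively, ~6 VMAF points).

    quality_by_format: dict mapping a (resolution, crf) tuple to a predicted score."""
    best = max(quality_by_format, key=quality_by_format.get)
    if quality_by_format[best] - quality_by_format[current] >= jnd_threshold:
        return best   # perceptual improvement: change the format selection
    return current    # no perceptual gain: keep the current selection
```

This captures the FIG. 8A scenario, where a nominally higher resolution is skipped because its predicted score is only marginally better at the current display resolution.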
- FIGS. 8A and 8B provide an example graphical representation of output of the trained machine learning model to optimize format selection, according to implementations of the disclosure.
- FIGS. 8A and 8B show predicted VMAF scores for a video transcoded into 360p, 480p and 720p versions, and displayed at 480p and 720p, where the various scores predicted by the trained model are very close to the corresponding ground truth scores.
- the graph shows predicted quality scores (e.g., VMAF) when the display resolution is 480p, and the current version being played is 360p, assuming the bandwidth is sufficient for all transcoded versions.
- Per the predicted quality scores, the perceptual quality of the 480p and 720p versions is very close, and both are significantly higher than that of the 360p version.
- Accordingly, the optimal format is the 480p transcoded version, instead of 720p.
- Absent these predicted scores, the media viewer blindly moves up to the 720p version, which utilizes more bandwidth but would not improve the actual watching experience (e.g., perceptual quality) of the video.
- When the display resolution is 720p (FIG. 8B), however, the difference between the 480p and 720p versions is greater than 6 points on the VMAF scale (which roughly translates to one step higher in just noticeable difference), so moving up to 720p improves the actual perceptual quality of the video.
- The media viewer in this case should therefore move up to the 720p version to improve the perceptual quality of the video during playback.
- Another advantage of implementations of the disclosure is that they allow precise quality evaluation at the CRF level. For example, as shown in FIGS. 8A and 8B, the media viewer can calculate the exact quality improvement on a 480p display resolution when switching from the 480p CRF 40 version to the 720p CRF 30 version, to further optimize the format selection.
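Because the model outputs a score per transcoded version, this CRF-level comparison reduces to a lookup and a subtraction. The sketch below is illustrative only: the scores are invented placeholders, not actual model output for any video.

```python
# Hypothetical CRF-level comparison at a 480p display resolution.
# Keys are (resolution, CRF) pairs; values are predicted quality
# scores (e.g., VMAF) from the trained model. Scores are made up.
scores_at_480p_display = {
    ("480p", 40): 78.0,  # 480p version encoded at CRF 40
    ("720p", 30): 85.5,  # 720p version encoded at CRF 30
}

# Exact predicted gain of switching versions, per the FIG. 8A/8B example.
gain = scores_at_480p_display[("720p", 30)] - scores_at_480p_display[("480p", 40)]
print(f"Predicted quality gain: {gain:.1f} points")
```

The viewer can then weigh this exact gain against the added bandwidth cost of the higher-resolution, lower-CRF version when making the format selection.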
- Figure 9 depicts a block diagram of an illustrative computer system 900 operating in accordance with one or more aspects of the disclosure.
- computer system 900 may correspond to a computing device within system architecture 100 of Figure 1.
- computer system 900 may be connected (e.g., via a network 630, such as a Local Area Network (LAN), an intranet, an extranet, or the Internet) to other computer systems.
- Computer system 900 may operate in the capacity of a server or a client computer in a client-server environment, or as a peer computer in a peer-to-peer or distributed network environment.
- Computer system 900 may be provided by a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device.
- the computer system 900 may include a processing device 902, a volatile memory 904 (e.g., random access memory (RAM)), a non-volatile memory 906 (e.g., read-only memory (ROM) or electrically-erasable programmable ROM (EEPROM)), and a data storage device 916, which may communicate with each other via a bus 908.
- Processing device 902 may be provided by one or more processors such as a general purpose processor (such as, for example, a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a microprocessor implementing other types of instruction sets, or a microprocessor implementing a combination of types of instruction sets) or a specialized processor (such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), or a network processor).
- Computer system 900 may further include a network interface device 922.
- Computer system 900 also may include a video display unit 910 (e.g., an LCD), an alphanumeric input device 912 (e.g., a keyboard), a cursor control device 914 (e.g., a mouse), and a signal generation device 920.
- Data storage device 916 may include a transitory or non-transitory computer-readable storage medium 924 on which may be stored instructions 926 encoding any one or more of the methods or functions described herein, including instructions for implementing methods 400-700 of Figures 4 through 7, respectively.
- Instructions 926 may also reside, completely or partially, within volatile memory 904 and/or within processing device 902 during execution thereof by computer system 900; hence, volatile memory 904 and processing device 902 may also constitute machine-readable storage media.
- While computer-readable storage medium 924 is shown in the illustrative examples as a single medium, the term "computer-readable storage medium" shall include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of executable instructions.
- the term “computer-readable storage medium” shall also include any tangible medium that is capable of storing or encoding a set of instructions for execution by a computer that cause the computer to perform any one or more of the methods described herein.
- the term "computer-readable storage medium" shall include, but not be limited to, solid-state memories, optical media, and magnetic media.
- the methods, components, and features described herein may be implemented by discrete hardware components or may be integrated in the functionality of other hardware components such as ASICs, FPGAs, DSPs, or similar devices.
- the methods, components, and features may be implemented by component modules or functional circuitry within hardware devices.
- the methods, components, and features may be implemented in any combination of hardware devices and computer program components, or in computer programs.
- Examples described herein also relate to an apparatus for performing the methods described herein.
- This apparatus may be specially constructed for performing the methods described herein, or it may comprise a general purpose computer system selectively programmed by a computer program stored in the computer system.
- a computer program may be stored in a computer-readable tangible storage medium.
Priority Applications (10)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201980103334.2A CN114982227B (en) | 2019-12-31 | 2019-12-31 | A method, device and medium for format selection |
| JP2022540529A JP7451716B2 (en) | 2019-12-31 | 2019-12-31 | Optimal format selection for video players based on expected visual quality |
| EP19845753.3A EP3864839A1 (en) | 2019-12-31 | 2019-12-31 | Optimal format selection for video players based on predicted visual quality using machine learning |
| KR1020227026332A KR102663852B1 (en) | 2019-12-31 | 2019-12-31 | Optimal format selection for video players based on predicted visual quality using machine learning |
| US17/790,102 US12356027B2 (en) | 2019-12-31 | 2019-12-31 | Optimal format selection for video players based on predicted visual quality using machine learning |
| PCT/US2019/069055 WO2021137856A1 (en) | 2019-12-31 | 2019-12-31 | Optimal format selection for video players based on predicted visual quality using machine learning |
| BR112022012563A BR112022012563A2 (en) | 2019-12-31 | 2019-12-31 | IDEAL FORMAT SELECTION FOR VIDEO PLAYERS BASED ON EXPECTED VISUAL QUALITY USING MACHINE LEARNING |
| KR1020247014698A KR20240065323A (en) | 2019-12-31 | 2019-12-31 | Optimal format selection for video players based on predicted visual quality using machine learning |
| JP2024032960A JP7622271B2 (en) | 2019-12-31 | 2024-03-05 | Optimal Format Selection for Video Players Based on Predicted Visual Quality Using Machine Learning |
| US19/262,042 US20250337969A1 (en) | 2019-12-31 | 2025-07-07 | Optimal format selection for video players based on predicted visual quality using machine learning |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021137856A1 true WO2021137856A1 (en) | 2021-07-08 |
Country Status (7)
| Country | Link |
|---|---|
| US (2) | US12356027B2 (en) |
| EP (1) | EP3864839A1 (en) |
| JP (2) | JP7451716B2 (en) |
| KR (2) | KR102663852B1 (en) |
| CN (1) | CN114982227B (en) |
| BR (1) | BR112022012563A2 (en) |
| WO (1) | WO2021137856A1 (en) |
Families Citing this family (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12333741B2 (en) * | 2020-01-28 | 2025-06-17 | Imax Corporation | No-reference visual media assessment combining deep neural networks and models of human visual system and video content/distortion analysis |
| EP4409900A4 (en) * | 2021-09-29 | 2024-12-25 | Telefonaktiebolaget LM Ericsson (publ) | EFFICIENT TRANSMISSION OF DECODING INFORMATION |
| US12367661B1 (en) * | 2021-12-29 | 2025-07-22 | Amazon Technologies, Inc. | Weighted selection of inputs for training machine-trained network |
| CN115396664A (en) * | 2022-08-19 | 2022-11-25 | 上海哔哩哔哩科技有限公司 | Video quality evaluation method, device, storage medium and computer system |
| US12456179B2 (en) * | 2022-09-30 | 2025-10-28 | Netflix, Inc. | Techniques for generating a perceptual quality model for predicting video quality across different viewing parameters |
| US20250322714A1 (en) * | 2023-05-10 | 2025-10-16 | Sierra Artificial Neural Networks | Systems and methods for artificial intelligence driven casino-style game analysis |
| US12536864B2 (en) * | 2023-05-10 | 2026-01-27 | Sierra Artificial Neural Networks | Systems and methods for slot machine game development utilizing graphic-based artificial intelligence game design systems |
| US12499527B2 (en) | 2023-10-04 | 2025-12-16 | Akamai Technologies, Inc. | Reference-based video quality analysis-as-a-service (VQAaaS) for over-the-top (OTT) streaming |
| EP4611373A1 (en) * | 2024-02-27 | 2025-09-03 | Innowireless Co., Ltd. | Method for predicting original resolution of video contents |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016183011A1 (en) * | 2015-05-11 | 2016-11-17 | Netflix, Inc. | Techniques for predicting perceptual video quality |
| US20170109584A1 (en) * | 2015-10-20 | 2017-04-20 | Microsoft Technology Licensing, Llc | Video Highlight Detection with Pairwise Deep Ranking |
| WO2019117864A1 (en) * | 2017-12-12 | 2019-06-20 | Google Llc | Transcoding media content using an aggregated quality score |
| US20190246112A1 (en) * | 2018-02-07 | 2019-08-08 | Netflix, Inc. | Techniques for predicting perceptual video quality based on complementary perceptual quality models |
| US20190258902A1 (en) * | 2018-02-16 | 2019-08-22 | Spirent Communications, Inc. | Training A Non-Reference Video Scoring System With Full Reference Video Scores |
| US20190295242A1 (en) * | 2018-03-20 | 2019-09-26 | Netflix, Inc. | Quantifying encoding comparison metric uncertainty via bootstrapping |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8611585B2 (en) * | 2008-04-24 | 2013-12-17 | GM Global Technology Operations LLC | Clear path detection using patch approach |
| US10827185B2 (en) * | 2016-04-07 | 2020-11-03 | Netflix, Inc. | Techniques for robustly predicting perceptual video quality |
| US20180121733A1 (en) | 2016-10-27 | 2018-05-03 | Microsoft Technology Licensing, Llc | Reducing computational overhead via predictions of subjective quality of automated image sequence processing |
| US10798387B2 (en) * | 2016-12-12 | 2020-10-06 | Netflix, Inc. | Source-consistent techniques for predicting absolute perceptual video quality |
| CN108235001B (en) | 2018-01-29 | 2020-07-10 | 上海海洋大学 | Deep sea video quality objective evaluation method based on space-time characteristics |
| US10887602B2 (en) | 2018-02-07 | 2021-01-05 | Netflix, Inc. | Techniques for modeling temporal distortions when predicting perceptual video quality |
| US11483472B2 (en) * | 2021-03-22 | 2022-10-25 | International Business Machines Corporation | Enhancing quality of multimedia |
Non-Patent Citations (3)
| Title |
|---|
| HOSSEIN TALEBI ET AL: "Learned Perceptual Image Enhancement", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 December 2017 (2017-12-07), XP080845785 * |
| MANISH NARWARIA ET AL: "Machine learning based modeling of spatial and temporal factors for video quality assessment", IMAGE PROCESSING (ICIP), 2011 18TH IEEE INTERNATIONAL CONFERENCE ON, IEEE, 11 September 2011 (2011-09-11), pages 2513 - 2516, XP032080179, ISBN: 978-1-4577-1304-0, DOI: 10.1109/ICIP.2011.6116173 * |
| See also references of EP3864839A1 * |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230140605A1 (en) * | 2020-12-21 | 2023-05-04 | Sling TV L.L.C. | Systems and methods for automated evaluation of digital services |
| CN116264606A (en) * | 2021-12-14 | 2023-06-16 | 戴尔产品有限公司 | Method, apparatus and computer program product for processing video |
| CN116416483A (en) * | 2021-12-31 | 2023-07-11 | 戴尔产品有限公司 | Computer-implemented method, apparatus and computer program product |
| CN115002520A (en) * | 2022-04-14 | 2022-09-02 | 百果园技术(新加坡)有限公司 | Video stream data processing method, device, equipment and storage medium |
| CN115002520B (en) * | 2022-04-14 | 2024-04-02 | 百果园技术(新加坡)有限公司 | Video stream data processing method, device, equipment and storage medium |
| US11606553B1 (en) * | 2022-07-15 | 2023-03-14 | RiversideFM, Inc. | Hybrid media recording |
| US11856188B1 (en) | 2022-07-15 | 2023-12-26 | RiversideFM, Inc. | Hybrid media recording |
| WO2024015139A1 (en) * | 2022-07-15 | 2024-01-18 | RiversideFM, Inc. | Hybrid media recording |
| US12088796B2 (en) | 2022-07-15 | 2024-09-10 | RiversideFM, Inc. | Hybrid media recording |
| US12348711B2 (en) | 2022-07-15 | 2025-07-01 | RiversideFM, Inc. | Hybrid media recording |
| CN115174919A (en) * | 2022-09-05 | 2022-10-11 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and medium |
| CN115174919B (en) * | 2022-09-05 | 2022-11-22 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and medium |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2024069300A (en) | 2024-05-21 |
| KR20220123541A (en) | 2022-09-07 |
| CN114982227A (en) | 2022-08-30 |
| JP7622271B2 (en) | 2025-01-27 |
| KR102663852B1 (en) | 2024-05-10 |
| JP2023509918A (en) | 2023-03-10 |
| KR20240065323A (en) | 2024-05-14 |
| US20250337969A1 (en) | 2025-10-30 |
| EP3864839A1 (en) | 2021-08-18 |
| US20230054130A1 (en) | 2023-02-23 |
| BR112022012563A2 (en) | 2022-09-06 |
| CN114982227B (en) | 2025-04-01 |
| US12356027B2 (en) | 2025-07-08 |
| JP7451716B2 (en) | 2024-03-18 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| ENP | Entry into the national phase |
Ref document number: 2019845753 Country of ref document: EP Effective date: 20201202 |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19845753 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2022540529 Country of ref document: JP Kind code of ref document: A |
|
| REG | Reference to national code |
Ref country code: BR Ref legal event code: B01A Ref document number: 112022012563 Country of ref document: BR |
|
| ENP | Entry into the national phase |
Ref document number: 20227026332 Country of ref document: KR Kind code of ref document: A |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 112022012563 Country of ref document: BR Kind code of ref document: A2 Effective date: 20220623 |
|
| WWD | Wipo information: divisional of initial pct application |
Ref document number: 202348047950 Country of ref document: IN |
|
| WWG | Wipo information: grant in national office |
Ref document number: 201980103334.2 Country of ref document: CN |
|
| WWG | Wipo information: grant in national office |
Ref document number: 17790102 Country of ref document: US |
|
| WWP | Wipo information: published in national office |
Ref document number: 202348047950 Country of ref document: IN |
|
| WWG | Wipo information: grant in national office |
Ref document number: 202247041751 Country of ref document: IN |