CN111954082B - Mask file structure, mask file reading method, computer device and readable storage medium - Google Patents
- Publication number
- CN111954082B (application CN201910414992.5A)
- Authority
- CN
- China
- Prior art keywords
- mask
- frame data
- mask frame
- data segment
- file
- Prior art date
- Legal status: Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/485—End-user interface for client configuration
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
Abstract
The invention discloses a mask file structure, a mask file reading method, computer equipment, and a readable storage medium, belonging to the field of computer technology. Through its index unit, the mask file allows mask frame data to be located quickly, avoiding frame-by-frame reading and ensuring that the mask frame data corresponding to a playing time can be located quickly when a client user drags the video progress bar. During playback, the region corresponding to a mask frame in a mask frame data segment, the barrage information, and the frame image of the video can be drawn on the screen so that the barrage information is displayed only outside the region corresponding to the mask frame. This prevents the main subject region of the video from being covered during playback and improves the user's viewing experience.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a mask file structure, a method for reading a mask file, a computer device, and a readable storage medium.
Background
A barrage (also "bullet screen" or "danmaku") is a form of commentary subtitle that pops up over a video watched through a network. The name comes from the visual effect: when a large number of comments drift across the screen, they resemble the bullet curtains of a shoot-'em-up game. By sending barrages while watching a live broadcast, users can interact with each other and with the broadcaster; after a user sends a barrage, the text scrolls across the live picture where other users can read it, forming a new kind of social interaction built around video content.
However, when there are too many barrages, they obscure much of the video content and impair the viewing experience. To avoid this, most users simply turn the bullet screen text off, but then the barrages used for interaction are no longer presented on the live picture, which reduces the interactivity of the live room.
Disclosure of Invention
Aimed at the problem that too many bullet screens impair viewing, a mask file structure, a mask file reading method, a computer device, and a readable storage medium are provided that display bullet screens without affecting the user's viewing experience.
A mask file structure comprising: at least one mask frame data segment, an identification unit and an index unit; wherein,
the mask frame data section is used for recording at least one frame of mask frame data;
the identification unit is arranged at a first preset position in the mask file and is used for recording file identification, the coding format of the mask frame data segment and the size parameter of the index unit;
the index unit is arranged at a second preset position in the mask file and is used for recording the physical position of each mask frame data segment in the mask file and the length parameter of each mask frame data segment.
Preferably, the mask frame data segment is composed of at least one piece of mask frame data covering a preset time length, arranged in chronological order of the mask frames' timestamps.
Preferably, the mask frame data includes a width, a height, a time stamp and frame data of the mask frame.
Preferably, the first preset position is a head of the mask file; the second preset position is located behind the first preset position.
Preferably, the identification unit is further configured to record a version number of the mask file.
Preferably, the length parameter is a length from a start mask frame to an end mask frame of the mask frame data segment.
Preferably, the physical location is a time stamp of a starting mask frame of the mask frame data segment.
The invention also provides a method for reading the mask file, which comprises the following steps:
acquiring the coding format of the mask frame data segment and a size parameter indicating an index unit;
reading the index unit according to the size parameter, and acquiring the physical position of each mask frame data segment and the length parameter of the mask frame data segment in the mask file;
and reading the mask frame data segment according to the coding format, the physical position and the length parameter.
Preferably, the step of obtaining the coding format of the mask frame data segment and the size parameter indicating the index unit includes:
and acquiring the coding format of the mask frame data segment and the size parameter of the indication index unit in the identification unit of the mask file.
Preferably, the step of reading the mask frame data segment according to the encoding format, the physical location and the length parameter includes:
according to the current playing timestamp, calculating in the index unit the timestamp of the starting mask frame of the mask frame data segment corresponding to the current playing time, based on the encoding format and the length parameter, and acquiring the physical position of the corresponding mask frame data segment from that starting mask frame's timestamp.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method when executing the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
The beneficial effects of the above technical scheme are that:
according to the technical scheme, the mask file can be used for rapidly positioning the mask frame data through the index unit, the condition that the mask frame data need to be read frame by frame is avoided, and the purpose that the mask frame data corresponding to the playing time can be rapidly positioned when a client user pulls the video progress bar is ensured. When the video is played, the area corresponding to the mask frame in the mask frame data segment, the barrage information and the frame image of the video can be drawn on the screen, so that the barrage information is displayed in the area outside the area corresponding to the mask frame, the purpose of avoiding the main area in the video from being covered during playing is achieved, and the watching effect of a user is improved.
Drawings
FIG. 1 is a block diagram of one embodiment of a mask file reading system according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of a mask file structure according to the present invention;
FIG. 3 is a flowchart of a method of an embodiment of a method for reading a mask file according to the present invention;
FIG. 4 is a flowchart of a method of displaying bullet screen information according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for identifying a subject region of a frame image in a video to generate mask frame data segments according to an embodiment of the present invention;
FIG. 6 is a flowchart of a method of one embodiment of a method of obtaining mask frame data according to the present invention;
FIG. 7 is a flowchart of a method for identifying a subject region of a frame image in a video using a semantic segmentation model according to an embodiment of the present invention;
FIG. 8 is a flowchart of a method for identifying a subject region of a frame image in a video using a semantic segmentation model according to another embodiment of the present invention;
FIG. 9 is a flowchart of a method for identifying a subject region of a frame image in a video using an example segmentation model according to one embodiment of the present invention;
FIG. 10 is a flowchart of a method of another embodiment of the method of acquiring mask data of the present invention;
FIG. 11 is a flowchart of a method for rendering an area corresponding to a mask frame in a mask frame data segment, bullet screen information, and a frame image of a video on a screen according to an embodiment of the present invention;
FIG. 12 is a flowchart of a method of an embodiment of a method for rendering a bullet screen mask of the present invention;
FIG. 13 is a block diagram of an embodiment of a system for reading a mask file according to the present invention;
fig. 14 is a schematic hardware configuration diagram of a computer device for executing a method for reading a mask file according to an embodiment of the present invention.
Detailed Description
The advantages of the invention are further illustrated in the following description of specific embodiments in conjunction with the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
In the description of the present invention, it should be understood that the numerical references before the steps do not identify the order of performing the steps, but merely serve to facilitate the description of the present invention and to distinguish each step, and thus should not be construed as limiting the present invention.
The video of the embodiments of the application may be presented on clients such as large-screen video playing devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, e-book readers, and other display terminals.
The video in the embodiments of the application can be applied not only to video playing programs for matches, but to any application scenario capable of presenting video, for example job-hunting programs, dating shows, multi-party competition entertainment programs, and the like. The embodiments of the present application take a soccer video playing program as an example, but are not limited thereto.
In the embodiment of the application, the server W processes the video data uploaded by the user side together with the corresponding barrage data to generate a mask file, which the server W can send to each viewing end (i.e., stream-pulling end); each viewing end then plays the video information, mask frame data segments, and barrage information. Referring to fig. 1, fig. 1 is a system architecture diagram for reading a mask file according to an embodiment of the present application. As shown in fig. 1, users A, B, and C view, over a wireless network, video data that has been processed by the server W so that barrages are not displayed over the main region of the video, while users D and E view the same processed video data over a wired network. Only one server W is shown here; the application scenario may also include multiple servers communicating with each other. The server W may be a cloud server or a local server; in the embodiment of the present application, the server W is placed on the cloud side. After the server W processes a recorded live video and its barrage information, the processed video data is forwarded to users A, B, C, D, and E.
The present invention provides a mask file to overcome the prior-art defect that too many bullet screens impair viewing. Fig. 2 shows the structure of a mask file according to a preferred embodiment of the present invention, which is described below.
The mask file structure (as shown in FIG. 2) may include: at least one mask frame data segment, an identification unit and an index unit (i.e., a location frame list); wherein:
the mask frame data section is used for recording at least one frame of mask frame data;
the mask frame data section is composed of at least one mask frame data which is arranged according to a preset time length and the time sequence of a timestamp of the mask frame.
The mask frame data may include the width, height, time stamp (i.e., time stamp of the frame image of the original video) and frame data of the mask frame.
Each mask frame data segment contains several consecutive pieces of mask frame data, arranged contiguously in ascending order of their pts_time_ms, in the coding format (i.e., codec_id value) given in the identification unit. For example:
|mask frames sorted by pts_time_ms,optionally compressed|
|mask frames sorted by pts_time_ms,optionally compressed|
|......|
|mask frames sorted by pts_time_ms,optionally compressed|
When codec_id = 0x0 (bitstream) mask coding is employed, a series of mask frames are arranged contiguously in ascending order of their pts_time_ms and then compressed into a mask frame data segment using the gzip compression algorithm. Each mask frame data record consists of: frame width + frame height + frame PTS + frame data, in the following format:
|2bytes|2bytes|8bytes|(width*height)/8bytes|
|width|height|pts_time_ms|data|
Here width represents the frame width: 2 bytes, network order, unsigned integer. height represents the frame height: 2 bytes, network order, unsigned integer. pts_time_ms represents the frame's PTS: 8 bytes, network order, unsigned integer, in ms. data represents the frame's binary payload, occupying (width*height)/8 bytes; each bit represents one pixel, stored width-first (row-major).
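By way of illustration and not limitation (this sketch is not part of the original disclosure), the bitstream layout above can be parsed as follows in Python, assuming the whole segment is in memory, numpy is available, and the MSB-first bit order of np.unpackbits matches the on-disk bit order:

```python
import gzip
import struct
import numpy as np

def parse_bitstream_segment(segment_bytes: bytes):
    """Parse a codec_id = 0x0 segment: a gzip-compressed sequence of
    |width|height|pts_time_ms|data| records in network (big-endian) order."""
    raw = gzip.decompress(segment_bytes)
    frames, offset = [], 0
    while offset < len(raw):
        # 2-byte width, 2-byte height, 8-byte pts, all unsigned big-endian
        width, height, pts_ms = struct.unpack_from(">HHQ", raw, offset)
        offset += 12
        nbytes = (width * height) // 8      # one bit per pixel, width-first
        bits = np.unpackbits(
            np.frombuffer(raw, dtype=np.uint8, count=nbytes, offset=offset))
        mask = bits[:width * height].reshape(height, width).astype(bool)
        offset += nbytes
        frames.append((pts_ms, mask))       # True bits mark the subject
    return frames
```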
When codec_id = 0x1 (svg) mask coding is adopted, the mask frame data records are arranged contiguously in ascending order of their pts_time_ms, without compression. Each record consists of: frame data length + frame PTS + frame data, in the following format:
|4bytes|8bytes|data_size bytes|
|data_size|pts_time_ms|data|
Here data_size represents the length of the frame data: 4 bytes, network order, unsigned integer; it does not count the data_size and pts_time_ms fields themselves. pts_time_ms represents the PTS of the frame data (identifying the original image frame the mask was taken from, i.e., the timestamp of the original video frame): 8 bytes, network order, unsigned integer, in ms. data represents the frame's binary payload in SVG format, occupying data_size bytes.
When codec_id = 0x2 (svg, gzip-compressed) mask encoding is used, the mask frame data records are arranged contiguously in ascending order of their pts_time_ms and then compressed using the gzip algorithm. Each record consists of: frame data length + frame PTS + frame data, in the following format:
|4bytes|8bytes|data_size bytes|
|data_size|pts_time_ms|data|
The fields are as in the codec_id = 0x1 case: data_size (4 bytes, network order, excluding the data_size and pts_time_ms fields themselves), pts_time_ms (8 bytes, network order, in ms, the timestamp of the original video frame the mask was taken from), and data (data_size bytes of binary SVG frame data).
The identification unit is arranged at a first preset position in the mask file and is used for recording file identification, the coding format of the mask frame data segment and the size parameter of the index unit;
the identification unit is further used for recording the version number of the mask file.
The identification unit is fixed at 16 bytes and occupies the first 16 bytes of the mask file, with the following structure:
|4bytes|4bytes|1byte|3bytes|4bytes|
|file tag|version|codec_id|reserved|entry_num|
Here file tag represents the file identifier with the fixed value "MASK", occupying 4 bytes; it can be regarded as a magic number. version represents the version number of the mask file: 4 bytes, network order, unsigned integer; the legal value is 1, and a file with a higher version must be treated as invalid. reserved is a reserved field of 3 bytes, padded with 0. entry_num represents the number of index entries in the index unit: 4 bytes, network order, unsigned integer; each frame index entry has a fixed length of 16 bytes. codec_id represents the encoding mode: 1 byte, unsigned integer, describing the encoding format of the mask frames and mask frame data segments. Its legal values are as follows:
- 0x0 (bitstream): a series of mask frames are arranged contiguously in ascending order of pts_time_ms (the timestamp of the original video frame each mask was taken from), then compressed with gzip.
- 0x1 (svg): a series of mask frames are arranged contiguously in ascending order of pts_time_ms, uncompressed.
- 0x2 (svg, gzip): a series of mask frames are arranged contiguously in ascending order of pts_time_ms, then compressed with gzip.
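For illustration, the 16-byte identification unit can be unpacked with a single struct call. The sketch below is an assumed reading of the layout above, including the rule that a higher version must be treated as an invalid file:

```python
import struct

HEADER_FMT = ">4sIB3xI"  # file_tag, version, codec_id, 3 reserved bytes, entry_num
assert struct.calcsize(HEADER_FMT) == 16

def parse_identification_unit(buf: bytes):
    file_tag, version, codec_id, entry_num = struct.unpack_from(HEADER_FMT, buf)
    if file_tag != b"MASK":
        raise ValueError("not a mask file")
    if version != 1:
        raise ValueError("unsupported mask file version")  # treat as invalid
    if codec_id not in (0x0, 0x1, 0x2):
        raise ValueError(f"unknown codec_id {codec_id:#x}")
    return codec_id, entry_num
```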
The index unit is arranged at a second preset position in the mask file and used for recording and indicating the physical position of each mask frame data segment and the length parameter of the mask frame data segment in the mask file. The length parameter is the length from the initial mask frame to the end mask frame of the mask frame data section. The physical location is a timestamp of a starting mask frame of the mask frame data segment.
The first preset position is the head of the mask file; the second preset position is located behind the first preset position.
The index unit is formed by arranging several entries of identical length contiguously; each entry is fixed at 16 bytes, structured as follows:
|8bytes|8bytes|
|pts_time_ms|file_offset|
|pts_time_ms|file_offset|
|......|......|
|pts_time_ms|file_offset|
the index unit consists of pts _ time _ ms and file _ offset.
pts _ time _ ms:8 bytes, network order, unsigned integer, and pts _ time, unit is ms, of the initial mask frame contained in the mask frame data segment;
file _ offset:8 bytes, net order, unsigned integer, indicates the offset of the mask frame data segment in the mask file.
It should be noted that the entries in the index unit are stored in ascending order of their pts_time_ms, which allows fast retrieval of the mask frame data segment containing a frame whose pts_time_ms is known. If entry B immediately follows entry A, the length of the data segment pointed to by A is B.file_offset - A.file_offset; for the last entry, the segment extends from its file_offset to the end of the file.
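Because the entries are sorted by pts_time_ms, segment lookup reduces to a binary search plus the offset subtraction just described. A minimal sketch, assuming the index has already been parsed into (pts_time_ms, file_offset) tuples and the total file size is known:

```python
import bisect

def locate_segment(entries, pts_time_ms, file_size):
    """entries: [(pts_time_ms, file_offset), ...] sorted ascending by pts.
    Returns (offset, length) of the segment covering pts_time_ms."""
    keys = [pts for pts, _ in entries]
    i = bisect.bisect_right(keys, pts_time_ms) - 1  # last entry with pts <= target
    if i < 0:
        raise KeyError("pts precedes the first segment")
    offset = entries[i][1]
    # the next entry's offset bounds this segment; the last one runs to end of file
    end = entries[i + 1][1] if i + 1 < len(entries) else file_size
    return offset, end - offset
```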
In practical application, a client first requests the identification unit and the index unit over HTTPS (HyperText Transfer Protocol Secure). According to the current viewing progress, the offset of the corresponding mask frame data in the mask file (that is, the position of the mask frame data segment containing it) can be found in the index unit, and the mask frame data segment for that playing moment is downloaded through an HTTP request. Thus, when a client user drags the video progress bar, the client can quickly locate the mask frame data corresponding to the playing moment, improving the user's viewing experience.
In general, the identification unit describes brief information about the whole file, and the index unit is a table for quickly indexing a mask frame data segment by pts_time (i.e., the timestamp of the original video frame); each mask frame data segment contains the mask frame data for a certain time period.
The format of the mask file is: identification unit + index unit + several mask frame data segments, stored contiguously in the mask file, for example:
|mask file header| ---- the identification unit
|mask frame indexing table| ---- the index unit
|mask frames data segment| ---- a mask frame data segment
|mask frames data segment| ---- a mask frame data segment
|......|
|mask frames data segment| ---- a mask frame data segment
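By way of example and not limitation, the write side implied by this layout could look like the sketch below; the function name and the in-memory segment list are illustrative assumptions, since only the on-disk arrangement is specified above:

```python
import struct

def write_mask_file(path, codec_id, segments):
    """segments: [(start_pts_ms, encoded_segment_bytes), ...], sorted
    ascending by pts. Writes identification unit + index unit + segments."""
    header = struct.pack(">4sIB3xI", b"MASK", 1, codec_id, len(segments))
    offset = len(header) + 16 * len(segments)   # first segment starts here
    index = bytearray()
    for pts_ms, seg in segments:
        index += struct.pack(">QQ", pts_ms, offset)
        offset += len(seg)
    with open(path, "wb") as f:
        f.write(header)
        f.write(bytes(index))
        for _, seg in segments:
            f.write(seg)
```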
By way of example and not limitation, the mask file structure may be applied to an offline on-demand video scene, where subject-region identification is performed on each frame image of the video at the server and a mask file containing the mask frame data segments is generated.
In this embodiment, the mask file allows mask frame data to be located quickly through the index unit, avoiding frame-by-frame reading and ensuring that the mask frame data corresponding to the playing time can be located quickly when the client user drags the video progress bar. During playback, the region corresponding to a mask frame in the mask frame data segment, the barrage information, and the frame image of the video can be drawn on the screen so that the barrage information is displayed only outside the region corresponding to the mask frame, preventing the main subject region of the video from being covered during playback and improving the user's viewing experience.
For the mask file described above, the method for reading the mask file may include the following steps (shown with reference to fig. 3):
C1. acquiring the coding format of the mask frame data segment and a size parameter indicating an index unit;
specifically, the step C1 of obtaining the coding format of the mask frame data segment and the size parameter indicating the index unit includes:
and acquiring the coding format of the mask frame data segment and the size parameter of the indication index unit in the identification unit of the mask file.
C2. Reading the index unit according to the size parameter, and acquiring the physical position of each mask frame data segment and the length parameter of the mask frame data segment in the mask file;
C3. and reading the mask frame data segment according to the coding format, the physical position and the length parameter.
Specifically, the step C3 of reading the mask frame data segment according to the encoding format, the physical location, and the length parameter includes:
according to the current playing timestamp, calculating in the index unit the timestamp of the starting mask frame of the mask frame data segment corresponding to the current playing time, based on the encoding format and the length parameter, and acquiring the physical position of the corresponding mask frame data segment from that starting mask frame's timestamp.
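Putting steps C1-C3 together, a client-side read might look like the following sketch. The HTTP Range transport and the requests library are assumptions for illustration (the text above specifies only that the units are fetched over HTTP/HTTPS); parse_identification_unit and locate_segment are the sketches given earlier:

```python
import struct
import requests  # transport choice is an assumption; any HTTP client works

def fetch_range(url, start, length):
    # a Range request transfers only the bytes actually needed
    resp = requests.get(
        url, headers={"Range": f"bytes={start}-{start + length - 1}"})
    resp.raise_for_status()
    return resp.content

def read_segment_for(url, play_pts_ms, file_size):
    """file_size can come from, e.g., a HEAD request's Content-Length."""
    head = fetch_range(url, 0, 16)                   # identification unit (C1)
    codec_id, entry_num = parse_identification_unit(head)
    raw = fetch_range(url, 16, entry_num * 16)       # index unit (C2)
    entries = [struct.unpack_from(">QQ", raw, i * 16)
               for i in range(entry_num)]
    offset, length = locate_segment(entries, play_pts_ms, file_size)
    return codec_id, fetch_range(url, offset, length)  # data segment (C3)
```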
In this embodiment, the mask file allows mask frame data to be located quickly through the index unit, avoiding frame-by-frame reading and ensuring that the mask frame data corresponding to the playing time can be located quickly when the client user drags the video progress bar. During playback, the region corresponding to a mask frame in the mask frame data segment, the barrage information, and the frame image of the video can be drawn on the screen so that the barrage information is displayed only outside the region corresponding to the mask frame, preventing the main subject region of the video from being covered during playback and improving the user's viewing experience.
The above-mentioned method for reading a mask file can be applied to a method process for displaying bullet screen information, and the method process for displaying bullet screen information will be described in detail with reference to the flowcharts of fig. 4 to 5. The processing corresponding to these flowcharts can be realized by the processor reading a corresponding processing program stored in the storage medium, for example, loading the program into the memory and executing it.
As shown in fig. 4, the method for displaying bullet screen information mainly includes the following steps:
s1, identifying a main body area of at least one frame of image in a video to generate at least one mask frame data segment;
each mask frame data segment corresponds to a position frame list for positioning the physical position of the mask frame data segment.
The location frame list is used to store the physical address of each mask frame data segment, and each mask frame data segment associated with the video can be queried through the location frame list.
It should be noted that: the body region may be selected from at least one of:
a person area range, an animal area range, a landscape area range, a building area range, an artwork area range, a text area range, and a background area range distinguished from a person, an animal, a building, and an art.
The step S1 of identifying the main body region of at least one image in the video to generate at least one mask frame data segment may include (shown in reference to fig. 5):
s11, splitting the video into at least one frame image;
s12, identifying a main body area in the frame image;
s13, generating mask frame data corresponding to the frame image according to the main body area in the frame image;
in this step, mask frame data corresponding to the frame image is generated based on a main body region corresponding to a main body region in the frame image, a size of the main body region, and a video time stamp corresponding to the frame image.
Next, with respect to the process of acquiring mask frame data in the present embodiment, the process of acquiring mask frame data will be described in detail with reference to the flowcharts of fig. 6 to 9. The processing corresponding to these flowcharts can be realized by the processor reading a corresponding processing program stored in the storage medium, for example, loading the program into the memory and executing it.
As shown in fig. 6, a method of acquiring mask frame data may include:
A1. identifying a main body area of at least one frame of image in the video based on an image segmentation algorithm;
as one example, a semantic segmentation model may be employed to identify a subject region of at least one image in a video.
The semantic segmentation model sequentially comprises at least two feature extraction modules, at least one feature enhancement layer and a classification layer.
Referring to fig. 7, the step of identifying a subject region of at least one image in the video using a semantic segmentation model may include:
a1-1-1, respectively extracting a feature map of at least one frame of image of the video through each feature extraction module;
a1-1-2, performing step-by-step fusion on the feature maps output by each feature extraction module, and fusing the feature maps finally output by all the feature extraction modules by the at least one feature enhancement layer to generate a comprehensive feature map;
and A1-1-3, the classification layer obtains the main body region according to a pixel prediction semantic segmentation result corresponding to the comprehensive characteristic diagram.
The semantic segmentation model can adopt FCN, DilatedNet, DeepLab, and other models.
By way of example, and not limitation, the semantic segmentation model employs a DeepLab model, which has the advantages of good accuracy and high speed. The DeepLab model mainly comprises a network backbone for extracting the feature map, a feature enhancement layer for enhancing features and reducing the influence of feature-map size, and a classification layer for predicting the class of each pixel (class 0 is usually background; a common label set is the 91 classes of the COCO dataset, including people, some animals, some common objects, and so on).
Further, before the step of performing the step A1-1-3, where the classification layer obtains the main region according to the pixel prediction semantic segmentation result corresponding to the comprehensive feature map, the method further includes:
and adopting a conditional random field module to carry out constraint processing on the pixel points of each object region in the comprehensive characteristic diagram to obtain the processed comprehensive characteristic diagram.
In this step, considering that the boundary of the extracted object region is rough, a boundary-detection post-processing step is adopted to improve the continuity and fit of the boundary: a conditional random field module smooths the current image according to the previous frame (or an earlier frame), improving the frame-to-frame continuity of each object region's boundary and thus the visual fit.
Referring to fig. 8, before the step of performing the step of identifying the main region of at least one frame of image in the video by using the semantic segmentation model, the method may further include:
a1. acquiring at least one first sample image, wherein the first sample image is an object image comprising a person and/or an animal;
a2. obtaining at least one second sample image, wherein the second sample image is a background image without people and animals;
a3. extracting an object region in the object image;
a4. synthesizing the object region and the background image to generate a training sample set;
a5. and (4) training the initial segmentation model by adopting a training sample set to obtain a semantic segmentation model, and executing the step A1-1-1.
In practical application, training samples for animation are hard to collect, so before a semantic segmentation model is used to identify the subject region in an animation video, it must be trained on a training sample set suited to animation. A training sample set with animated characters and backgrounds can therefore be synthesized through steps a1-a5: first obtain a batch of animated-character images (e.g., anime-style characters) with transparent or single-color (simple) backgrounds, and extract the character portions by matting (e.g., with a clustering algorithm); then obtain a batch of animation background images without characters; composite the character images onto the background images with varying scale and tone to obtain the training sample set; finally train the initial segmentation model on this set to obtain the semantic segmentation model.
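A minimal compositing sketch for this synthesis, assuming Pillow is available and the character image carries an alpha channel; all function and parameter names are illustrative:

```python
from PIL import Image

def composite_sample(character_png, background_img, scale, pos):
    """Overlay a transparent-background character onto a background image,
    yielding one synthetic training image plus its ground-truth mask."""
    char = Image.open(character_png).convert("RGBA")
    bg = Image.open(background_img).convert("RGBA")
    w, h = char.size
    char = char.resize((int(w * scale), int(h * scale)), Image.BILINEAR)
    canvas = bg.copy()
    canvas.paste(char, pos, mask=char)      # alpha channel doubles as paste mask
    label = Image.new("L", bg.size, 0)      # segmentation label: 255 = character
    label.paste(255, pos, mask=char.split()[-1])
    return canvas.convert("RGB"), label
```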
Further, the step a5 of training the initial segmentation model by using the training sample set to obtain the semantic segmentation model may include:
inputting a training sample set into an initial segmentation model to obtain a corresponding object region identification result, and updating a parameter value of the initial segmentation model;
and obtaining the semantic segmentation model until the training of the initial segmentation model is completed.
As another example, the example segmentation model may also be used to identify a subject region of at least one image in the video.
Wherein the instance segmentation model comprises: the device comprises an extraction module, a classification module, a regression module and a mask module;
the extraction module comprises: the device comprises a feature extraction module and a region extraction module.
Referring to fig. 9, the step of identifying a subject region of at least one image in the video using the example segmentation model may include:
a1-2-1, extracting a feature map of at least one frame of image of the video through a feature extraction module;
a1-2-2, the region extraction module performs non-maximum suppression on the feature map to extract a candidate region, and generates a target region feature map according to the feature map and the candidate region;
a1-2-3, predicting and obtaining the category of the target region feature map through the classification module;
a1-2-4, predicting frame position information of the target region feature map through the regression module;
a1-2-5, calculating a segmentation mask of the target region feature map through the mask module;
and A1-2-6, acquiring the main body region according to the belonged category, the frame position information and the segmentation mask.
By way of example, and not limitation, the instance segmentation model may employ a Mask R-CNN model. Since the semantic segmentation model offers weak control when post-processing is added later and cannot operate at the instance level, an instance segmentation model improves stability (and has a wider range of application). The Mask R-CNN model mainly comprises a network backbone for extracting the feature map, a region extraction module (RPN + ROI Align) for extracting candidate feature regions, a classification module, a regression module, and a mask module.
In practical applications, the semantic segmentation model is preferred for iteration because the instance segmentation model is slower both to train and to generate masks. For long videos with heavy workloads and simple scenes, the semantic segmentation model is preferred for its speed; for videos with complex scenes, the instance segmentation model can be used for its better recognition quality.
A2. And generating mask frame data according to the main body area.
Step A2, generating mask frame data according to the main body region, comprising:
and generating mask frame data according to the width and the height of the main body area and the corresponding time stamp of the frame image in the video.
Next, with respect to the process of acquiring mask frame data in the present embodiment, the process of acquiring mask frame data will be described in detail with reference to the flowchart of fig. 10. The processing corresponding to these flowcharts can be realized by the processor reading a corresponding processing program stored in the storage medium, for example, loading the program into the memory and executing it.
As shown in fig. 10, a method of acquiring mask data includes the steps of:
B1. acquiring a main body area of at least one frame of image in a video;
it should be noted that: the body region may be selected from at least one of:
a person area range, an animal area range, a landscape area range, a building area range, an artwork area range, a text area range, and a background area range distinguished from a person, an animal, a building, and an art.
B2. Converting the body region into contour data;
the step of converting the body region into contour data in step B2 may include:
respectively expressing each pixel point in the main body area by adopting a color value;
and the color values corresponding to all the pixel points in the main body region form the contour data.
In an embodiment, each pixel is marked as person or non-person; for example, the person portion may be marked black and the non-person portion white, so the resulting mask data is a silhouette-like picture (i.e., in Bitmap format).
At least one of steps B21, B22, B23 and B24 may be further included after the step of converting the body region into the profile data in step B2 is performed, specifically as follows:
b21. and compressing the pixel resolution of the contour data, and adjusting the pixel resolution of the contour data to be within a preset pixel resolution range.
For a 1080 × 720 video frame, a full Bitmap description needs 777600 pixels, so to shrink the Bitmap its picture size can be reduced. The mask frame data does not actually need definition as high as the original video; even if its resolution is much lower than the video's, the final effect is not noticeably degraded. The contour data Bitmap can therefore be limited to a preset size such as 320 × 180, reducing the Bitmap's volume.
b22. And compressing the color bit depth of the outline data, and adjusting the color bit depth of the outline data to a preset binary bit.
Generally, each Bitmap pixel needs RGBA8888 (red, green, blue, and alpha, 8 bits each) to express its color; for mask frame data, a single binary bit suffices to express whether a pixel belongs to the subject, so the Bitmap volume can be reduced by shrinking the storage each pixel occupies.
b23. And compressing the contour data.
By way of example and not limitation, this step can compress the contour data with the gzip algorithm. The contour Bitmap has a very distinct feature: both the subject and non-subject parts appear as large contiguous blocks, so the data repetition rate is extremely high and the gzip algorithm compresses it very effectively.
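A combined sketch of steps B21-B23, assuming the contour data starts as a boolean numpy array; plain striding stands in for proper downscaling:

```python
import gzip
import numpy as np

def compress_contour(mask: np.ndarray, stride: int = 6) -> bytes:
    """mask: boolean HxW array, True = subject pixel.
    Downsample (B21), pack to one bit per pixel (B22), then gzip (B23)."""
    small = mask[::stride, ::stride]         # e.g. 1080x1920 -> 180x320
    packed = np.packbits(small, axis=None)   # 8 pixels per byte, row-major
    return gzip.compress(packed.tobytes())
```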
b24. And performing edge feathering processing on the contour data.
Since the mask frame data does not need definition as high as the original video, the contour edges can be feathered with a blur, improving the smoothness of the contour data and the visual effect.
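A one-step feathering sketch, assuming Pillow; the blur radius is an illustrative choice, not a value from the disclosure:

```python
from PIL import Image, ImageFilter

def feather_edges(mask: Image.Image, radius: float = 3.0) -> Image.Image:
    """Blur the hard 0/255 contour so its edge ramps smoothly; the blurred
    values later act as per-pixel transparency when the barrage is drawn."""
    return mask.convert("L").filter(ImageFilter.GaussianBlur(radius))
```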
The step of converting the body region into the contour data in step B2 may include:
converting the body region into profile data in Scalable Vector Graphics (SVG) format.
SVG is a graphics format that describes two-dimensional vector graphics based on the Extensible Markup Language (a subset of the Standard Generalized Markup Language); it outlines the edges of a character through curve equations.
B3. And generating mask frame data according to the contour data and the corresponding time stamp of the frame image in the video.
It should be noted that when the client is a mobile terminal, data in Bitmap format can be used directly; when the client displays through a browser, only SVG-format data is accepted (a limitation of the browser's CSS standard), so SVG data is required on the browser side. Ultimately, however, the data must end up as a Bitmap (vector formats such as SVG are first converted into a Bitmap and then rendered).
S14, generating at least one mask frame data segment comprising at least one mask frame data according to the mask frame data.
Generating mask frame data by identifying a main body area of each frame of image in a video in a server, forming a mask frame data section by the mask frame data, and finally obtaining a mask file corresponding to the video. The mask file includes a mask frame data segment.
And S2, drawing an area corresponding to a mask frame in the mask frame data section, bullet screen information and a frame image of the video on a screen, wherein the bullet screen information is displayed in an area except the area corresponding to the mask frame.
In practical application, when the main body area is the human area range, the bullet screen information is not displayed in the human area range and is displayed in the areas except the human area range; when the main body area is a text area range, the bullet screen information is not displayed in the text area range and is displayed in an area except the text area range; when the main area is a background area range different from a person, an animal, a building, or an art, the bullet screen information is not displayed in the background area range and is displayed in an area other than the background area range.
Step S2 of drawing the region corresponding to the mask frame in the mask frame data segment, the barrage information, and the frame image of the video onto the screen (as shown in fig. 11) may include:
s21, decompressing the mask frame data segment;
s22, acquiring a mask area of the mask frame and a corresponding video timestamp;
and S23, drawing the mask area, the bullet screen information corresponding to the video timestamp and the frame image of the video corresponding to the video timestamp on a screen according to the video timestamp. Therefore, the consistency of the mask area, the bullet screen information and the frame image of the video in time is ensured.
Before executing step S23, drawing the mask area, the bullet screen information corresponding to the video timestamp, and the frame image of the video corresponding to the video timestamp onto the screen according to the video timestamp, the method further includes:
and performing edge feathering treatment on the mask area of the mask frame to improve the smoothness of the edge of the mask frame, thereby improving the visual effect.
In the embodiment, a mask frame data segment is generated by identifying a main body area of at least one frame of image in a video; when the video is played, the area corresponding to the mask frame in the mask frame data segment, the barrage information and the frame image of the video can be drawn on the screen, so that the barrage information is displayed in the area outside the area corresponding to the mask frame, the purpose of avoiding the main area in the video from being covered during playing is achieved, and the watching effect of a user is improved.
Next, with respect to the processing procedure of displaying the bullet screen information on the screen in this embodiment, the procedure of the rendering method of the bullet screen mask will be described in detail with reference to the flowchart of fig. 12. The processing corresponding to these flowcharts can be realized by the processor reading a corresponding processing program stored in the storage medium, for example, loading the program into the memory and executing it.
As shown in fig. 12, a method for rendering a bullet screen mask may include the steps of:
D1. acquiring bullet screen information, video data and corresponding mask frame data segments;
D2. decompressing the mask frame data segment;
the step of decompressing the mask frame data segment in step D2 may comprise:
and enlarging the display scale of each piece of decompressed mask frame data according to a preset decompression scale, so that the mask area corresponding to the mask frame data matches the size of the subject region in the original video image, ensuring the user's viewing experience.
Specifically, the display scale of the mask frame data may be enlarged in a bilinear stretching manner.
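A bilinear-stretch sketch, assuming Pillow; it returns a float alpha in [0, 1] so the result can feed directly into the drawing step described below:

```python
from PIL import Image
import numpy as np

def upscale_mask(mask: np.ndarray, video_w: int, video_h: int) -> np.ndarray:
    """Stretch a low-resolution mask (e.g. 320x180) back to the video's
    resolution using bilinear interpolation."""
    img = Image.fromarray(mask.astype(np.uint8) * 255, mode="L")
    img = img.resize((video_w, video_h), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0   # soft 0..1 alpha
```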
Before the step of decompressing the mask frame data segment in step D2, the method may further include:
and converting the mask frame data segment into a mask frame data segment in a raster graph format (namely, a bitmap file format). Since the bullet screen mask is finally processed in the bitmap file format, the data format needs to be uniformly converted into the bitmap file format before processing.
D3. Rendering decompressed mask frame data when the video data is played, drawing the mask frame data and barrage information into a frame image, and displaying the mask frame data when the barrage information passes through the mask frame data.
In the step D3, the decompressed mask frame data is rendered when the video data is played, and the step of drawing the mask frame data and the barrage information into the frame image may include:
and performing edge feathering processing on the mask frame data, and drawing the processed mask frame data and barrage information into a frame image according to a video time stamp when the video data is played. Therefore, the consistency of the masking area, the barrage information and the frame image of the video in time is ensured. And performing edge feathering treatment on the mask frame data, so that the edge of the mask frame data is softer and more natural.
In the step D3, when the bullet screen information passes through the mask frame data, the step of displaying the mask frame data includes:
the mask frame data serves as a transparency (alpha) channel: when the bullet screen information passes over the mask frame data, the bullet screen is multiplied by the mask's transparency before being drawn into the frame image.
When the bullet screen information is displayed, it fades from fully opaque to fully transparent at the edge of the mask frame data, making the mask softer and more natural and effectively concealing the limited accuracy of the algorithm's recognition of the subject's edges.
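A blending sketch of this transparency multiplication, assuming the video frame, the rendered bullet screen layer, and both alphas are float numpy arrays of matching size; all names are illustrative:

```python
import numpy as np

def draw_danmaku(frame, danmaku, danmaku_alpha, mask_alpha):
    """frame, danmaku: HxWx3 float arrays. danmaku_alpha: the bullet screen's
    own transparency. mask_alpha: ~1.0 outside the subject, ramping to 0.0
    inside (the feathered mask), so the bullet screen vanishes over the
    subject and fades smoothly at its edge."""
    a = (danmaku_alpha * mask_alpha)[..., None]   # combined per-pixel alpha
    return danmaku * a + frame * (1.0 - a)
```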
Referring to fig. 13, a mask file reading method provides a mask file reading system, including: an acquisition unit 11, a processing unit 12 and a reading unit 13, wherein:
an obtaining unit 11, configured to obtain a coding format of the mask frame data segment and a size parameter indicating an index unit;
the processing unit 12 is configured to read the index unit according to the size parameter, and obtain a physical position of each mask frame data segment and a length parameter of the mask frame data segment in the mask file;
a reading unit 13, configured to read the mask frame data segment according to the encoding format, the physical location, and the length parameter.
In a preferred embodiment, in a system for reading a mask file, the step of acquiring the encoding format of the mask frame data segment and the size parameter indicating the index unit by the acquiring unit 11 includes:
and acquiring the coding format of the mask frame data segment and the size parameter of the indication index unit in the identification unit of the mask file.
In a preferred embodiment, in a system for reading a mask file, the step of reading the mask frame data segment according to the encoding format, the physical location and the length parameter by the reading unit 13 includes:
according to the current playing timestamp, calculating in the index unit the timestamp of the starting mask frame of the mask frame data segment corresponding to the current playing time, based on the encoding format and the length parameter, and acquiring the physical position of the corresponding mask frame data segment from that starting mask frame's timestamp.
As shown in fig. 14, a computer apparatus 2, the computer apparatus 2 comprising:
a memory 21 for storing executable program code; and
and the processor 22 is used for calling the executable program codes in the memory 21, and the execution steps comprise the reading method of the mask file.
Fig. 14 illustrates an example of one processor 22.
The memory 21 is a non-volatile computer-readable storage medium, and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the method for reading a mask file in the embodiment of the present application. The processor 22 executes various functional applications and data processing of the computer device 2, namely, implements the method for reading the mask file of the above-described method embodiment, by executing the nonvolatile software program, instructions and modules stored in the memory 21.
The memory 21 may include a program storage area and a data storage area: the program storage area may store the operating system and the application program required for at least one function; the data storage area may store the user's playback information on the computer device 2. Further, the memory 21 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 21 optionally includes memory located remotely from the processor 22; such remote memory may be connected to the mask file reading system 1 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 21 and, when executed by the one or more processors 22, perform the mask file reading method of any of the above method embodiments, for example operating on the structure of fig. 2 and performing the method steps of fig. 3 described above, thereby realizing the functions of the mask file reading system 1 shown in fig. 13.
Such a product can execute the method provided by the embodiments of the present application and possesses the functional modules and beneficial effects corresponding to the executed method. For technical details not described in this embodiment, reference may be made to the methods provided in the other embodiments of the present application.
The computer device 2 of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., the iPhone), multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices (e.g., the iPad).
(3) Portable entertainment devices: such devices can display and play multimedia content. This class includes audio and video players (e.g., the iPod), handheld game consoles, electronic books, smart toys, and portable car navigation devices.
(4) Servers: devices that provide computing services. A server comprises a processor, a hard disk, memory, a system bus, and the like; its architecture is similar to that of a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like are higher.
(5) Other electronic devices with data interaction functions.
The present application further provides a non-transitory computer-readable storage medium storing computer-executable instructions. When these instructions are executed by one or more processors (for example, the processor 22 in fig. 14), the one or more processors 22 perform the mask file reading method of any of the above method embodiments, for example operating on the structure of fig. 2 and performing the method steps of fig. 3 described above to realize the functions of the mask file reading system 1 shown in fig. 13.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on at least two network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, or alternatively by hardware. Those skilled in the art will also understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Embodiment I:
The mask file reading method of the present application can be applied to an offline video-on-demand scenario. The server performs subject-area recognition on each frame image of the video and generates a mask file structure comprising mask frame data segments. When a client requests playback of the video file from the server, it obtains the identification unit and the index unit, searches the index unit according to the video timestamp currently being played, and obtains the physical address of the mask frame data segment corresponding to the current playing moment. The client then requests that segment from the server by its physical address, the server returns the corresponding mask frame data segment, and the client renders the mask frame data and draws the processed mask frame data together with the bullet screen information into the corresponding video frame images. The bullet screen information is thus displayed only outside the area covered by the mask frame, preventing the subject area of the video from being obscured during playback and improving the viewing experience.
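Tying the steps of this embodiment together, the following hedged sketch reuses the hypothetical read_identification_unit, read_index_unit and locate_segment helpers sketched earlier, and assumes the server supports HTTP range requests; the URL, the 10-byte header and the index unit sitting immediately after it are likewise assumptions:

```python
import io

import requests  # hypothetical transport; any HTTP client would do

MASK_URL = "https://example.com/videos/42.mask"  # hypothetical mask file location

class MaskClient:
    """Client side of the flow above: fetch the identification and index
    units once, then range-request one mask frame data segment at a time."""

    def __init__(self, url: str = MASK_URL):
        self.url = url
        # Identification unit at the first preset position (file header);
        # 10 bytes matches the hypothetical layout assumed earlier.
        self.ident = read_identification_unit(io.BytesIO(self._range(0, 9)))
        # Index unit assumed to follow immediately (second preset position).
        self.entries = read_index_unit(self._range(10, 9 + self.ident.index_size))

    def segment_at(self, play_ts_ms: int) -> bytes:
        """Fetch the raw mask frame data segment covering the playing time;
        decode it per self.ident.codec before compositing with the bullet screen."""
        offset, end = locate_segment(self.entries, play_ts_ms)
        if end is None:
            hdr = {"Range": f"bytes={offset}-"}  # last segment: read to end of file
        else:
            hdr = {"Range": f"bytes={offset}-{end - 1}"}
        return requests.get(self.url, headers=hdr).content

    def _range(self, start: int, stop: int) -> bytes:
        """GET an inclusive byte range of the mask file."""
        return requests.get(self.url, headers={"Range": f"bytes={start}-{stop}"}).content
```

Only the segment needed for the current playing moment crosses the network, which is the practical benefit of recording physical positions and lengths in a compact index rather than shipping the whole mask file.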
Embodiment II:
The mask file reading method can likewise be applied to offline on-demand playback in which the server performs animated-character recognition on each frame image of the video and generates a mask file structure comprising mask frame data segments. The client flow is the same as in Embodiment I: the client obtains the identification unit and the index unit, searches the index unit by the current playing timestamp, obtains the physical address of the corresponding mask frame data segment, requests it from the server, renders the returned mask frame data, and draws the processed mask frame data together with the bullet screen information into the corresponding video frame images. The bullet screen information is thus displayed outside the areas occupied by the animated characters, preventing the subject area of the video from being obscured during playback and improving the viewing experience.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, without departing from the scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A mask file structure, comprising: at least one mask frame data segment, an identification unit and an index unit; wherein,
the mask frame data segments are used for recording at least one frame of mask frame data, and each mask frame data segment comprises a plurality of continuous pieces of mask frame data, wherein an instance segmentation model is adopted to identify a main body area of at least one frame image in a video and the mask frame data is generated according to the main body area;
the identification unit is arranged at a first preset position in the mask file and is used for recording file identification, the coding format of the mask frame data segment and the size parameter of the index unit;
the index unit is arranged at a second preset position in the mask file and is composed of a plurality of table entries of identical length arranged in sequence, each table entry being used for recording and indicating the physical position of one mask frame data segment and the length parameter of that mask frame data segment in the mask file, wherein the physical position is the timestamp of the starting mask frame of the mask frame data segment, the length parameter is recorded through the offset of the mask frame data segment in the mask file, and the table entries are stored in order of timestamp size.
2. The mask file structure of claim 1, wherein each mask frame data segment is composed of at least one piece of mask frame data arranged, according to a predetermined time length, in the time order of the timestamps of the mask frames.
3. The mask file structure of claim 2, wherein the mask frame data includes a width, a height, a timestamp, and frame data of the mask frame.
4. The mask file structure of claim 1, wherein the first preset position is the header of the mask file, and the second preset position is located after the first preset position.
5. The mask file structure of claim 1, wherein the identification unit is further configured to record a version number of the mask file.
6. The mask file structure of claim 1, wherein the length parameter is a length from a beginning mask frame to an end mask frame of the mask frame data segment.
7. A method for reading a mask file, comprising the steps of:
acquiring a coding format of a mask frame data segment and a size parameter indicating an index unit, wherein each mask frame data segment comprises a plurality of continuous pieces of mask frame data, an instance segmentation model is adopted to identify a main body area of at least one frame image in a video, and the mask frame data is generated according to the main body area;
reading the index unit according to the size parameter, and acquiring the physical position of each mask frame data segment and the length parameter of the mask frame data segment in the mask file, wherein the index unit is composed of a plurality of entries of identical length, the physical position is the timestamp of the starting mask frame of the mask frame data segment, the length parameter is recorded through the offset of the mask frame data segment in the mask file, and the entries are stored in order of timestamp size;
reading the mask frame data segment according to the encoding format, the physical position and the length parameter, comprising: locating, in the index unit according to the current playing timestamp, the timestamp of the starting mask frame of the mask frame data segment corresponding to the current playing moment based on the encoding format and the length parameter, and acquiring the physical position of the corresponding mask frame data segment according to the timestamp of that starting mask frame.
8. The method for reading a mask file according to claim 7, wherein the step of acquiring the coding format of the mask frame data segment and the size parameter indicating the index unit comprises:
acquiring, from the identification unit of the mask file, the coding format of the mask frame data segment and the size parameter indicating the index unit.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 7 to 8 when executing the computer program.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 7 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910414992.5A CN111954082B (en) | 2019-05-17 | 2019-05-17 | Mask file structure, mask file reading method, computer device and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111954082A CN111954082A (en) | 2020-11-17 |
CN111954082B (en) | 2023-03-24
Family
ID=73336791
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910414992.5A Active CN111954082B (en) | 2019-05-17 | 2019-05-17 | Mask file structure, mask file reading method, computer device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111954082B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112637670B (en) * | 2020-12-15 | 2022-07-29 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
CN114880604A (en) * | 2022-05-05 | 2022-08-09 | 北京达佳互联信息技术有限公司 | Data processing method, data sending method, data acquiring method, data processing device, data sending device and data acquiring device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102136290A (en) * | 2011-04-21 | 2011-07-27 | 北京联合大学 | Method for storing embedded real-time video files |
CN102467941A (en) * | 2010-11-15 | 2012-05-23 | 幻境电子科技(上海)有限公司 | File navigation method based on D9 disc and storage medium |
CN102722555A (en) * | 2012-05-28 | 2012-10-10 | 北京网尚数字电影院线有限公司 | Method and system for caching multimedia file |
CN104750858A (en) * | 2015-04-16 | 2015-07-01 | 成都影泰科技有限公司 | Network-based data storage method |
CN107979621A (en) * | 2016-10-24 | 2018-05-01 | 杭州海康威视数字技术股份有限公司 | A kind of storage of video file, positioning playing method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20110114583A (en) * | 2008-12-19 | 2011-10-19 | 코닌클리케 필립스 일렉트로닉스 엔.브이. | Control of Display Parameter Settings |
CN101547360B (en) * | 2009-05-08 | 2010-11-10 | 南京师范大学 | Localizable video file format and method for collecting data of formatted file |
EP2897367A1 (en) * | 2014-01-19 | 2015-07-22 | Fabrix TV Ltd | Methods and systems of storage level video fragment management |
CN108961279A (en) * | 2018-06-28 | 2018-12-07 | Oppo(重庆)智能科技有限公司 | Image processing method, device and mobile terminal |
CN109302619A (en) * | 2018-09-18 | 2019-02-01 | 北京奇艺世纪科技有限公司 | A kind of information processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||