CN101419670B

CN101419670B - Video monitoring method and system based on advanced audio/video encoding standard

Info

Publication number: CN101419670B
Application number: CN2008102032020A
Authority: CN
Inventors: 王新; 路红; 宋元征; 陈桂财
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2008-11-21
Filing date: 2008-11-21
Publication date: 2011-11-02
Anticipated expiration: 2028-11-21
Also published as: CN101419670A

Abstract

The invention belongs to the technical field of video monitoring, and in particular relates to a video monitoring method based on AVS (the Advanced Audio Video Coding Standard) and a system implementing it. Following the development trend of video monitoring, the invention introduces automated processing and the AVS standard into video monitoring and combines techniques such as background/non-background classification and face detection and recognition, so that a computer system processes the surveillance video automatically in advance. While the validity of the returned content is guaranteed, the amount of information fed back to the operator is much smaller than in a traditional monitoring system, which greatly saves human resources and improves the reliability of the video monitoring system. The invention is the first to apply AVS to video monitoring technology, which gives it an advantage in patent application. With the strong support of national and local governments for the application and promotion of AVS, the invention has application value in fields such as digital surveillance, access control and identity recognition.

Description

Video monitoring method and system based on advanced audio/video encoding standard

Technical field

The invention belongs to the technical field of video monitoring, and specifically relates to a video monitoring method based on AVS (advanced audio/video encoding standard) and a system implementing it.

Background technology

Security has become a matter of wide concern, and a growing number of video monitoring systems have appeared, such as access control systems, attendance systems and identity recognition systems. A video monitoring system lets managers in the control room observe the activity of everyone in the protected area and keep a record of it, providing real-time image and sound information for the security system. However, a traditional video monitoring system requires a large amount of human resources: detection, recognition and understanding of the surveillance video content rely entirely on manual work, which lowers the working efficiency of the system and leaves its security and accuracy without guarantee. In addition, although video compression is a core technology of video monitoring systems, there is at present no video compression standard dedicated to digital video monitoring, which causes considerable problems for network transmission and system interoperability.

Summary of the invention

The objective of the invention is to propose a video monitoring method and system with high efficiency and good security.

Following the development trend of video monitoring, the present invention introduces automated processing and the AVS standard into video monitoring and combines techniques such as background/non-background classification and face detection and recognition, so that a computer system processes the surveillance video automatically in advance. While the validity of the returned content is guaranteed, the amount of information fed back to the operator is much smaller than in a traditional monitoring system, which greatly saves human resources and also improves the reliability of the video monitoring system. The invention is the first to apply AVS to video monitoring technology, which gives it an advantage in patent application. With the strong support of national and local governments for the application of AVS, the present invention has application value in fields such as digital surveillance, access control and identity recognition.

The present invention first acquires an AVS code stream with an AVS network camera and uses the compressed-domain information available during AVS code stream decoding to classify frames as background or non-background. When the classification result shows that the current frame is not background, face detection is performed. When a face is detected, face recognition is performed, that is, the face data is transformed and then compared with the training data. Before the recognition result is fed back to the user, a confidence t is calculated; t indicates the credibility of the current recognition result. When the confidence t is less than the threshold t_min, the face is considered not to belong to the current library and is identified as a stranger; this result is fed back to the user, and after the user confirms, the new face is added to the library. The threshold t_min is obtained from statistics on empirical data: the higher t_min is, the higher the precision, and the lower t_min is, the higher the recall, so a suitable t_min is set by balancing the two according to the actual conditions of the system. When the confidence is greater than or equal to t_min, the recognition result has high credibility, so the recognition result is recorded and the video is labeled. Fig. 1 is the flowchart of this video monitoring system; it reflects the two characteristics of the present invention, namely the use of AVS and automated processing.
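The control flow described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the three callables `is_background`, `detect_faces` and `recognize` are hypothetical stand-ins for the compressed-domain classifier, the face detector and the recognizer, and the action names are invented for the example.

```python
from typing import Callable, List, Optional, Tuple


def process_frame(
    frame: object,
    is_background: Callable[[object], bool],
    detect_faces: Callable[[object], List[object]],
    recognize: Callable[[object], Tuple[str, float]],
    t_min: float,
) -> List[Tuple[str, Optional[str]]]:
    """Return one (action, identity) pair per detected face.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    if is_background(frame):
        return []  # background frame: skip detection and recognition entirely
    results: List[Tuple[str, Optional[str]]] = []
    for face in detect_faces(frame):
        identity, t = recognize(face)
        if t < t_min:
            # Low confidence: treat as a stranger and report to the user.
            results.append(("report_stranger", None))
        else:
            # Confidence >= t_min: record the result and label the video.
            results.append(("label_video", identity))
    return results
```

Passing the stages in as functions keeps the sketch independent of any particular decoder or detector; only the early exit on background frames and the t_min comparison come from the description.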

The implemented system consists mainly of three parts: a training module, a labeling module and a retrieval module.

The training module comprises a training module for the background of the monitored environment and a training module for the face library, which train on the environment background and on faces respectively. Its inputs are the face sample library and the background sample library, and its outputs are the face features and the background features.

The labeling module comprises a background detection module, a face detection module, a face recognition module and an index structure building part, and labels the input surveillance video automatically. Its inputs are the background features and face features obtained by the training module and the surveillance video to be labeled; its output is the retrieval index of that surveillance video.

The retrieval module retrieves from a specified surveillance video and supports picture queries, text queries and video queries. Its inputs are the index of the specified surveillance video and the picture, text or short video clip submitted by the user; its output is the content in the surveillance video that corresponds to what the user submitted. Fig. 2 shows the main modules of the system, the workflow, and the logical relations between the modules. As shown in the figure, the initial inputs of the system are the face library and the background samples; training yields the background model, the face feature transformation matrix and the face feature library. The surveillance video is then labeled: background detection comes first, face detection is performed on images that are not background, and each face found is feature-transformed and indexed under the index structure. Finally, the user submits text, a picture or a video through the user interface; the system handles each kind of content separately and returns to the user the positions in the surveillance data where the relevant information appears.

The main modules of the system are designed as follows:

1) Background training module: computes a background model from the input background video samples. The algorithm is based on the HSV color space and calculates, for each pixel, the value range within which the pixel belongs to the background.

Input: background video samples.

Output: background model, used for background comparison.

2) Face training module: processes the faces in the face library. The algorithm used is Fisher-Face.

Input: face library.

Output: the transformation matrix computed from the face data in the face library; this matrix transforms an input face into a one-dimensional vector for recognition. Together with the transformation matrix, the center of each person's faces is output for recognition.

3) Background detection module: compares the input frame image with the background model to determine whether the input frame is background and, if not, which regions belong to the foreground.

Input: background model, frame image.

Output: whether the input frame is background and, if not, which regions belong to the foreground.

4) Face detection module: detects faces in non-background frame images.

Input: frame image.

Output: detected face images.

5) Face recognition module: for a detected face image, uses the transformation matrix obtained by training to produce a one-dimensional vector, and uses the Euclidean distance to compute its similarity to each center in order to recognize the face.

Input: face image, transformation matrix.

Output: recognition result.

6) Index structure module: labels the input video, obtains a video index from the face recognition results, and builds an index structure over the index.

Input: surveillance video.

Output: video index.

7) Retrieval module: the user enters query content through the user interface; the retrieval module performs retrieval according to the format of the submitted content and returns information through the user interface.

Input: the query submitted by the user.

Output: information such as video clips, fed back to the user.
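The inputs and outputs of modules 1), 3), 5), 6) and 7) can be illustrated with a short sketch. Everything here is an assumption made for illustration: frames are nested lists of RGB tuples, the per-pixel background range is taken as the minimum and maximum HSV value seen in training (hue wrap-around is ignored), a frame counts as background only when no pixel leaves its range, and the function names are hypothetical. Face detection and the computation of the Fisher-Face matrix `W` are outside the sketch.

```python
import colorsys
import math
from collections import defaultdict


def train_background_model(frames):
    """Module 1: per-pixel (min, max) HSV range over background frames."""
    model = []
    for y in range(len(frames[0])):
        row = []
        for x in range(len(frames[0][0])):
            hsv = [colorsys.rgb_to_hsv(*(c / 255 for c in f[y][x])) for f in frames]
            lo = tuple(min(p[c] for p in hsv) for c in range(3))
            hi = tuple(max(p[c] for p in hsv) for c in range(3))
            row.append((lo, hi))
        model.append(row)
    return model


def detect_foreground(frame, model):
    """Module 3: (x, y) of pixels whose HSV value is outside the trained range."""
    foreground = []
    for y, row in enumerate(frame):
        for x, rgb in enumerate(row):
            hsv = colorsys.rgb_to_hsv(*(c / 255 for c in rgb))
            lo, hi = model[y][x]
            if not all(lo[c] <= hsv[c] <= hi[c] for c in range(3)):
                foreground.append((x, y))
    return foreground  # an empty list means the frame is background


def recognize(W, face_vector, centers):
    """Module 5: project with W, then pick the nearest center (Euclidean)."""
    feature = [sum(w * v for w, v in zip(row, face_vector)) for row in W]
    return min(centers, key=lambda name: math.dist(feature, centers[name]))


def build_index(annotations):
    """Module 6: map each recognized person to sorted (video_id, second) hits."""
    index = defaultdict(list)
    for video_id, second, person in annotations:
        index[person].append((video_id, second))
    return {person: sorted(hits) for person, hits in index.items()}


def retrieve_by_name(index, name):
    """Module 7 (text query only): positions where the named person appears."""
    return index.get(name, [])
```

A picture or video query could reuse `recognize` to turn the submitted image into a name before calling `retrieve_by_name`; that step is omitted here.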

The present invention applies a dedicated preprocessing step to the AVS video stream. Whether for real-time access control monitoring or for offline processing of stored video, the AVS code stream is not fully decoded; instead, the compressed-domain information of AVS is used for background/non-background classification to judge whether the current image is background. If it is background, no subsequent processing is carried out, which improves the processing efficiency of the system. In real-time applications, hardware processing can also be added to accelerate this step.

In the AVS compressed domain, the motion vectors of macroblocks reflect the motion of objects in the video. In background segments the image is relatively static, whereas a person appearing in the scene introduces more motion information into the video. Reference [1] proposes using H.264 motion estimation to classify background and non-background. The present invention applies a similar algorithm to the AVS code stream. Let $\vec{m_i}$ be the motion vector of the i-th macroblock in the current image, 0 ≤ i ≤ N-1, where N is the total number of macroblocks in the current image. The motion intensity MV of the current image is calculated with the following formula:

Formula (1)

where size_i denotes the area of the i-th macroblock.

Motion intensity alone cannot fully characterize the motion state of objects in the current image, so another parameter, MS, is introduced to represent the range of motion in the image:

$$MS = \sum_{i=0}^{N-1} b\_s_i, \qquad b\_s_i = \begin{cases} size_i, & \vec{m_i} \neq 0 \\ 0, & \text{otherwise} \end{cases}$$

Formula (2)

In a background image sequence there is no violent motion in the image, so both the motion intensity and the motion range are limited to small values. Let the threshold of MV be mv_min and the threshold of MS be ms_min. Both are obtained from statistics on empirical data: the smaller mv_min and ms_min are, the higher the precision of background discrimination, and the larger they are, the higher the recall; suitable values are set by balancing the two according to the actual conditions of the system. The current image is judged to be background when the following condition is satisfied:

MV < mv_min and MS < ms_min.
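The background test can be sketched in a few lines. Formula (1) is not reproduced in the text, so the sketch assumes that MV is the area-weighted sum of motion vector magnitudes; that choice, the tuple layout of a macroblock and the function name `classify_background` are all assumptions for illustration. MS follows Formula (2).

```python
import math
from typing import Sequence, Tuple

# ((dx, dy) motion vector, macroblock area in pixels); hypothetical layout
Macroblock = Tuple[Tuple[float, float], float]


def classify_background(blocks: Sequence[Macroblock], mv_min: float, ms_min: float):
    """Return (is_background, MV, MS) for one image."""
    # ASSUMPTION: Formula (1) is taken to be the area-weighted sum of
    # motion vector magnitudes; the source does not reproduce it.
    mv = sum(math.hypot(dx, dy) * size for (dx, dy), size in blocks)
    # Formula (2): total area of macroblocks whose motion vector is non-zero.
    ms = sum(size for (dx, dy), size in blocks if (dx, dy) != (0, 0))
    return (mv < mv_min and ms < ms_min), mv, ms
```

For example, two 16x16 macroblocks with vectors (0, 0) and (3, 4) give MV = 1280 under this assumption and MS = 256, so the image is background for mv_min = 2000, ms_min = 512 but not for mv_min = 1000.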

Classifying background and non-background not only improves the efficiency of the system but also collects statistics for each monitoring point, from which environmental information about that point can be inferred. For example, the distribution of non-background frames within the surveillance sequence shows during which periods a monitoring point is crowded, so the point can be deployed more appropriately, for instance by raising the recording frame rate during periods of dense pedestrian flow and lowering it during periods of sparse flow.
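As a rough illustration of this use of the statistics, the sketch below counts non-background frames per hour of day and assigns a recording frame rate to each hour. The hourly granularity, the threshold and the two frame rates (25 and 5 fps) are assumptions chosen for the example, not values from the description.

```python
from collections import Counter
from typing import Dict, Iterable


def frame_rate_plan(
    non_background_hours: Iterable[int],
    busy_threshold: int,
    high_fps: int = 25,  # assumed rate for crowded hours
    low_fps: int = 5,    # assumed rate for quiet hours
) -> Dict[int, int]:
    """Map each hour of day (0-23) to a recording frame rate.

    non_background_hours holds one entry per non-background frame,
    giving the hour of day at which the frame was captured.
    """
    counts = Counter(non_background_hours)
    return {
        hour: high_fps if counts[hour] >= busy_threshold else low_fps
        for hour in range(24)
    }
```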

After background detection, face detection is performed on images judged not to be background. Face detection uses the AdaBoost algorithm [2]. To improve the processing efficiency of the system, detection is performed locally rather than globally.

After face detection, each detected face image is scaled to a uniform size and scanned from left to right and from top to bottom into a sample vector, whose dimensionality is then reduced. The classical Fisher-Face algorithm, which combines PCA with LDA, is used to extract the projection features of the face [3] (PCA: Principal Component Analysis; LDA: Linear Discriminant Analysis). LDA is applied in the space obtained after PCA dimensionality reduction, yielding the feature vector of the detected face. After feature extraction, a minimum-distance classifier compares the face with the faces in the library to recognize it.

Let the sample vector of face f after Fisher-Face feature extraction be f', f' = (u_0, u_1, ..., u_k). Its distance to each training sample is then calculated:

$$d(f', f_i') = \sqrt{\sum_{l=0}^{k} (u_l - v_l)^2}$$

Formula (3)

where f_i' = (v_0, v_1, ..., v_k) denotes the i-th training sample in the library, k is the sample dimension, and d(f', f_i') is the distance between the current sample to be recognized and the i-th training sample in the library.

After the distances between f' and all samples in the library have been calculated, the 5 samples with the smallest distances are selected: f_{i1}', f_{i2}', ..., f_{i5}'. A class here is the set of samples belonging to the same person, and class c is the class to which the largest number of these 5 samples belong. If the 5 samples each belong to a different class, the class of the sample f_{i1}' nearest to f' is taken as c. The confidence t of the recognition is calculated with the following formula:

$$t = \frac{\sum_{j:\, f_{ij}' \in c} d(f', f_{ij}')}{\sum_{j=1}^{5} d(f', f_{ij}')}$$

Formula (4)

When the confidence t is less than the threshold t_min, the face is taken to be a stranger and this result is fed back to the user; after the user confirms, the new face is added to the library. Otherwise the recognition result is considered reliable and is recorded. The threshold t_min is obtained from statistics on empirical data: the higher t_min is, the higher the precision, and the lower t_min is, the higher the recall, so a suitable t_min is set by balancing the two according to the actual conditions of the system.
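The nearest-sample vote and the confidence of Formula (4) can be sketched as below. The library layout (a list of `(label, feature_vector)` pairs), the function name and two details are assumptions: ties between classes are broken in favor of the class whose sample is nearest, which also covers the case where all 5 samples differ, and t is set to 1.0 when every selected distance is zero.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def recognize_with_confidence(
    feature: Sequence[float],
    library: List[Tuple[str, Sequence[float]]],
    n_nearest: int = 5,
) -> Tuple[str, float]:
    """Return (class c, confidence t) for a Fisher-Face feature vector."""
    # Formula (3): Euclidean distance to every training sample, nearest first.
    ranked = sorted(
        ((math.dist(feature, vec), label) for label, vec in library),
        key=lambda pair: pair[0],
    )[:n_nearest]
    # Majority class among the nearest samples. Counter keeps first-seen
    # order for equal counts, so ties go to the class of the nearest sample.
    votes = Counter(label for _, label in ranked)
    c = votes.most_common(1)[0][0]
    total = sum(d for d, _ in ranked)
    if total == 0:
        return c, 1.0  # ASSUMPTION: all selected samples coincide with the query
    # Formula (4): distances of the class-c samples over all selected distances.
    t = sum(d for d, label in ranked if label == c) / total
    return c, t
```

The returned t is then compared with t_min as described above: below the threshold the face is reported as a stranger, otherwise the result is recorded.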

In summary, the steps of the AVS-based video monitoring system proposed by the present invention and of its implementation method are: 1. obtain the AVS code stream with an AVS camera; 2. perform background classification, face detection, background training and face training on the AVS code stream; 3. compare and recognize the faces; 4. obtain the query result.

Description of drawings

Fig. 1 is the core flowchart of this video monitoring system.

Fig. 2 shows the main modules and the workflow of the system.

Reference numerals in the figures: 1 training module; 2 labeling module; 3 retrieval module; 4 face library; 5 background sample library; 6 background training module; 7 face training module; 8 background model; 9 face feature transformation matrix; 10 background detection module; 11 face detection module; 12 face recognition module; 13 index structure module; 14 surveillance video; 15 retrieval index; 16 retrieval module.

Embodiment

For example, in an access control application of the present invention, the system can be divided into five parts: front-end camera, AVS video database, video processing and comparison/recognition, face database, and entry information query. In an access control system the camera position is relatively fixed, the shooting angle and the background are fixed, and the lighting does not change sharply in an indoor environment such as an office building. Because the driver shipped with the camera does not support segmentation and remote storage, a driver program is written on top of the camera's own driver according to the application requirements, so that the video is segmented automatically during shooting and the resulting AVS video segments are stored in a designated database. At the same time, the segmented AVS code streams are processed sequentially in real time. Background classification is carried out first; if a video segment is background, it is not processed further. After background detection, face detection is performed on images judged not to be background. To improve the processing efficiency of the system, detection is performed locally rather than globally; the detection method is described in detail above and is not repeated here. When a face is detected and the confidence t (computed as described above) is less than the threshold t_min (described above), which can be set to 0.85 in an actual implementation of the system, a message such as "this face is not in the library; it is a stranger" is fed back to alert the user. After the user confirms, the new face can be added to the library, and the result can be stored in the face database. If t is not less than t_min, the recognition result is reliable and the person is already in the original face database; the system automatically looks up and reports the person's name and records the time of entry. This is one practical application of the present invention.
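The final decision step of this access control example might look like the sketch below. Only the threshold value 0.85 comes from the text; the function name `handle_entry`, the message strings and the in-memory `entry_log` list are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

T_MIN = 0.85  # threshold value suggested in the embodiment


def handle_entry(
    name: str,
    t: float,
    entered_at: datetime,
    entry_log: List[Tuple[str, datetime]],
) -> str:
    """Report a stranger, or log a recognized person's entry time."""
    if t < T_MIN:
        # Low confidence: alert the user; adding the face to the library
        # would happen only after the user confirms (not modeled here).
        return "This face is not in the library; it is a stranger."
    entry_log.append((name, entered_at))
    return f"{name} entered at {entered_at.isoformat()}"
```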

List of references:

[1] Hui H., Liu H., Wu Y., Liang Y. Video surveillance method based on H.264 standard [J]. Computer Applications, 2005, 25(11): 131-133.

[2] Freund Y., Schapire R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting [J]. Journal of Computer and System Sciences, 1997, 55(1): 119-139.

[3] Belhumeur P., Hespanha J. Eigenfaces vs. Fisherfaces: recognition using class specific linear projection [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(7): 711-720.

Claims

1. A video monitoring method based on AVS, characterized in that the concrete steps are as follows: first, an AVS code stream is acquired with an AVS network camera, and the compressed-domain information available during AVS code stream decoding is used to classify background and non-background; when the classification result shows that the current frame is not background, face detection is performed; when a face is detected, face recognition is performed, that is, the face data is transformed and compared with the training data; before the recognition result is fed back to the user, a confidence t is first calculated, t indicating the credibility of the current recognition result; when the confidence t is less than the threshold t_min, the face is considered not to belong to the data in the current library and is identified as a stranger, this result is fed back to the user, and after the user confirms, the new face is added to the library; when the confidence is greater than or equal to the threshold t_min, the recognition result has high credibility, and the recognition result is recorded and the video is labeled; here AVS refers to the advanced audio/video encoding standard.

2. The method according to claim 1, characterized in that the background classification is performed as follows: let $\vec{m_i}$ be the motion vector of the i-th macroblock in the current image, 0 ≤ i ≤ N-1, where N is the total number of macroblocks in the current image;

the motion intensity MV of the current image is calculated with the following formula:

Formula (1)

where size_i denotes the area of the i-th macroblock;

the parameter MS represents the range of motion in the image:

$$MS = \sum_{i=0}^{N-1} b\_s_i, \qquad b\_s_i = \begin{cases} size_i, & \vec{m_i} \neq 0 \\ 0, & \text{otherwise} \end{cases}$$

Formula (2)

the current image is determined to be background when the following condition is met:

MV < mv_min and MS < ms_min, where mv_min is the threshold of MV and ms_min is the threshold of MS.

3. The method according to claim 1, characterized in that the face recognition is performed as follows: each face image detected in face detection is scaled to a uniform size and scanned from left to right and from top to bottom into a sample vector, and the dimensionality of the sample vector is then reduced; the Fisher-Face algorithm combining PCA and LDA is used to extract the projection features of the face;

let the sample vector of face f after Fisher-Face feature extraction be f', f' = (u_0, u_1, ..., u_k); its distance to each training sample is then calculated:

$$d(f', f_i') = \sqrt{\sum_{l=0}^{k} (u_l - v_l)^2}$$

Formula (3)

where f_i' = (v_0, v_1, ..., v_k) denotes the i-th training sample in the library, k is the sample dimension, and d(f', f_i') denotes the distance between the current sample to be recognized and the i-th training sample in the library;

after the distances between f' and all training samples in the library have been calculated, the 5 training samples with the smallest distance from f' are found, f_{i1}', f_{i2}', ..., f_{i5}'; a class refers to the set of samples belonging to the same person, and class c is the class to which the largest number of the five training samples belong; if the five samples each belong to a different class, the class of the sample with the smallest distance from f' is taken as class c; the confidence t of the recognition is calculated with the following formula:

$$t = \frac{\sum_{j:\, f_{ij}' \in c} d(f', f_{ij}')}{\sum_{j=1}^{5} d(f', f_{ij}')}$$

Formula (4)

when the confidence t is less than the threshold t_min, the face is a stranger, this result is fed back to the user, and the new face is added to the library after the user confirms; otherwise, the recognition result is reliable and the result is recorded.

4. A video monitoring system based on AVS, characterized in that the system includes a training module, a labeling module and a retrieval module:

the training module comprises a background training module for the background of the monitored environment and a face training module for the face library, which train on the environment background and on faces respectively; the inputs of the training module are the face sample library and the background sample library, and the outputs are the face features and the background features;

the labeling module is used to label the input surveillance video automatically and includes a background detection module, a face detection module, a face recognition module and an index structure building module; the inputs of the labeling module are the background features and face features obtained by the training module and the surveillance video to be labeled, and the output is the retrieval index of the surveillance video to be labeled;

the retrieval module is used to retrieve from a specified surveillance video, including picture query, text query and video query; its inputs are the index of the specified surveillance video and the picture, text or short video clip submitted by the user, and its output is the video segment information in the surveillance video that corresponds to the content submitted by the user;

the background training module is used to compute a background model from the input background video samples, using an algorithm based on the HSV color space that calculates the value range within which each pixel belongs to the background; the input of the background training module is the background video samples, and the output is the background model used for background comparison;

the face training module is used to process the faces in the face library, using the Fisher-Face algorithm; the input of the face training module is the face samples in the face library, and the output is the transformation matrix computed from those face samples together with the center of each face, the transformation matrix transforming an input face sample into a one-dimensional vector;

the background detection module is used to compare the input frame image with the background model in order to determine whether the input frame is background and, if it is not, that the input frame belongs to the foreground; the inputs of the background detection module are the background model and the input frame image, and the output is the result of judging whether the input frame is background;

the face detection module is used to detect faces in non-background frame images; the input of the face detection module is a non-background frame image, and the output is the detected face image;

the face recognition module is used to obtain a one-dimensional vector from the detected face image with the transformation matrix and to calculate, with the Euclidean distance, the similarity between that vector and each face center; the inputs of the face recognition module are the detected face image and the transformation matrix, and the output is the face recognition result;

the index structure building module is used to label the input video, obtain a video index from the face recognition results, and build an index structure over the index; the input of the index structure building module is the surveillance video, and the output is the video index;

the retrieval module is used to let the user enter query content through the user interface, perform retrieval according to the format of the content submitted by the user, and feed back information through the user interface; the input of the retrieval module is the query content submitted by the user,

and the output is the video segment information fed back to the user.