Disclosure of Invention
The invention aims to provide an intelligent reading learning method based on coordinate recognition, which does not require codes to be prefabricated on books, is free of the restriction that point-reading content is limited by such codes, can ensure the accuracy of the broadcast content, and facilitates accurate recognition of point-reading actions.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
An intelligent reading learning method based on coordinate recognition comprises the following steps:
S1, placing a printed matter on an auxiliary pad, wherein a placement area and a marking area are formed on the auxiliary pad, the placement area is used for placing the printed matter, the marking area surrounds the placement area, and a scale object representing position information is arranged on the marking area;
S2, shooting a printed matter page to be learned by using a camera of a learning rod, and identifying page information to obtain original page information;
S3, using a selected reference object to touch a region of interest on the printed matter page to be learned, continuously shooting through the camera of the learning rod to obtain a reference page image containing the selected reference object, the printed matter and the auxiliary pad, and identifying the selected reference object from the reference page image;
S4, calculating the position information, on the printed matter, of the selected reference object identified in step S3, based on the scale object of the auxiliary pad in the reference page image;
S5, acquiring a multimedia file preset at the corresponding position based on the position information acquired in step S4 and the original page information acquired in step S2, and playing the multimedia file.
Preferably, the selected reference object is a human hand, a finger, a pen-shaped object, or an object with a light emitting device at its tip.
Preferably, the step S3 further includes performing point-touch recognition on the reference page image to obtain gesture command information, and the step S5 further includes selecting a playing mode of the multimedia file based on the gesture command information.
Preferably, the step S2 comprises the following sub-steps:
S21, acquiring an original page image of the printed matter in advance and extracting feature points to obtain a page feature library, and building in advance a multimedia content library for corresponding areas, in which certain specific areas on the original page of the printed matter correspond to certain multimedia files;
S22, continuously shooting the page of the printed matter to be learned with the camera of the learning rod to obtain an image of the page to be learned, extracting feature points, searching the page feature library based on the extracted feature points, and performing page feature matching to obtain the original page information of the page to be learned from the page feature library.
Preferably, the feature point extraction is realized by feature extraction algorithms such as SIFT and SURF.
Preferably, the feature point extraction is realized by the following method:
graying the image;
extracting feature points by using a key point detection algorithm;
performing feature point direction identification based on histogram statistics;
and describing the feature points to obtain feature descriptors.
Preferably, the page feature matching is realized by algorithms such as the Euclidean distance between feature values, the cosine similarity of feature vectors, or the correlation coefficient.
Preferably, the page feature matching is achieved by the following method:
performing dimension reduction, hash transformation and sorting on the feature descriptors corresponding to the feature points extracted from the page image to be learned, comparing the resulting hash values with the hash values of the feature points stored in the page feature library, and recognizing a pair of feature points as matched if the distance is smaller than a preset first threshold;
counting the number of matched feature points, and recognizing that the page image to be learned matches the corresponding original page image if the number of matched feature points is larger than a preset second threshold.
Preferably, the step S4 includes the following sub-steps:
S41, carrying out coordinate mapping based on the scale object of the auxiliary pad in the reference page image, and calculating the position information, on the auxiliary pad, of the selected reference object identified in step S3;
S42, converting coordinate values of the selected reference object on the auxiliary pad into position information of the selected reference object on the printed matter based on the mapping relation between the auxiliary pad and the printed matter.
Preferably, the click action of the selected reference object in step S4 includes a single click or a double click.
After the above technical scheme is adopted, compared with the background art, the invention has the following advantages:
the invention does not require codes prefabricated on books, is free of the restriction that point-reading content is limited by such codes, can ensure the accuracy of content broadcasting, and facilitates accurate recognition of point-reading actions.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Examples
The invention discloses an intelligent reading learning method based on coordinate recognition. As shown in fig. 1 and fig. 2, the method comprises the following steps:
S1, placing the printed matter on an auxiliary pad. Referring to fig. 2, the auxiliary pad 100 is formed with a placement area for placing the printed matter 300 and a marking area surrounding the placement area, on which a scale 110 representing position information is provided. The learning rod 200 is placed on the outside of the auxiliary pad 100 for photographing.
S2, shooting the printed matter page to be learned with the camera of the learning rod, and identifying page information to obtain original page information. This step specifically comprises the following sub-steps:
S21, acquiring an original page image of the printed matter in advance, extracting feature points, and performing dimension reduction, hash transformation and sorting, so as to obtain a page feature library; and building in advance a multimedia content library for corresponding areas, in which certain specific areas on the original page of the printed matter correspond to certain multimedia files.
The feature point extraction may employ any feature point extraction algorithm, including but not limited to SIFT, SURF and variants thereof; the present invention imposes no particular limitation. In this embodiment, feature point extraction may be achieved by the following method:
a. Image graying processing. The acquired image is a color image (for example, an RGB three-channel color image), and graying needs to be performed first to facilitate the subsequent steps. In this embodiment, the graying formula is:
Gray=(R*30+G*59+B*11+50)/100
wherein Gray is the gray value and R, G, B are the red, green and blue channel values.
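As a non-limiting illustration, the following minimal sketch (assuming NumPy; the function name is illustrative) applies exactly the integer-weighted graying formula above to an RGB image array:

```python
import numpy as np

def to_gray(img):
    """Gray = (R*30 + G*59 + B*11 + 50) / 100, applied per pixel to an (H, W, 3) RGB uint8 image."""
    R = img[..., 0].astype(np.uint32)
    G = img[..., 1].astype(np.uint32)
    B = img[..., 2].astype(np.uint32)
    return ((R * 30 + G * 59 + B * 11 + 50) // 100).astype(np.uint8)
```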
b. Extracting feature points using a key point detection algorithm. The original image is continuously downsampled to obtain a series of images of different sizes; Gaussian filtering is then applied to the images at the different scales; two Gaussian-filtered images of the same image at adjacent scales are subtracted to obtain a difference-of-Gaussian image, and extremum detection is performed, the extremum points satisfying the curvature condition being the feature points. The difference-of-Gaussian image D(x, y, σ) is computed as follows, where G(x, y, σ) is the Gaussian filter function, I(x, y) is the original image, and L(x, y, σ) is the Gaussian-filtered image at scale σ:
D(x, y, σ) = (G(x, y, σ(s+1)) - G(x, y, σ(s))) * I(x, y)
= L(x, y, σ(s+1)) - L(x, y, σ(s))
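For illustration only, a minimal OpenCV sketch of the adjacent-scale subtraction described above (the downsampling pyramid is omitted, and the scale values are illustrative assumptions):

```python
import cv2
import numpy as np

def difference_of_gaussians(gray, sigma=1.6, k=1.414):
    """D = L(x, y, sigma(s+1)) - L(x, y, sigma(s)): two Gaussian-filtered images of adjacent scales, subtracted."""
    g = gray.astype(np.float32)
    L1 = cv2.GaussianBlur(g, (0, 0), sigmaX=sigma)        # L at scale sigma(s)
    L2 = cv2.GaussianBlur(g, (0, 0), sigmaX=k * sigma)    # L at scale sigma(s+1)
    return L2 - L1                                        # difference-of-Gaussian image D
```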
c. Identifying feature point directions based on histogram statistics. After the gradients of the feature points are computed, a histogram is used to count the gradients and directions of the pixels in the neighborhood. The gradient histogram divides the 0-360 degree direction range into 18 bins of 20 degrees each. The peak direction of the histogram is taken as the main direction of the feature point. With L the scale-space value at the key point, the gradient magnitude m and direction θ of each pixel are calculated as:
m(x, y) = sqrt((L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2)
θ(x, y) = tan^(-1)((L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)))
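A minimal NumPy sketch of the 18-bin direction histogram described above; the gradient magnitudes and directions of the neighborhood pixels are assumed to be precomputed arrays, and the function name is illustrative:

```python
import numpy as np

def main_direction(magnitudes, directions_deg, num_bins=18):
    """Dominant direction (degrees) from an 18-bin, 20-degree-wide gradient histogram."""
    hist, edges = np.histogram(directions_deg % 360, bins=num_bins,
                               range=(0, 360), weights=magnitudes)
    peak = np.argmax(hist)                       # bin with the largest accumulated magnitude
    return (edges[peak] + edges[peak + 1]) / 2   # centre of the peak bin = main direction
```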
d. Describing the feature points to obtain feature descriptors. A 21×21 neighborhood is determined for each feature point and rotated to the main direction; the horizontal and vertical gradients of the pixels in the neighborhood are calculated, so that each feature point yields a feature descriptor of size 19×19×2 = 722; the description of a feature point includes its coordinates, scale and direction. It should be noted that, since the obtained feature descriptors are high-dimensional (722-dimensional in this embodiment), dimension reduction and hash transformation are performed to facilitate subsequent processing. In this embodiment, principal component analysis (PCA in fig. 3) reduces the descriptors to 20 dimensions, and locality-sensitive hashing (LSH in fig. 3) then maps each 20-dimensional descriptor to a single 32-bit floating point value. The specific operation of PCA is as follows:
Firstly, a feature matrix X is constructed from the feature data of a large number of acquired images; the eigenvalues of X are obtained and sorted by magnitude, and the eigenvectors corresponding to these eigenvalues form the transformation matrix W. Given the transformation matrix W, for the feature data Y of any acquired image, Z = Y·W^T projects the original feature matrix Y onto the matrix Z, reducing the high-dimensional feature matrix Y to a low-dimensional new feature matrix Z whose new features are linearly uncorrelated.
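As a non-limiting sketch of this projection, assuming NumPy and an already-built matrix of 722-dimensional descriptors (the eigen-decomposition is taken on the covariance matrix, as is usual for PCA; names are illustrative):

```python
import numpy as np

def fit_pca(X, k=20):
    """Build the transformation matrix W from a descriptor matrix X of shape (n_samples, 722)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the feature data
    cov = np.cov(Xc, rowvar=False)                 # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:k]          # keep the k largest eigenvalues
    W = eigvecs[:, order].T                        # transformation matrix W, shape (k, 722)
    return mean, W

def project_pca(Y, mean, W):
    """Z = (Y - mean) · W^T: project high-dimensional descriptors Y to k dimensions."""
    return (Y - mean) @ W.T

# usage (hypothetical data): mean, W = fit_pca(descriptors722, k=20); Z = project_pca(descriptors722, mean, W)
```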
The specific operation of LSH is as follows:
(1) Selecting a locality-sensitive hash function family satisfying (d1, d2, p1, p2)-sensitivity;
(2) Determining the number L of hash tables, the number K of hash functions in each table, and other parameters related to the sensitive hashing, according to the required accuracy of the search results;
(3) Hashing all data into the corresponding buckets through the locality-sensitive hash functions to form one or more hash tables.
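A minimal sketch of such a scheme, using random-projection hashing with NumPy; the table and hash counts are placeholders, and, unlike the embodiment's mapping to a single 32-bit value, this sketch uses bit-vector bucket keys purely for simplicity:

```python
import numpy as np

class SimpleLSH:
    """Random-projection LSH: hash each low-dimensional descriptor into buckets of L tables."""
    def __init__(self, dim=20, num_tables=4, hashes_per_table=8, seed=0):
        rng = np.random.default_rng(seed)
        # one set of K random hyperplanes per hash table
        self.planes = [rng.normal(size=(hashes_per_table, dim)) for _ in range(num_tables)]
        self.tables = [dict() for _ in range(num_tables)]

    def _key(self, planes, v):
        bits = (planes @ v) > 0                  # sign of the projection onto each hyperplane
        return tuple(bits.tolist())              # K-bit bucket key

    def insert(self, v, payload):
        for planes, table in zip(self.planes, self.tables):
            table.setdefault(self._key(planes, v), []).append(payload)

    def query(self, v):
        candidates = []
        for planes, table in zip(self.planes, self.tables):
            candidates.extend(table.get(self._key(planes, v), []))
        return candidates                        # candidate matches gathered from all L tables

# usage (hypothetical): index = SimpleLSH(); index.insert(descriptor20, page_id); hits = index.query(query20)
```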
S22, continuously shooting the page of the printed matter to be learned with the camera of the learning rod to obtain the image of the page to be learned, extracting feature points, searching the page feature library based on the extracted feature points, and performing page feature matching to obtain the original page information of the page to be learned from the page feature library. The algorithm adopted for page feature matching includes but is not limited to the Euclidean distance between feature values, the cosine similarity of feature vectors, the correlation coefficient, and the like; the present invention imposes no particular limitation. Referring to fig. 3, in this embodiment the page feature matching is achieved by the following method:
Dimension reduction, hash transformation and sorting are performed on the feature descriptors corresponding to the feature points extracted from the page image to be learned, and the resulting hash values are compared with the hash values of the feature points stored in the page feature library; if the distance is smaller than a preset first threshold, the pair of feature points is recognized as matched. The distance calculation computes the distance between the hash value of a feature point and the 2L candidate data entries in the page feature library, where the distance is defined as, but not limited to, the absolute value of the difference between the two values; a feature point pair is judged to be matched when the distance is smaller than the set first threshold.
The number of matched feature points is counted, and if it is larger than a preset second threshold, the page image to be learned is recognized as matching the corresponding original page image.
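For illustration only, a minimal sketch of this two-threshold rule, assuming each feature point has already been reduced to a single scalar hash value as described above; the threshold values and names are placeholders:

```python
def match_page(query_hashes, page_hashes, first_threshold=0.01, second_threshold=30):
    """Count matched feature points between the page to be learned and one library page.

    A pair of feature points matches when the absolute difference of their hash values
    is below first_threshold; the page matches when the number of matched pairs
    exceeds second_threshold.
    """
    matched = 0
    for q in query_hashes:
        if any(abs(q - h) < first_threshold for h in page_hashes):
            matched += 1
    return matched > second_threshold, matched

# usage (hypothetical): ok, n = match_page(hashes_of_page_to_learn, hashes_of_library_page)
```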
S3, using the selected reference object to touch the region of interest on the printed matter page to be learned, continuously shooting through the camera of the learning rod to obtain a reference page image containing the selected reference object, the printed matter and the auxiliary pad, identifying the selected reference object from the reference page image, and recognizing the point-touch action from the reference page image to obtain gesture instruction information.
In this embodiment, the selected reference object is a human finger. Of course, the selected reference object is not limited to a human finger; it may also be a pen-shaped object, an object with a light emitting device at its tip, or the like. The point-touch actions involved in this step include, but are not limited to, a single click, a double click, or other actions with distinct features. Different gesture instructions may be represented by different touch actions; for example, a single click triggers point reading, while a double click calls up a multimedia presentation.
Fingertip recognition can be realized by the following method:
s31, converting the color space of the reference page image into a YCbCr color space. For reference page images in RGB format, the color space conversion is achieved by the following formula:
Y=0.257*R+0.564*G+0.098*B+16
Cb=-0.148*R-0.291*G+0.439*B+128
Cr=0.439*R-0.368*G-0.071*B+128
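A minimal NumPy sketch of this conversion on an (H, W, 3) RGB image with values in 0-255, using exactly the coefficients above; the function name is illustrative:

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Convert an (H, W, 3) uint8 RGB image to YCbCr using the formulas above."""
    R = img[..., 0].astype(np.float32)
    G = img[..., 1].astype(np.float32)
    B = img[..., 2].astype(np.float32)
    Y  =  0.257 * R + 0.564 * G + 0.098 * B + 16
    Cb = -0.148 * R - 0.291 * G + 0.439 * B + 128
    Cr =  0.439 * R - 0.368 * G - 0.071 * B + 128
    return np.stack([Y, Cb, Cr], axis=-1)
```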
S32, skin color segmentation: candidate hand areas are obtained based on histogram statistics. Histograms are established for Y, Cb and Cr respectively; lower limits Y1, Cb1, Cr1 and upper limits Y2, Cb2, Cr2 are selected according to the histogram peaks, and pixels satisfying Y1 < Y < Y2, Cb1 < Cb < Cb2 and Cr1 < Cr < Cr2 are judged to be candidate skin color points. The image is binarized, non-candidate skin color points are set to 0 and candidate skin color points are set to 1. Dilation and connected-component analysis are applied to the binary image, the largest connected region with value 1 is retained and its area S0 is calculated; if S0/S > Th1, where Th1 is a preset threshold, the connected region is judged to be a candidate hand area.
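As an illustrative sketch only, assuming NumPy, SciPy and the YCbCr image from the previous sketch; the limit values here are generic skin-color placeholders, not the histogram-derived limits of the embodiment:

```python
import numpy as np
from scipy import ndimage

def candidate_hand_area(ycbcr, y_lim=(60, 235), cb_lim=(77, 127), cr_lim=(133, 173), th1=0.01):
    """Return the largest skin-colored connected region if its area ratio exceeds Th1, else None."""
    Y, Cb, Cr = ycbcr[..., 0], ycbcr[..., 1], ycbcr[..., 2]
    mask = ((Y > y_lim[0]) & (Y < y_lim[1]) &
            (Cb > cb_lim[0]) & (Cb < cb_lim[1]) &
            (Cr > cr_lim[0]) & (Cr < cr_lim[1]))          # binarization: candidate skin points = 1
    mask = ndimage.binary_dilation(mask)                  # dilation before connectivity analysis
    labels, n = ndimage.label(mask)                       # connected-component labelling
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    largest = labels == (np.argmax(sizes) + 1)            # keep the largest connected region
    if largest.sum() / mask.size > th1:                   # area ratio test S0 / S > Th1
        return largest
    return None
```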
S33, hand contour recognition is performed to judge whether the candidate hand area is a real hand area in the point-reading state. The candidate hand region is projected longitudinally, the projection image is divided into upper and lower parts at the average height, the length l and the height h of the lower part of the projection image are calculated, and the image width w is obtained; if h/w > Th2 and h/l < Th3 are satisfied simultaneously (Th2 and Th3 being preset thresholds), the connected region is judged to be a real hand region.
S34, calculating the center of gravity of the region. Assuming the hand region has N pixel points with image coordinates (xi, yi), i = 1…N, the barycentric coordinates of the region are xc = sum(xi)/N, yc = sum(yi)/N.
S35, identifying the fingertip position based on the center of gravity of the region. The contour of the hand region is extracted, the distance between each contour pixel of the real hand region and the center of gravity is calculated, and the pixel with the largest distance is judged to be the fingertip position: (xp, yp) = argmax((xi - xc)^2 + (yi - yc)^2), i = 1…M, where (xp, yp) are the fingertip coordinates and M is the number of contour pixels.
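A minimal sketch of S34 and S35 under the same assumptions (NumPy, plus OpenCV for contour extraction; helper names are illustrative):

```python
import numpy as np
import cv2

def fingertip_from_mask(hand_mask):
    """Fingertip = contour point of the hand region farthest from the region's center of gravity."""
    ys, xs = np.nonzero(hand_mask)
    xc, yc = xs.mean(), ys.mean()                          # center of gravity of the region (S34)
    contours, _ = cv2.findContours(hand_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    pts = max(contours, key=cv2.contourArea).reshape(-1, 2)   # contour of the hand region
    d2 = (pts[:, 0] - xc) ** 2 + (pts[:, 1] - yc) ** 2        # squared distance to the centroid
    xp, yp = pts[np.argmax(d2)]                               # farthest contour point (S35)
    return int(xp), int(yp)
```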
Fingertip recognition can also be achieved by the following method, as shown in fig. 4:
Firstly, skin color segmentation is carried out, followed by candidate region segmentation. The maximum and minimum values xmin, ymin, xmax and ymax are found from the coordinates of the pixel points of the candidate hand area, the image of the rectangular area (xmin, xmax, ymin, ymax) is acquired and fed into a convolutional neural network, and the fingertip is located in the recognized finger region.
The specific process of the convolutional neural network for identifying the finger is as follows:
1. Collecting training samples: collecting a large number of images containing finger reading, marking the finger images, preprocessing the finger images, and constructing a training sample set;
2. Constructing a network model: constructing a CNN feature extraction network and a discrimination network, wherein the CNN feature extraction network consists of a convolution layer, an excitation function layer and a pooling layer, and the discrimination network consists of a region-of-interest pooling layer, a fully connected layer, an excitation function layer and a Softmax layer (a minimal illustrative sketch of such networks follows this list);
3. Training the network model: initializing the CNN feature extraction network and the discrimination network, and training the CNN feature extraction network by inputting labeled finger images to obtain a CNN feature extraction network model; the discrimination network is trained on the candidate areas according to the feature maps provided by the CNN feature extraction network to obtain a discrimination network model;
4. Constructing a detection model: combining the CNN feature extraction network model and the discrimination network into a detection network, and training this network with the finger image training data to obtain a finger detection and key point localization network model;
5. Finger detection: detecting the finger contour and locating the key points with the obtained detection network model to obtain finger image candidate boxes and the located finger key points.
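The following is a minimal, non-limiting sketch of such a two-part network, assuming PyTorch and torchvision; the layer sizes, class count and names are illustrative assumptions rather than the embodiment's actual architecture:

```python
import torch.nn as nn
from torchvision.ops import RoIPool

class FingerFeatureNet(nn.Module):
    """CNN feature extraction network: convolution, excitation (ReLU) and pooling layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)              # feature maps passed to the discrimination network

class FingerDiscriminator(nn.Module):
    """Discrimination network: region-of-interest pooling, fully connected, excitation and Softmax layers."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.roi_pool = RoIPool(output_size=(7, 7), spatial_scale=1.0 / 8)  # three poolings -> stride 8
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, num_classes), nn.Softmax(dim=1),
        )

    def forward(self, feature_maps, rois):
        # rois: (K, 5) tensor of [batch_index, x1, y1, x2, y2] candidate regions
        return self.classifier(self.roi_pool(feature_maps, rois))
```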
The fingertip positioning process is as follows:
After the finger is identified, the contour of the finger is extracted from the finger area obtained by skin color segmentation; the fingertip position is (xp, yp) = argmin(yi), i = 1…M, where M is the number of contour pixel points.
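A correspondingly minimal sketch of this rule, assuming the finger contour is available as an (M, 2) NumPy array of (x, y) pixel coordinates and that the finger points toward the top of the image (which is what argmin(yi) implies):

```python
import numpy as np

def fingertip_topmost(contour_pts):
    """Fingertip = contour point with the smallest y coordinate: (xp, yp) = argmin(yi)."""
    idx = np.argmin(contour_pts[:, 1])
    return int(contour_pts[idx, 0]), int(contour_pts[idx, 1])
```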
S4, calculating the position information, on the printed matter, of the selected reference object identified in step S3, based on the scale object of the auxiliary pad in the reference page image. This step specifically comprises the following sub-steps:
S41, carrying out coordinate mapping based on the scale object of the auxiliary pad in the reference page image, and calculating the position information, on the auxiliary pad, of the selected reference object identified in step S3.
S42, converting coordinate values of the selected reference object on the auxiliary pad into position information of the selected reference object on the printed matter based on the mapping relation between the auxiliary pad and the printed matter.
The coordinate position of the selected reference object can be calculated by either of the following two methods:
Method 1: using the marker points on the auxiliary pad as projection mapping points, a transformation matrix is generated; the real-time image is projectively transformed into a rectangular image, and the coordinates of the selected reference object (finger) are fed into the same transformation matrix to obtain coordinates within the transformed rectangle, which correspond to positions on the printed matter.
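For method 1, a minimal OpenCV sketch of such a projective mapping, assuming four marker points of the auxiliary pad have been detected in the camera image and the physical size of the placement area is known; coordinates, ordering and sizes are illustrative assumptions:

```python
import numpy as np
import cv2

def map_finger_to_pad(marker_pts_img, pad_width, pad_height, finger_xy):
    """Map the fingertip coordinates from the camera image onto the rectangular auxiliary pad.

    marker_pts_img: four detected marker points in the image, ordered
                    top-left, top-right, bottom-right, bottom-left.
    pad_width, pad_height: size of the placement area in pad units (e.g. millimetres).
    finger_xy: fingertip coordinates (x, y) in the camera image.
    """
    src = np.float32(marker_pts_img)
    dst = np.float32([[0, 0], [pad_width, 0], [pad_width, pad_height], [0, pad_height]])
    H = cv2.getPerspectiveTransform(src, dst)           # transformation matrix
    pt = np.float32([[finger_xy]])                      # shape (1, 1, 2), as OpenCV expects
    mapped = cv2.perspectiveTransform(pt, H)[0, 0]      # finger coordinates on the pad
    return float(mapped[0]), float(mapped[1])

# Step S42 can then convert pad coordinates to printed-matter coordinates with a simple
# offset and scale, since the position of the printed matter in the placement area is known.
```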
Method 2: according to the marker points on the auxiliary pad, the distances between the coordinates of the selected reference object (finger) and the marker points are calculated. Knowing the coordinates of three marker points and their distances to the finger before transformation, and the coordinates of the same three points within the rectangle after transformation together with the corresponding distances (scaled in proportion to the distances before transformation), the finger position on the transformed plane is obtained as the intersection point of the three circles.
S5, acquiring the multimedia file preset at the corresponding position based on the position information acquired in step S4 and the original page information acquired in step S2, selecting a playing mode of the multimedia file based on the gesture instruction information, and playing the multimedia file. The multimedia file may be an audio file, an image or a video; the present invention imposes no particular limitation. The multimedia can be played using a display screen (an optional component) or a loudspeaker integrated on the learning rod; alternatively, the WIFI or Bluetooth function of the learning rod can be used to connect an external intelligent terminal and play the multimedia on the screen and loudspeaker of that terminal, the WIFI, Bluetooth and external intelligent terminal all being optional components. The learning rod can also integrate a projection device, likewise an optional component, for playing the multimedia file. The original page image library, the page feature library and the multimedia files corresponding to the printed matter referred to in step S2 may be stored in the learning rod or in the storage space of an external server.
As can be seen from the above description, the present invention realizes recognition of the point-reading action and playing of multimedia content by means of the learning rod and the auxiliary pad. In use, the printed matter is placed squarely on the placement area of the auxiliary pad, the learning rod shoots the printed matter page image to obtain the original page information, the scale on the auxiliary pad assists in locating and identifying the finger, and finally the corresponding multimedia file is obtained based on the original page information and the finger position. For ease of recognition, the color of the auxiliary pad should differ significantly from that of the printed matter.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.