Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Unless the context clearly dictates otherwise, the elements and components of the present invention may be present in single or multiple forms and are not limited thereto. Although the steps in the present invention are numbered with reference numbers, the order of the steps is not limited; the relative order of the steps may be adjusted unless the order is explicitly stated or the execution of a certain step requires other steps. It is to be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
As shown in FIG. 1, in one embodiment, there is provided a product classification method including:
step 102, extracting product text features according to product texts for describing products to be classified.
The product text refers to a text for describing a product to be classified, and includes characters, symbols, numbers and the like. The product text may be stored in a product text document such that one product text corresponds to one product text document.
Specifically, the process of extracting the product text features is a process of quantifying the feature words extracted from the product text so as to represent the product text, thereby converting the original unstructured product text into structured information that a computer can identify, process, and use for classification. The product text features can be extracted from the product text using existing text feature extraction methods, such as Principal Component Analysis (PCA), the simulated annealing algorithm (SA), and so on.
Step 104, extracting product image features according to the product image of the product to be classified.
The product image refers to an image containing the product to be classified. Color features (such as color histograms), texture features, or shape features of the product image may be extracted as the product image features.
Step 106, generating product features of the product to be classified according to the product text features and the product image features.
After the product text features and the product image features are obtained, they can be spliced to obtain the product features of the product to be classified. Specifically, the vector representing the product text features and the vector representing the product image features can be concatenated to form a single vector representing the product features, thereby realizing the feature splicing.
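As a minimal illustration of the splicing described above (the feature values and dimensions are made-up examples, not taken from the embodiment), the operation is a plain vector concatenation:

```python
# Hypothetical example values: a 3-dimensional text feature vector and a
# 3-dimensional image feature vector for one product to be classified.
text_features = [0.12, 0.40, 0.03]   # e.g. feature-word weights
image_features = [5.0, 2.0, 7.0]     # e.g. histogram bin counts

# Splicing the two feature vectors yields the product feature vector.
product_features = text_features + image_features
```

In practice the text vector has one dimension per feature word and the image vector one dimension per cluster center, so the spliced vector is much longer.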
Step 108, inputting the product features of the product to be classified into a product classification model obtained by pre-training to obtain a classification result.
Before classifying the products to be classified, a product classification model is obtained in advance by training on a training sample set. During classification, the product features of the product to be classified are input into the trained product classification model to obtain a classification result.
In one embodiment, the training sample set includes a plurality of product samples corresponding to preset categories, each product sample corresponding to a sample text and sample images used for describing it. In this way, the training sample set includes product samples of the preset categories, each preset category corresponding to a plurality of product samples. Each product sample corresponds to one sample text and at least one sample image.
As shown in FIG. 2, the product classification method further includes a step of training to obtain the product classification model, which includes steps 202 to 208:
step 202, extracting sample text features according to sample texts of the product samples in the training sample set.
Specifically, sample text features may be extracted from sample texts corresponding to individual product samples in the training sample set using the same means as extracting product text features from product texts of products to be classified. Sample text features can be extracted from the sample text using existing text feature extraction methods, such as Principal Component Analysis (PCA), simulated annealing algorithm (SA), and the like.
Step 204, extracting sample image features according to the sample images of the product samples in the training sample set.
Specifically, sample image features are extracted from sample images corresponding to individual product samples in a training sample set using the same means as extracting product image features from product images of products to be classified. Color features (such as color histograms), texture features, or shape features, etc. of the respective sample images may be extracted as the sample image features.
Step 206, generating sample features according to the sample text features and the sample image features.
After the sample text features and the sample image features are obtained, they can be spliced to obtain the sample features of the product samples. Specifically, the vector representing the sample text features and the vector representing the sample image features may be concatenated to form a single vector representing the sample features, thereby realizing the feature splicing.
Step 208, training according to the sample features to obtain a product classification model based on the support vector machine.
In this embodiment, a Support Vector Machine (SVM) method is adopted to train and obtain the product classification model. The basic idea of the support vector machine method is to establish a hyperplane or a series of hyperplanes in a high-dimensional space such that the distance between the hyperplane and the nearest training samples is maximized. The product classification model based on the support vector machine can also be obtained using existing support vector machine training methods. An important task in the SVM method is the selection of the kernel function. When the sample features contain heterogeneous information, the sample size is large, the multidimensional data are irregular, or the data are unevenly distributed in the high-order feature space, mapping all samples with a single kernel is unreasonable; instead, multiple kernel functions need to be combined, which is the multiple kernel learning method.
There are many methods for combining kernels. In this embodiment, a sparsity-based multiple kernel learning method (UFO-MKL) is adopted; the improved sparsity can reduce redundancy and improve operation efficiency in some cases.
Specifically, let the obtained sample feature be $x \in X$, and let the preset category be $y \in Y = \{1, 2, \ldots, F\}$, where $F$ is the total number of preset categories. A classification function $f_j$, $j = 1, \ldots, F$, is defined, where $f_j$ is the function corresponding to the $j$-th preset category.

Define $\bar{w} = (w_1, w_2, \ldots, w_F)$, where $w_j$ is the hyperplane coefficient corresponding to $f_j$. Its norm is defined as shown in formula (1), where $\|\cdot\|_p$ denotes the $p$-norm of a vector:

$$\|\bar{w}\|_{2,p} = \left\| \left( \|w_1\|_2, \|w_2\|_2, \ldots, \|w_F\|_2 \right) \right\|_p \quad (1)$$

The training of the multi-kernel product classification model may be defined as the optimization problem of formula (2):

$$\min_{\bar{w}} \; \frac{\lambda}{2} \|\bar{w}\|_{2,p}^{2} + \frac{1}{N} \sum_{t=1}^{N} \ell(\bar{w}; x_t, y_t) \quad (2)$$

where $\frac{\lambda}{2}\|\bar{w}\|_{2,p}^{2}$ is the coefficient regularization term, $\frac{1}{N}\sum_{t=1}^{N}\ell(\bar{w}; x_t, y_t)$ is the classification error loss cost term, $N$ is the number of product samples in the training set, $\lambda$ and $\alpha$ are factor coefficients, and $p = 2\log F/(2\log F - 1)$ is the norm factor.

$\ell$ is a commonly used cost function and $\partial\ell$ is its sub-gradient; the coefficient solving algorithm based on UFO-MKL is then as in steps 11) to 18):
11) initialize the factor coefficients $\lambda$, $\alpha$ and the number of iteration cycles $T$;
12) initialize the auxiliary variable $\bar{\theta} = 0$ and set the variable $q = 2\log F$;
13) for $t = 1, 2, \ldots, T$ do;
14) randomly draw a training sample $(x_t, y_t)$;
15) update the variable $\bar{\theta}$ according to the sub-gradient $\partial\ell(\bar{w}; x_t, y_t)$ of the loss function;
16) calculate the intermediate quantity used for the coefficient update;
17) update the coefficients $w_j$;
18) end for.
Steps 13) to 18) indicate that $t$ takes the values $1, 2, \ldots, T$ in turn and steps 14) to 17) are executed repeatedly; when $t = T$, the loop stops after the final update of the coefficients $w_j$ and the algorithm ends. After the hyperplane coefficients $w_j$ corresponding to $f_j$ are obtained, a series of hyperplanes in a high-dimensional space can be established, thereby obtaining the product classification model.
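The loop in steps 13) to 18) follows the familiar stochastic sub-gradient pattern. The sketch below is a deliberately simplified single-kernel, linear variant (a Pegasos-style hinge-loss update), not the UFO-MKL algorithm itself; the function name, data layout, and default parameters are illustrative assumptions:

```python
import random

def stochastic_svm_train(samples, lam=0.01, T=2000, seed=0):
    """Simplified single-kernel analogue of steps 11)-18): repeatedly draw a
    random sample, take a sub-gradient step on the hinge loss, and update the
    hyperplane coefficients w.  `samples` is a list of (x, y) pairs with
    x a list of floats and y in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])                    # 11)-12) initialization
    for t in range(1, T + 1):                         # 13) for t = 1..T
        x, y = samples[rng.randrange(len(samples))]   # 14) random sample
        eta = 1.0 / (lam * t)                         # decaying step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        w = [(1.0 - eta * lam) * wi for wi in w]      # regularization shrink
        if margin < 1.0:                              # hinge sub-gradient active
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]  # 17) update w
    return w                                          # 18) end for
```

With linearly separable toy data the returned coefficients separate the two classes; in the multi-kernel setting the same loop instead updates one coefficient block $w_j$ per kernel under the $(2, p)$-norm regularizer.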
In this embodiment, through steps 202 to 208, a training method for training a product classification model based on a support vector machine is provided, and the calculation efficiency can be improved by using the training method.
In the product classification method, through steps 102 to 108, the product text features and the product image features of the product to be classified are extracted, and the product features are then generated from them, so that classification based on the product features yields the classification result. Because the text features and the image features of the product to be classified are considered together, the classification accuracy is improved compared with classification based on the product text information alone; and since the improved classification accuracy makes automatic classification of products feasible, labor costs in the product classification process can be saved.
In one embodiment, the sample texts are correspondingly stored in sample documents, with the sample texts corresponding to the sample documents one to one. As shown in FIG. 3, step 102 specifically includes steps 302 to 308.
Step 302, performing word segmentation on the product text to obtain candidate words.
Word segmentation is the process of dividing a text sequence into individual words or phrases. Specifically, in one embodiment, the candidate words may be obtained by performing Chinese word segmentation on the product text using the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), which is based on a multi-layer hidden Markov model and was developed by the Institute of Computing Technology of the Chinese Academy of Sciences. Its word segmentation precision reaches 98.45%.
Step 304, screening out product feature words from the candidate words according to a preset evaluation function.
Specifically, each feature in the feature set is evaluated and scored by constructing an evaluation function, so that each candidate word obtains an evaluation value, also called a weight. All features are then sorted by weight, and a preset number of the best features are extracted as the feature subset constituting the extraction result.
In one embodiment, step 304 specifically includes at least one of steps 21) to 25), and preferably includes all of steps 21) to 25):
21) Calculate the number of occurrences of the candidate words in the sample documents, and take candidate words whose occurrence count is greater than or equal to a count threshold as product feature words.
In step 21), the preset evaluation function is the Term Frequency (TF) function. Specifically, all candidate words are traversed and the number of occurrences of each candidate word in the sample documents is obtained. A count threshold (for example, 10) is set; candidate words whose occurrence count is below the threshold contribute little to classification and are deleted, and candidate words whose count is greater than or equal to the threshold are selected as product feature words.
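A minimal sketch of step 21), with a toy word stream and a low threshold standing in for the example threshold of 10:

```python
from collections import Counter

def select_by_term_frequency(candidate_words, threshold):
    """Keep candidate words whose occurrence count across the sample
    documents is greater than or equal to the count threshold."""
    counts = Counter(candidate_words)
    return {word for word, count in counts.items() if count >= threshold}

# Toy candidate-word stream (threshold of 2 instead of the example 10).
words = ["camera", "lens", "camera", "tripod", "camera", "lens"]
selected = select_by_term_frequency(words, 2)  # {"camera", "lens"}
```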
22) Calculate the proportion of sample documents containing the candidate words to the total number of sample documents, and take candidate words whose proportion falls within a preset range as product feature words.
Specifically, the document frequency $P$ of each candidate word is first calculated according to formula (3), i.e. the proportion of sample documents containing the candidate word to the total number of sample documents. Formula (3) is a preset evaluation function called the Document Frequency (DF) function:

$$P = \frac{n_k}{n} \quad (3)$$

where $n_k$ is the number of sample documents containing the candidate word and $n$ is the total number of sample documents.
A preset range is set, such as (0.005, 0.08), and candidate words whose proportion falls within the preset range are screened out as product feature words.
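Formula (3) can be sketched as follows, with toy documents represented as sets of candidate words (the document contents are made-up examples):

```python
def document_frequency(word, documents):
    """Formula (3): fraction of sample documents that contain the word."""
    containing = sum(1 for doc in documents if word in doc)
    return containing / len(documents)

docs = [{"phone", "screen"}, {"phone", "case"}, {"tablet", "screen"}, {"case"}]
p = document_frequency("phone", docs)  # 2 of 4 documents contain "phone"
```

A candidate word is then kept when `p` lies inside the preset range, e.g. `0.005 < p < 0.08` in the embodiment's example.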
23) Calculate the information gain weight of the candidate words, and take candidate words whose information gain weight is greater than an information gain weight threshold as product feature words.
Specifically, the information gain weight $IG(k)$ of each candidate word $k$ is first calculated according to formula (4). Formula (4) is a preset evaluation function called the Information Gain (IG) function:

$$IG(k) = -\sum_{i=1}^{F} P(y_i)\log P(y_i) + P(k)\sum_{i=1}^{F} P(y_i \mid k)\log P(y_i \mid k) + P(\bar{k})\sum_{i=1}^{F} P(y_i \mid \bar{k})\log P(y_i \mid \bar{k}) \quad (4)$$

where $k$ denotes the $k$-th candidate word, $y_i$ denotes a preset category, $F$ denotes the number of preset categories, $P(y_i)$ denotes the probability that a sample document corresponding to category $y_i$ appears in the sample document set (the set of all sample documents), $P(k)$ denotes the probability that a sample document containing candidate word $k$ appears in the sample document set, $P(y_i \mid k)$ denotes the conditional probability that a sample document belongs to category $y_i$ given that it contains candidate word $k$, and $P(y_i \mid \bar{k})$ denotes the conditional probability that a sample document belongs to category $y_i$ given that it does not contain candidate word $k$.
An information gain weight threshold is set, such as 0.006. After the information gain weight of each candidate word is obtained, candidate words whose information gain weight is greater than the threshold are selected as product feature words.
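Formula (4) can be sketched in terms of entropies, taking the required probabilities directly as inputs (the estimation of those probabilities from document counts is omitted here):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability list, skipping zero terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_gain(p_y, p_k, p_y_given_k, p_y_given_not_k):
    """Formula (4): class entropy minus the entropy remaining once it is
    known whether a document contains candidate word k.
    p_y: class priors P(y_i); p_k: P(k); the last two lists give
    P(y_i | k) and P(y_i | not k)."""
    return (entropy(p_y)
            - p_k * entropy(p_y_given_k)
            - (1 - p_k) * entropy(p_y_given_not_k))
```

If the word is independent of the categories, the conditional distributions equal the prior and the gain is zero; a word that perfectly predicts the category attains the full class entropy.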
24) Calculate the mutual information value of the candidate words, and take candidate words whose mutual information value is greater than a mutual information value threshold as product feature words.
Specifically, the mutual information value $MI(k, y_i)$ between each candidate word $k$ and each category $y_i$ is first calculated according to formula (5):

$$MI(k, y_i) = \log \frac{P(k, y_i)}{P(k)\,P(y_i)} \quad (5)$$

Formula (5) can also be expressed as formula (6):

$$MI(k, y_i) = \log P(k \mid y_i) - \log P(k) \quad (6)$$

where $P(k, y_i)$ is the probability that a sample document containing candidate word $k$ and belonging to preset category $y_i$ appears in the sample document set, $P(k)$ is the probability that candidate word $k$ appears in the entire training sample set, $P(y_i)$ is the probability that a sample document of category $y_i$ appears in the entire sample document set, and $P(k \mid y_i)$ is the conditional probability that candidate word $k$ appears in a sample document of category $y_i$. Formula (5) or (6) is a preset evaluation function called the Mutual Information (MI) function.
A mutual information value threshold is set, such as 1.54, and candidate words whose mutual information value is greater than the threshold are selected as product feature words.
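Formula (6) can be sketched directly, with the two probabilities as inputs (the probability values below are made-up examples):

```python
import math

def mutual_information(p_k_given_y, p_k):
    """Formula (6): MI(k, y_i) = log P(k | y_i) - log P(k)."""
    return math.log(p_k_given_y) - math.log(p_k)

# A word appearing in 60% of category-y_i documents but only 20% of all
# documents carries positive mutual information for that category.
mi = mutual_information(0.6, 0.2)
```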
25) Calculate the degree of correlation between the candidate words and the preset categories according to the probabilities of whether the candidate words appear in the training sample set and whether the documents belong to the preset categories, and take candidate words whose degree of correlation is greater than a correlation threshold as product feature words.
Specifically, the degree of correlation $CHI(k, y_i)$ between each candidate word $k$ and each preset category $y_i$ is first calculated according to formula (7). Formula (7) is a preset evaluation function called the chi-square (CHI) function:

$$CHI(k, y_i) = \frac{n\,\big(P(k, y_i)\,P(\bar{k}, \bar{y}_i) - P(k, \bar{y}_i)\,P(\bar{k}, y_i)\big)^2}{P(k)\,P(\bar{k})\,P(y_i)\,P(\bar{y}_i)} \quad (7)$$

where $n$ is the total number of sample documents in the training sample set, $P(k, y_i)$ is the probability of a sample document that contains candidate word $k$ and belongs to preset category $y_i$, $P(\bar{k}, \bar{y}_i)$ is the probability of a sample document that does not contain candidate word $k$ and does not belong to preset category $y_i$, $P(k, \bar{y}_i)$ is the probability of a sample document that contains candidate word $k$ but does not belong to preset category $y_i$, and $P(\bar{k}, y_i)$ is the probability of a sample document that does not contain candidate word $k$ but belongs to preset category $y_i$.
A correlation threshold is set, for example 10, and candidate words whose degree of correlation is greater than the threshold are screened out as product feature words.
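Formula (7) can be sketched with the joint and marginal probabilities as inputs (the parameter names and example values are illustrative assumptions):

```python
def chi_square(n, p_ky, p_nk_ny, p_k_ny, p_nk_y, p_k, p_y):
    """Formula (7): chi-square correlation between candidate word k and
    category y_i.  p_ky = P(k, y_i), p_nk_ny = P(not k, not y_i),
    p_k_ny = P(k, not y_i), p_nk_y = P(not k, y_i); p_k, p_y are the
    marginals P(k) and P(y_i); n is the total number of sample documents."""
    numerator = n * (p_ky * p_nk_ny - p_k_ny * p_nk_y) ** 2
    denominator = p_k * (1 - p_k) * p_y * (1 - p_y)
    return numerator / denominator
```

When the word and the category are independent, the joint probabilities factorize, the numerator vanishes, and the statistic is zero; strong association makes it large.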
Through all of steps 21) to 25), five groups of product feature words can be generated and five product text features correspondingly generated; this can obviously improve the ability of the product text features to describe the product to be classified and thus improve the classification accuracy.
In one embodiment, step 304 is preceded by: filtering out candidate words contained in a preset stop word list. Some characters or words that can interfere with classification may exist among the candidate words, such as modal particles and auxiliary words. Therefore, a stop word list is preset and the characters or words that can interfere with classification are added to it; filtering out candidate words contained in the preset stop word list avoids unnecessary calculation and saves the time required for product classification.
Step 306, calculating the product feature word weights according to the frequency with which the product feature words appear in the sample documents, the total number of sample documents, and the number of sample documents containing the product feature words.
Specifically, after the product feature words are screened out through steps 21) to 25) (each step producing one group of product feature words), the product feature word weight $W_i$ of each product feature word in each group is calculated according to formula (8):

$$W_i = TF_i(\gamma, d) \times n / DF(\gamma) \quad (8)$$

where $W_i$ is the weight of the $i$-th product feature word, $TF_i(\gamma, d)$ is the frequency with which product feature word $\gamma$ appears in sample document $d$, $n$ is the total number of sample documents, and $DF(\gamma)$ is the number of sample documents containing product feature word $\gamma$.
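Formula (8) is a TF-IDF-style weighting and can be sketched as follows (the example counts are made up):

```python
def feature_word_weight(tf, n_docs, df):
    """Formula (8): W_i = TF_i(gamma, d) * n / DF(gamma) -- the in-document
    term frequency, scaled up when the word is rare across sample documents."""
    return tf * n_docs / df

# A word appearing 3 times in a document, with 100 sample documents of
# which 10 contain the word:
w = feature_word_weight(3, 100, 10)  # 30.0
```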
Step 308, generating the product text features of the product to be classified according to the product feature word weights.
Specifically, after the product feature word weight of each product feature word obtained in each of steps 21) to 25) is calculated according to formula (8), the product text can be converted into a vector with the product feature words as dimensions, the attribute value of each dimension being the weight of the corresponding product feature word. Each of steps 21) to 25) yields one vector, i.e. one product text feature. For one product text, five vectors, i.e. five product text features, are thus obtained from steps 21) to 25), giving the product text features of the product to be classified. Adopting these five product text features can improve the accuracy of product classification.
In this embodiment, through the steps 302 to 308, the product text features that can accurately represent the product text are extracted from the product text of the product to be classified, which is beneficial to correctly classifying the product to be classified.
As shown in FIG. 4, in one embodiment, step 104 includes steps 402-408:
step 402, segmenting a plurality of image patches of the same size from the product image of the product to be classified, with an overlapping portion between adjacent image patches.
Specifically, one product to be classified corresponds to at least one product image, and image patches of width 16 pixels and height 16 pixels are cut densely from each image. As shown in FIG. 5, during cutting, the cutting start point is moved in steps of 8 pixels in the horizontal and vertical directions of the product image, so that adjacent image patches overlap and each picture is cut into many image patches.
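The dense cutting in step 402 can be sketched as enumerating top-left patch corners with the 16-pixel patch size and 8-pixel stride (the function name is an illustrative assumption):

```python
def patch_corners(width, height, patch=16, stride=8):
    """Top-left corners of the overlapping 16x16 image patches cut from a
    product image by moving the cutting start point in steps of 8 pixels."""
    return [(x, y)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]

# A 32x32 image yields a 3x3 grid of overlapping patches.
corners = patch_corners(32, 32)
```

Because the stride (8) is half the patch size (16), every adjacent pair of patches shares an 8-pixel-wide overlap, as described above.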
Step 404, extracting gradient histogram features of the image patches.
Specifically, step 404 includes steps 31) to 32):
step 31), dividing each image patch into a plurality of image cells of the same size without overlapping.
Specifically, as shown in FIG. 6, each image patch is divided into 4 equal parts in the horizontal and vertical directions respectively, obtaining 16 image cells $C_i$, $i = 1, 2, \ldots, 16$.
Step 32), counting the gradient histogram features in 8 directions on each image cell, and splicing the gradient histogram features of the image cells corresponding to each image patch to obtain the gradient histogram feature of each image patch.
Specifically, the gradient value $M(a, b)$ and direction $\beta(a, b)$ of each pixel in each image cell $C_i$ are first calculated according to formulas (9) and (10):

$$M(a, b) = \sqrt{\big(I(a+1, b) - I(a-1, b)\big)^2 + \big(I(a, b+1) - I(a, b-1)\big)^2} \quad (9)$$

$$\beta(a, b) = \arctan\frac{I(a, b+1) - I(a, b-1)}{I(a+1, b) - I(a-1, b)} \quad (10)$$

where $I(a, b)$ denotes the grey value of the pixel at $(a, b)$, $M(a, b)$ is the gradient value of each pixel in image cell $C_i$, $\beta(a, b)$ is the direction of each pixel in image cell $C_i$, and $a, b$ are the abscissa and ordinate of each pixel in image cell $C_i$.

Then, according to the direction $\beta(a, b)$ of each pixel in each image cell $C_i$, the gradient value $M(a, b)$ of each pixel is accumulated into a vector $h_i$, $i = 1, 2, \ldots, 16$, thereby obtaining the gradient histogram feature $h_i$ of the image cell. The gradient histogram features $h_i$ of the image cells corresponding to each image patch are then spliced to obtain the gradient direction histogram feature $feat = (h_1, h_2, \ldots, h_{16})$ of the image patch, where $feat$ is a 128-dimensional feature vector.
In this embodiment, through steps 31) to 32), the gradient histogram feature of the image patch is extracted, so as to generate the product image feature according to the gradient histogram feature of the image patch.
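The per-cell computation of steps 31) to 32) can be sketched as below, using central differences over interior pixels as one reasonable reading of formulas (9) and (10); the 128-dimensional patch feature is then the concatenation of the 16 cell histograms:

```python
import math

def cell_gradient_histogram(cell, bins=8):
    """8-direction gradient histogram of one image cell (a 2-D list of grey
    values): each interior pixel's gradient magnitude M(a, b) is accumulated
    into the bin selected by its direction beta(a, b)."""
    hist = [0.0] * bins
    for a in range(1, len(cell) - 1):
        for b in range(1, len(cell[0]) - 1):
            gx = cell[a][b + 1] - cell[a][b - 1]
            gy = cell[a + 1][b] - cell[a - 1][b]
            magnitude = math.hypot(gx, gy)                   # formula (9)
            direction = math.atan2(gy, gx) % (2 * math.pi)   # formula (10)
            hist[int(direction / (2 * math.pi) * bins) % bins] += magnitude
    return hist

def patch_feature(cells):
    """Splice the 16 cell histograms into one 128-dimensional vector."""
    return [v for cell in cells for v in cell_gradient_histogram(cell)]
```

For a flat cell every gradient is zero and the histogram is empty; 16 cells of 8 bins each give the 128-dimensional feature described above.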
Step 406, calculating the Euclidean distance between the gradient histogram feature of each image patch and each cluster center in the cluster center set obtained by pre-learning, and recording, for each image patch, the cluster center with the smallest Euclidean distance.
Step 408, generating the product image features according to the counting result of the cluster centers recorded for the gradient histogram features of the image patches.
First, the cluster center set needs to be obtained in advance by learning through steps 41) to 44):
Step 41), selecting a preset selection number of product samples for each preset category from the training sample set.
Specifically, from the product samples corresponding to the preset categories in the training sample set, a preset selection number of product samples is selected for each preset category. For example, if the training sample set has F preset categories and M product samples are selected for each category, M × F product samples are obtained in total.
Step 42), segmenting the product sample images corresponding to the selected product samples into a plurality of image sub-blocks of the same size, with overlapping portions between adjacent image sub-blocks.
Step 43), extracting the gradient histogram features of the image sub-blocks.
Steps 42) to 43), in which image sub-blocks are segmented from the product sample images corresponding to the selected product samples and their gradient histogram features are extracted, are basically the same as steps 402 to 404 above, in which image patches are segmented from the product image of the product to be classified and their gradient histogram features are extracted; only the processing objects differ, so the details are not repeated here.
Step 44), clustering the gradient histogram features of the image sub-blocks into a preset number of cluster centers to obtain the cluster center set.
Specifically, through steps 41) to 43), a gradient histogram feature set $FEAT = \{feat_1, feat_2, \ldots, feat_m\}$ of the image sub-blocks is obtained, where $m$ is the total number of image sub-blocks. The preset number of cluster centers is 1024; the feature set $FEAT$ is clustered using the k-means clustering algorithm to obtain 1024 cluster centers, denoted $Dict = \{d_1, d_2, \ldots, d_{1024}\}$, where $Dict$ is the cluster center set obtained by learning.
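A minimal k-means sketch of step 44) (tiny 2-D points and k = 2 standing in for the 128-dimensional gradient histogram features and 1024 centers):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster feature vectors into k cluster centers (Lloyd's algorithm):
    alternately assign each point to its nearest center and recompute each
    center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign to nearest center
            j = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        for i, members in enumerate(clusters):   # recompute centers
            if members:
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers
```

In the embodiment the points are the sub-block features $feat_1, \ldots, feat_m$ and the returned centers form the dictionary $Dict$.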
Through step 404, the set of gradient histogram features $Feat = \{feat_1, feat_2, \ldots, feat_s\}$ of the image patches of the product to be classified is obtained, where $s$ is the total number of image patches.
In steps 406 to 408, specifically, an all-zero vector $R = [r_1, r_2, \ldots, r_{1024}]$ with the same length as the number of elements in the cluster center set is first initialized for the product image. For each gradient histogram feature $feat_i$ in the set $Feat$ of gradient histogram features of the image patches of the product to be classified, the Euclidean distance to each cluster center point in the cluster center set $Dict$ is calculated according to formula (11), and the cluster center point with the smallest Euclidean distance to $feat_i$ is recorded:

$$minIdx = \arg\min_{j} \| feat_i - d_j \|_2 \quad (11)$$

where $minIdx$ denotes the position of the cluster center point with the smallest Euclidean distance to gradient histogram feature $feat_i$, and $d_j \in Dict$.
Then, the recorded cluster center points are counted according to formula (12):

$$r_{minIdx} = r_{minIdx} + 1 \quad (12)$$
The operations in steps 406 to 408 are equivalent to voting on the cluster center set $Dict$ with the gradient histogram feature $feat_i$ of each image patch. The finally obtained vector $R = [r_1, r_2, \ldots, r_{1024}]$ is the generated product image feature.
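Steps 406 to 408 can be sketched as nearest-center voting (a tiny two-entry dictionary and 2-D features standing in for the 1024-entry Dict and the 128-dimensional features):

```python
def vote_image_feature(patch_features, dictionary):
    """Formulas (11) and (12): for each patch feature, find the cluster
    center at minimum Euclidean distance (squared distance gives the same
    argmin) and increment its vote r_minIdx."""
    votes = [0] * len(dictionary)
    for feat in patch_features:
        min_idx = min(range(len(dictionary)), key=lambda j: sum(
            (a - b) ** 2 for a, b in zip(feat, dictionary[j])))
        votes[min_idx] += 1
    return votes

dictionary = [(0.0, 0.0), (10.0, 10.0)]
feats = [(1.0, 1.0), (9.0, 9.0), (0.0, 2.0)]
image_feature = vote_image_feature(feats, dictionary)  # [2, 1]
```

The resulting vote vector is the bag-of-visual-words representation $R$ used as the product image feature.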
As shown in FIG. 7, in one embodiment, step 202 includes steps 702-708:
Step 702, performing word segmentation on the sample text to obtain words to be selected.
Step 704, selecting sample characteristic words from the to-be-selected words according to a preset evaluation function.
In one embodiment, step 704 specifically includes at least one of step 51) to step 55), and preferably includes all of steps 51) to step 55):
Step 51), calculating the number of occurrences of the words to be selected in the sample documents, and taking words to be selected whose occurrence count is greater than or equal to a count threshold as sample feature words.
Specifically, all the to-be-selected words are traversed, the occurrence frequency of each to-be-selected word in the sample document is calculated, a frequency threshold (such as 10) is set, the to-be-selected words which have small contribution to classification and the occurrence frequency of which is less than the frequency threshold are deleted, and the to-be-selected words which are greater than or equal to the frequency threshold are selected as sample characteristic words.
And step 52), calculating the proportion of the sample documents containing the words to be selected in the total number of the sample documents, and taking the words to be selected with the corresponding proportion in a preset range as sample characteristic words.
Specifically, the document frequency of each word to be selected is first calculated according to formula (3), that is, the proportion of sample documents containing the word to be selected to the total number of sample documents. A preset range is set, such as (0.005, 0.08), and words to be selected whose proportion falls within the preset range are screened out as sample feature words.
And step 53), calculating the information gain weight of the word to be selected, and taking the word to be selected with the corresponding information gain weight larger than the threshold value of the information gain weight as a sample characteristic word.
Specifically, an information gain weight threshold is set, such as 0.006. The information gain weight of each word to be selected is calculated according to formula (4), and words to be selected whose information gain weight is greater than the threshold are taken as sample feature words.
And 54) calculating mutual information values of the to-be-selected words, and taking the to-be-selected words with the corresponding mutual information values larger than the mutual information value threshold value as sample characteristic words.
Specifically, a mutual information value threshold is set, such as 1.54. The mutual information value of each word to be selected is calculated according to formula (5) or (6), and words to be selected whose mutual information value is greater than the threshold are taken as sample feature words.
And 55), calculating the correlation degree of the word to be selected and the preset category according to the probability of whether the word to be selected appears in the training sample set and whether the word to be selected belongs to the preset category, and taking the word to be selected with the corresponding correlation degree larger than the threshold value of the correlation degree as the sample characteristic word.
Specifically, a correlation threshold is set, such as 10. The degree of correlation of each word to be selected is calculated through formula (7) according to the probabilities of whether the word to be selected appears in the training sample set and whether the document belongs to the preset category, and words to be selected whose degree of correlation is greater than the threshold are taken as sample feature words.
Through all of steps 51) to 55), five groups of sample feature words can be generated and five sample text features correspondingly generated; this can obviously improve the ability of the sample text features to describe the product sample and thus improve the classification accuracy.
In one embodiment, step 704 is preceded by the step of filtering out words to be selected contained in a preset stop word list. Some characters or words that can interfere with classification may exist among the words to be selected, such as modal particles and auxiliary words. Therefore, a stop word list is preset and the characters or words that can interfere with classification are added to it; filtering out words to be selected contained in the preset stop word list avoids unnecessary calculation and saves the time required for product classification.
Step 706, calculating the weight of the sample feature words according to the frequency of the sample feature words appearing in the sample documents, the total number of the sample documents and the number of the sample documents containing the sample feature words.
Specifically, after the sample feature words are screened out through steps 51) to 55) (each step generating one group of sample feature words), the sample feature word weight of each group of sample feature words is calculated according to formula (8).
Step 708, generating sample text features of the product sample according to the sample feature word weights.
Specifically, after the sample feature word weights are calculated, each sample text can be converted into a vector whose dimensions are the sample feature words and whose attribute value in each dimension is the weight of the corresponding sample feature word. Steps 51) to 55) thus yield five sample text features, and using all five improves the classification accuracy of the trained classification model.
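A weight computed from term frequency, the total document count, and the number of documents containing the word is the classic TF-IDF scheme. Formula (8) is not reproduced in the text, so the sketch below assumes plain TF-IDF (smoothed variants exist); the toy documents and feature words are invented for illustration.

```python
import math

def tfidf_weight(tf, n_docs, df):
    """Weight of a feature word: its frequency in the document, damped by
    how many of the n_docs sample documents contain it. Assumed stand-in
    for formula (8), which the text does not reproduce."""
    return tf * math.log(n_docs / df)

def text_feature(doc_words, feature_words, n_docs, df):
    """Convert one sample text into a vector whose dimensions are the
    sample feature words and whose values are their weights."""
    return [tfidf_weight(doc_words.count(w), n_docs, df[w])
            for w in feature_words]

# Hypothetical mini-corpus of segmented sample documents.
docs = [["red", "cotton", "shirt"], ["blue", "shirt"], ["wool", "coat"]]
features = ["shirt", "coat"]                      # screened feature words
df = {w: sum(w in d for d in docs) for w in features}
vec = text_feature(docs[0], features, len(docs), df)
print(len(vec) == len(features))  # one dimension per sample feature word
```

Repeating this for each of the five groups of sample feature words produces the five one-dimensional text feature vectors described above.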
As shown in FIG. 8, in one embodiment, step 204 includes:
step 802, a plurality of small image blocks with the same size are segmented from the sample images of the product samples in the training sample set, and an overlapping portion exists between adjacent small image blocks.
Specifically, one product sample corresponds to at least one sample image, and small image blocks 16 pixels wide and 16 pixels long are cut densely across each image. As shown in fig. 5, during cutting the start point is moved in steps of 8 pixels in both the horizontal and vertical directions of the sample image, so that adjacent small image blocks overlap and each picture is cut into many small image blocks. The process of partitioning into small image blocks is basically the same as the processes of partitioning into image small blocks and image sub-blocks, the only difference being the source image that is partitioned.
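The dense 16×16 cutting with a stride of 8 pixels can be sketched as follows; the 64×64 image size is a hypothetical example, not taken from the text.

```python
def cut_patches(height, width, patch=16, stride=8):
    """Top-left corners of densely cut patch x patch small image blocks.
    Because stride < patch, adjacent blocks overlap by patch - stride
    pixels in each direction, as described for step 802."""
    return [(y, x)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]

corners = cut_patches(64, 64)
print(len(corners))  # 7 start positions per axis on a 64x64 image -> 49 blocks
```

Each corner indexes one small image block; a larger product photograph yields correspondingly many blocks.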
And step 804, extracting the gradient histogram characteristics of the small image block.
Specifically, step 804 includes steps 61) to 62):
step 61), dividing each small image block into a plurality of subunits with the same size and without overlapping.
Specifically, as shown in fig. 6, each small image block is divided into 4 equal parts in the horizontal and vertical directions, giving 16 subunits. The process of dividing into subunits is basically the same as the process of dividing into image units, the only difference being the object processed, and is not described again.
And 62), counting the gradient histogram features of 8 directions on each subunit, and splicing the gradient histogram features of the subunits corresponding to each small image block to obtain the gradient histogram features of each small image block.
Specifically, the gradient value and direction of each pixel point in each subunit are first calculated according to the above formula (9) and formula (10). And then, accumulating the gradient value of each pixel point in each subunit to a corresponding position in a vector according to the direction of each pixel point in each subunit, thereby obtaining the gradient histogram characteristics of the subunits. And then the gradient histogram characteristics of the subunits of each small image block are spliced together to obtain the gradient histogram characteristics of the small image block.
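Steps 61) and 62) can be sketched as below. Formulas (9) and (10) are not reproduced in the text, so central differences with zero padding at the borders are assumed for the gradient, and the 8 direction bins are assumed to partition the full circle evenly.

```python
import math

def patch_hog(patch):
    """Gradient histogram feature of one 16x16 small image block (a list of
    16 rows of 16 gray values): split it into a 4x4 grid of 4x4 subunits,
    accumulate each pixel's gradient magnitude into one of 8 direction bins
    per subunit, then concatenate the 16 sub-histograms."""
    h = w = 16
    def px(y, x):  # zero padding outside the block (an assumption)
        return patch[y][x] if 0 <= y < h and 0 <= x < w else 0.0
    feature = []
    for cy in range(4):            # subunit grid row
        for cx in range(4):        # subunit grid column
            hist = [0.0] * 8
            for y in range(cy * 4, cy * 4 + 4):
                for x in range(cx * 4, cx * 4 + 4):
                    gx = px(y, x + 1) - px(y, x - 1)   # horizontal gradient
                    gy = px(y + 1, x) - px(y - 1, x)   # vertical gradient
                    mag = math.hypot(gx, gy)           # gradient value
                    ang = math.atan2(gy, gx) % (2 * math.pi)
                    hist[min(int(ang / (math.pi / 4)), 7)] += mag
            feature.extend(hist)
    return feature

print(len(patch_hog([[0.0] * 16 for _ in range(16)])))  # 16 subunits x 8 bins = 128
```

Concatenating the 16 sub-histograms yields a 128-dimensional gradient histogram feature per small image block.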
Step 806, calculating the Euclidean distance between the gradient histogram feature of each small image block and each cluster center in a cluster center set obtained by pre-learning, and counting, for each small image block, the cluster center in the set closest in Euclidean distance to its gradient histogram feature.
Step 808, generating the sample image feature according to the counted cluster centers and the counting result for the gradient histogram feature of each small image block.
In steps 806 to 808, specifically, each sample image feature is initialized as an all-zero vector whose length equals the number of elements in the cluster center set. For each gradient histogram feature in the set of gradient histogram features of the small image blocks of a product sample, the Euclidean distance between that feature and each cluster center is calculated, the cluster center with the smallest Euclidean distance is found, and the corresponding position in the all-zero vector is incremented. The vector finally obtained is the generated sample image feature.
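Steps 806 to 808 amount to a bag-of-visual-words count and can be sketched as follows; the two-dimensional cluster centers and patch features are hypothetical stand-ins for the real 128-dimensional gradient histogram features.

```python
import math

def nearest_center(feature, centers):
    """Index of the cluster center with the smallest Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: math.dist(feature, centers[i]))

def image_feature(patch_features, centers):
    """Start from an all-zero vector as long as the cluster center set and,
    for each small image block's gradient histogram feature, increment the
    position of its nearest cluster center."""
    counts = [0] * len(centers)
    for f in patch_features:
        counts[nearest_center(f, centers)] += 1
    return counts

centers = [[0.0, 0.0], [10.0, 10.0]]            # hypothetical 2-D centers
patches = [[0.1, 0.2], [9.5, 9.9], [10.2, 10.0]]
print(image_feature(patches, centers))  # [1, 2]
```

The resulting count vector is the sample image feature: its length is fixed by the cluster center set, regardless of how many small image blocks the image produced.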
The principle of the product classification method is described below with a specific application scenario. Assume that the training sample set includes five categories of e-commerce product samples, such as knitwear, T-shirts, coats, pants, shirts in men's wear, with 300 products in each category. Each product sample corresponds to a sample document used for describing the product sample and at least one sample image, and all sample documents in the training sample set form a sample document set.
As shown in fig. 9, each sample document is segmented to obtain candidate words, the candidate words contained in the stop word list are filtered out, and sample feature words are screened from the candidate words according to five evaluation functions, i.e., word frequency, document frequency, information gain, mutual information and the chi-square test. The weight of each sample feature word in each group is then calculated, and from these weights five one-dimensional vectors, i.e., five sample text features, are obtained.
The sample image of each product sample is segmented into small image blocks, with adjacent small image blocks overlapping. Each small image block is then divided equally into 16 subunits, gradient histogram features in 8 directions are counted on each subunit, and the subunit features of each small image block are spliced to obtain its gradient histogram feature. The Euclidean distance between the gradient histogram feature of each small image block and each cluster center in the pre-learned cluster center set is then calculated, the nearest cluster center is found and counted, and a sample image feature, i.e., a one-dimensional vector, is generated from the counted cluster centers and counting results.
The sample text feature one-dimensional vectors and the sample image feature one-dimensional vector of each product sample are spliced to obtain the sample features of the product sample, and a support-vector-machine-based product classification model is trained from the sample features.
A product to be classified corresponds to a product document and at least one product image. The product document is segmented into candidate words, the candidate words contained in the stop word list are filtered out, and product feature words are screened from the candidate words according to the five evaluation functions of word frequency, document frequency, information gain, mutual information and the chi-square test. The weight of each product feature word in each group is then calculated, and from these weights five one-dimensional vectors, i.e., five product text features, are obtained.
The product image of the product to be classified is segmented into image small blocks, with adjacent image small blocks overlapping. Each image small block is then divided equally into 16 image units, gradient histogram features in 8 directions are counted on each image unit, and the image unit features of each image small block are spliced to obtain its gradient histogram feature. The Euclidean distance between the gradient histogram feature of each image small block and each cluster center in the pre-learned cluster center set is then calculated, the nearest cluster center is found and counted, and a one-dimensional product image feature is generated from the counted cluster centers and counting results.
The one-dimensional product text feature vectors and the one-dimensional product image feature vector are spliced to obtain the product feature. As shown in fig. 10, the product feature is input into the trained product classification model, which outputs a category label as the classification result.
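The end-to-end pipeline of this scenario, splicing text and image feature vectors and classifying with a support vector machine, can be sketched as below. scikit-learn is an assumed library choice (the text names none), and the random features, dimensions, and sample counts are stand-ins for the real TF-IDF and visual-word vectors.

```python
# Sketch of the application scenario: concatenate per-sample text and image
# feature vectors, train an SVM on the labeled samples, then classify a
# "product". All data here is synthetic; dimensions are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, text_dim, image_dim = 60, 40, 20
labels = rng.integers(0, 5, size=n_samples)       # five e-commerce categories
text_feats = rng.random((n_samples, text_dim))    # concatenated text features
image_feats = rng.random((n_samples, image_dim))  # visual-word count features

# Splice the sample text features and sample image features per sample.
samples = np.hstack([text_feats, image_feats])

model = SVC(kernel="linear").fit(samples, labels)
pred = model.predict(samples[:1])                 # classify one sample
print(pred.shape)  # (1,)
```

At prediction time the product to be classified goes through exactly the same feature extraction, so its spliced vector has the same length as the training samples'.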
As shown in fig. 11, in one embodiment, a product classification apparatus is provided that includes a product text feature extraction module 1120, a product image feature extraction module 1140, a product feature generation module 1160, and a classification module 1180.
The product text feature extraction module 1120 is configured to extract product text features from product text describing a product to be classified.
The product image feature extraction module 1140 is used to extract product image features from the product image of the product to be classified.
The product feature generation module 1160 is used for generating product features of the products to be classified according to the product text features and the product image features.
The classification module 1180 is configured to input product features of the product to be classified into a product classification model obtained through pre-training, so as to obtain a classification result.
As shown in fig. 12, in one embodiment, the training sample set includes a plurality of product samples corresponding to preset categories, the product samples corresponding to sample texts and sample images for describing the product samples; the product classification apparatus further includes a training module 1110, and the training module 1110 includes a sample text feature extraction module 1112, a sample image feature extraction module 1114, a sample feature generation module 1116, and a training execution module 1118.
The sample text feature extraction module 1112 is configured to extract sample text features from sample texts of the product samples in the training sample set.
The sample image feature extraction module 1114 is configured to extract sample image features from sample images of product samples in the training sample set.
The sample feature generation module 1116 is configured to generate sample features from the sample text features and the sample image features.
The training execution module 1118 is used for training the product classification model based on the support vector machine according to the sample features.
In one embodiment, the sample text is correspondingly stored in a sample document; as shown in fig. 13, the product text feature extraction module 1120 includes a first segmentation module 1122, a product feature word screening module 1124, a product feature word weight calculation module 1126, and a product text feature generation module 1128.
The first word segmentation module 1122 is configured to segment the product text to obtain candidate words.
The product characteristic word screening module 1124 is configured to screen product characteristic words from the candidate words according to a preset evaluation function.
The product feature word weight calculation module 1126 is configured to calculate a product feature word weight according to the frequency of occurrence of the product feature words in the sample documents, the total number of the sample documents, and the number of the sample documents containing the product feature words.
The product text feature generation module 1128 is configured to generate a product text feature of the product to be classified according to the product feature word weight.
In one embodiment, the product text feature extraction module 1120 further comprises a candidate word filtering module 1123 for filtering out candidate words contained in the preset stop word list.
As shown in fig. 14, in one embodiment, the product feature word screening module 1124 includes at least one of a first screening module 1124a, a second screening module 1124b, a third screening module 1124c, a fourth screening module 1124d, and a fifth screening module 1124e.
The first screening module 1124a is configured to calculate the number of times that the candidate word appears in the sample document, and use the candidate word whose number of times of appearance is greater than or equal to a threshold number of times as the product feature word.
The second screening module 1124b is configured to calculate a proportion of sample documents containing the candidate words in the total number of the sample documents, and use candidate words whose corresponding proportion is within a preset range as product feature words.
The third filtering module 1124c is configured to calculate an information gain weight of the candidate word, and use the candidate word whose corresponding information gain weight is greater than the information gain weight threshold as the product feature word.
The fourth screening module 1124d is configured to calculate mutual information values of the candidate words, and use candidate words whose corresponding mutual information values are greater than a threshold value of the mutual information values as product feature words.
The fifth screening module 1124e is configured to calculate the correlation between each candidate word and the preset categories according to the probabilities of whether the candidate word appears in the training sample set and whether it belongs to a preset category, and to take the candidate words whose correlation is greater than the correlation threshold as product feature words.
As shown in fig. 15, in one embodiment, the product image feature extraction module 1140 includes an image patch segmentation module 1142, an image patch feature extraction module 1144, a first statistics and count module 1146, and a product image feature generation module 1148.
The image tile segmentation module 1142 is configured to segment a plurality of image tiles of the same size from a product image of a product to be classified, where there is an overlapping portion between adjacent image tiles.
The image patch feature extraction module 1144 is configured to extract gradient histogram features of the image patch.
The first statistics and counting module 1146 is configured to calculate euclidean distances between the gradient histogram feature of each image patch and each cluster center in the cluster center set obtained through pre-learning, and count a cluster center in the cluster center set closest to the euclidean distance of the gradient histogram feature of each image patch.
The product image feature generation module 1148 is configured to generate a product image feature according to the cluster center and the counting result counted by the gradient histogram feature corresponding to each image patch.
In one embodiment, the image patch feature extraction module 1144 includes an image element division module 1144a and a first feature stitching module 1144b.
The image unit dividing module 1144a is used for dividing each image small block into a plurality of image units which have the same size and do not overlap.
The first feature stitching module 1144b is configured to count gradient histogram features of 8 directions on each image unit, and stitch the gradient histogram features of the image unit corresponding to each image tile to obtain the gradient histogram feature of each image tile.
As shown in fig. 16, in one embodiment, the sample text feature extraction module 1112 comprises a second word segmentation module 1112a, a sample feature word screening module 1112c, a sample feature word weight calculation module 1112d, and a sample text feature generation module 1112e.
The second word segmentation module 1112a is configured to segment the sample text to obtain a word to be selected.
The sample feature word screening module 1112c is configured to screen a sample feature word from the to-be-selected words according to a preset evaluation function.
The sample feature word weight calculation module 1112d is configured to calculate a sample feature word weight according to the frequency of the sample feature words appearing in the sample documents, the total number of the sample documents, and the number of the sample documents containing the sample feature words.
The sample text feature generating module 1112e is configured to generate sample text features of the product samples according to the sample feature word weights.
In one embodiment, the sample text feature extraction module 1112 further comprises a candidate word filtering module 1112b for filtering out the candidate words contained in the preset stop word list.
As shown in fig. 17, in one embodiment, the sample feature word filtering module 1112c includes at least one of a filter by times module 1112c1, a filter by document weight module 1112c2, a filter by information gain weight module 1112c3, a filter by mutual information value module 1112c4, and a filter by relevance module 1112c 5.
The times screening module 1112c1 is configured to calculate the times of occurrences of the candidate word in the sample document, and use the candidate word whose occurrence times is greater than the times threshold as the sample feature word.
The document proportion screening module 1112c2 is configured to calculate a proportion of sample documents containing a word to be selected to the total number of the sample documents, and use the word to be selected whose corresponding proportion is within a preset range as a sample feature word.
The information gain weight screening module 1112c3 is configured to calculate an information gain weight of a candidate word, and use the candidate word whose corresponding information gain weight is greater than the information gain weight threshold as a sample feature word.
The mutual information value screening module 1112c4 is configured to calculate a mutual information value of the candidate word, and use the candidate word whose corresponding mutual information value is greater than the threshold value of the mutual information value as the sample feature word.
The relevance screening module 1112c5 is configured to calculate the correlation between each candidate word and the preset categories according to the probabilities of whether the candidate word appears in the training sample set and whether it belongs to a preset category, and to take the candidate words whose correlation is greater than the correlation threshold as sample feature words.
As shown in fig. 18, in one embodiment, the sample image feature extraction module 1114 includes a small image block segmentation module 1114a, a small image block feature extraction module 1114b, a second statistics and counting module 1114c, and a sample image feature generation module 1114d.
The small image block segmentation module 1114a is configured to segment a plurality of small image blocks of the same size from the sample images of the product samples in the training sample set, where there is an overlapping portion between adjacent small image blocks.
The small image block feature extraction module 1114b is configured to extract gradient histogram features of the small image block.
The second counting and counting module 1114c is configured to calculate a euclidean distance between the gradient histogram feature of each small image block and each cluster center in the cluster center set obtained by pre-learning, and count a cluster center in the cluster center set that is closest to the euclidean distance of the gradient histogram feature of each small image block.
The sample image feature generating module 1114d is configured to generate a sample image feature according to the cluster center and the counting result counted by the gradient histogram feature corresponding to each small image block.
In one embodiment, the small image block feature extraction module 1114b includes a subunit division module 1114b1 and a second feature concatenation module 1114b2.
The subunit dividing module 1114b1 is used to divide each small image block into a plurality of subunits of the same size and not overlapping.
The second feature stitching module 1114b2 is configured to count gradient histogram features of 8 directions on each subunit, and stitch the gradient histogram features of the subunits corresponding to each small image block to obtain the gradient histogram feature of each small image block.
As shown in fig. 19, in an embodiment, the product classification apparatus further includes a cluster center set obtaining module 1130, and the cluster center set obtaining module 1130 includes a product sample selecting module 1132, an image sub-block segmenting module 1134, an image sub-block feature extracting module 1136, and a clustering module 1138.
The product sample selecting module 1132 is configured to select, from the training sample set, product samples corresponding to preset selection numbers of each preset category respectively.
The image sub-block segmentation module 1134 is configured to segment the product sample image corresponding to the selected product sample into a plurality of image sub-blocks with the same size, where adjacent image sub-blocks have overlapping portions.
The image sub-block feature extraction module 1136 is configured to extract gradient histogram features of the image sub-blocks.
The clustering module 1138 is configured to cluster the gradient histogram features of the image sub-blocks into a cluster center with a preset number of cluster centers, so as to obtain a cluster center set.
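The cluster center set described for the clustering module 1138 can be learned with standard k-means (Lloyd's algorithm); the text fixes only the preset number of centers, so the initialization, iteration count, and the two-dimensional toy features below are assumptions.

```python
import math
import random

def kmeans(features, k, iters=20, seed=0):
    """Cluster the gradient histogram features of the image sub-blocks into
    k cluster centers, yielding the cluster center set. Lloyd's algorithm
    with random initialization is assumed here."""
    rng = random.Random(seed)
    centers = rng.sample(features, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:  # assign each feature to its nearest center
            i = min(range(k), key=lambda c: math.dist(f, centers[c]))
            groups[i].append(f)
        for i, g in enumerate(groups):
            if g:  # keep the old center if nothing was assigned to it
                centers[i] = [sum(vals) / len(g) for vals in zip(*g)]
    return centers

# Hypothetical 2-D stand-ins for sub-block gradient histogram features.
feats = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centers = kmeans(feats, k=2)
print(len(centers))  # the preset number of cluster centers
```

The returned centers form the pre-learned cluster center set against which the statistics and counting modules later measure Euclidean distances.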
The above-mentioned embodiments express only several embodiments of the present invention, and while their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.