Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms related to the embodiments of the present application will be described.
Deep walk model (DeepWalk): a model that implements the deep walk algorithm. A graph or network is input into the deep walk model, and vector representations of the vertexes in the network are output. DeepWalk learns a feature representation of the network by means of truncated random walks (Truncated Random Walk), and can obtain good results even when few vertexes in the network are labeled. The algorithm is also scalable and can adapt to changes in the network.
Co-occurrence relationship: a relationship between target words that is determined by analyzing whether at least two target words occur simultaneously in a text. The relationship between target words is usually determined according to the number of times the target words co-occur in the same article, and relationships between words can be found through word frequency analysis or clustering analysis, so that the topic of the article is better determined. For example, the character roles appearing in a novel are extracted, and the relationship between two character roles is determined according to the number of times the two characters appear simultaneously.
Machine Learning (ML): a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, artificial intelligence technology has been researched and applied in various fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, smart text auditing, and smart recognition of text quality. It is believed that, with the development of technology, artificial intelligence technology will be applied in more fields and become increasingly important.
The text quality recognition method provided by the embodiments of the present application can be applied to computer equipment with relatively strong data processing capability. In a possible implementation manner, the text quality recognition method provided by the embodiments of the present application can be applied to a personal computer, a workstation or a server; that is, training or using the text quality recognition model can be realized on the personal computer, the workstation or the server.
The trained text quality recognition model can be implemented as a part of an application program and installed in a terminal, so that the application program can recognize text quality and push higher-quality texts to the user; alternatively, the trained text quality recognition model is arranged in a background server of the application program, so that a terminal provided with the application program can recognize text quality by means of the background server.
FIG. 1 illustrates a schematic diagram of a computer system provided in an exemplary embodiment of the present application. The computer system 100 includes a terminal 110 and a server 120, where the terminal 110 and the server 120 communicate data through a communication network, alternatively, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 110 is provided with an application program supporting a text reading function, and the application program may be any one of a browser application program, a news application program, a social application program, a web question-and-answer sharing application program, and a text reading application program, which is not limited in the embodiment of the present application.
Optionally, the terminal 110 may be a mobile terminal such as a smart phone, a smart watch, a tablet computer, an electronic book reader, a laptop computer, or an intelligent robot, or a terminal such as a desktop computer or a projection computer, which is not limited in the embodiments of the present application.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. In one possible implementation, the server 120 is a background server for the application in the terminal 110.
As shown in fig. 1, in the present embodiment, a browser application is installed and operated in the terminal 110, a list of recommended articles is displayed on the browser application, and a user browses favorite articles through the browser application. The articles in the recommended article list are screened by the browser application program through the text quality recognition model in the server 120, and the server 120 sends the screened articles to the terminal 110, so that the recommended article list is displayed on the browser application program.
Illustratively, various articles are stored in the server 120, when the user reads the articles using the browser application program on the terminal 110, the terminal 110 sends the user account to the server 120, the server 120 selects high-quality articles from the stored articles for the user according to the user account, forms a recommended article list, and sends the recommended article list to the terminal 110.
The server 120 includes the text quality recognition model 10. After the server 120 acquires the text 11, the server 120 extracts the keywords 12 from the text 11, determines the extracted keywords 12 as graph nodes, and constructs the co-occurrence relationship structure diagram 13 from the graph nodes. The co-occurrence relationship structure diagram 13 is input into the deep walk model 14, graph embedding processing is performed on the co-occurrence relationship structure diagram, and the graph vector 15 corresponding to the keyword is output. The graph vector 15 and the text vector 16 of the text are input into the text quality recognition model 10, and the quality level prediction probability 17 of the text is output. The server 120 thus classifies the articles according to the quality level prediction probability, and transmits the high-quality articles to the terminal 110 to form the article recommendation list.
The text vector 16 of the text includes at least one of a heading text vector and a body text vector, the heading text vector being a vector corresponding to the heading of the article, and the body text vector being a vector corresponding to the body of the article.
For convenience of description, the following embodiments are described as examples by the server executing the recognition method of text quality.
Fig. 2 shows a flowchart of a method for recognizing text quality according to an exemplary embodiment of the present application. This embodiment is described taking the method for the server 120 in the computer system 100 shown in fig. 1 as an example, the method includes the following steps:
in step 201, a text vector of a text is obtained, where the text vector of the text at least includes one of a heading text vector and a body text vector, the heading text vector is a vector corresponding to a heading of the text, and the body text vector is a vector corresponding to a body of the text.
The text includes at least one of articles, news stories, poems, novels and sharing information in a social sharing application program, and the embodiment of the application does not limit the type of the text.
When the text quality is identified, the quality of the text can be more accurately determined by comprehensively determining a plurality of dimensions of the text. Illustratively, the server includes a corpus in which various types of text are stored. It will be appreciated that the text includes at least one of a title and a body, and that the text vector obtained from the text also includes at least one of a title text vector and a body text vector. The embodiments of the present application are described by taking text including a title and a body as examples.
Illustratively, the server extracts a body text vector and a heading text vector from the body and heading of the text, respectively. Illustratively, the heading and the body are processed using a fast text classification model (FastText model) to obtain the heading text vector and the body text vector. The FastText model is an open-source word vector and text classification tool that can generate word vector representations of text in an unsupervised learning manner. For example, the FastText model can learn that words such as "boy" and "girl" refer to a particular gender and can associate these vectors with related documents; then, when a user enters a query in an application built on the FastText model (for example, "where is my schoolbag now"), documents related to the query can be matched. Compared with neural-network-based classification algorithms, the FastText model accelerates training and testing while keeping high precision, and word vectors can be trained by the FastText model without pre-trained word vectors.
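By way of illustration only, the following is a minimal sketch of extracting a heading text vector and a body text vector. The use of gensim's FastText implementation, the placeholder corpus, the tokenization, the vector size and the mean-pooling of word vectors into a text vector are all assumptions made for the sketch and are not mandated by the embodiment.

```python
# Minimal sketch: unsupervised FastText word vectors, mean-pooled into text vectors.
# gensim, the toy corpus and all parameter values are illustrative assumptions.
from gensim.models import FastText
import numpy as np

corpus = [
    ["the", "boy", "reads", "an", "article"],
    ["the", "girl", "shares", "a", "story"],
]  # tokenized training sentences (placeholder corpus)

# Train word vectors in an unsupervised manner; no pre-trained vectors are required.
ft = FastText(sentences=corpus, vector_size=100, window=3, min_count=1, epochs=10)

def text_vector(tokens):
    """Average the FastText word vectors of the tokens to obtain one text vector."""
    return np.mean([ft.wv[t] for t in tokens], axis=0)

heading_vector = text_vector(["boy", "reads", "article"])             # heading text vector
body_vector = text_vector(["the", "boy", "reads", "an", "article"])   # body text vector
```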
It can be understood that the text may also include content such as labels and comments, so that when the text vector of the text is obtained, a vector corresponding to the labels, a vector corresponding to the comments, and the like may also be obtained. For example, if an article is published on an official account and the article includes comment content from comments made by multiple users, text vectors corresponding to the comment content can be extracted from that comment content.
In some embodiments, heading text vectors and body text vectors may also be extracted from text by other text vector extraction models.
Step 202, obtaining a graph vector corresponding to a keyword in the text, wherein the graph vector is obtained by performing graph embedding processing on the keyword.
The graph vector is a graph vector obtained after the graph embedding process. Extracting keywords in the text, and generating a graph vector of the text by using the keywords, namely performing graph embedding processing on the text to obtain the graph vector corresponding to the keywords in the text.
Before the server obtains the graph vector corresponding to the keyword in the text, it needs to construct a structure diagram corresponding to the text: the keywords of the text are used as the nodes (Node) of the structure diagram, and the co-occurrence relationships between the keywords are used as the relationship edges of the structure diagram, thereby forming the co-occurrence structure diagram of the keywords.
Illustratively, graph embedding processing is performed on the co-occurrence relationship structure graph corresponding to the text through the deep walk model; that is, through the graph embedding processing, the words in the text are mapped into word vectors that can be recognized by a computer.
Step 203, classifying the text vector and the graph vector of the text to obtain the quality level prediction probability corresponding to the text.
Illustratively, a text quality recognition model is constructed in the server. The text quality recognition model is a machine learning model with text recognition capability, which judges the quality of the text in combination with the content represented by the text and has the capability of predicting and judging text quality. The server calls the text quality recognition model to classify the text vector and the graph vector of the text, and the text quality recognition model outputs the quality grade prediction probability of the text. In other embodiments, the text vector includes a heading text vector and a body text vector, and the server calls the text quality recognition model to classify the heading text vector, the body text vector and the graph vector, and the text quality recognition model outputs the quality grade prediction probability of the text.
Step 204, classifying the quality grade of the text according to the quality grade prediction probability.
The quality class prediction probability serves as a hierarchical label for classifying the quality class of the text. Illustratively, the quality of the text is divided into two classes, high quality and low quality, respectively, according to the quality class prediction probability of the text. In some embodiments, the quality of text is divided into good and non-good; in other embodiments, the quality of the text is divided into a class A, a class B, and a class C, where class A indicates that the quality of the text is best, class B indicates that the quality of the text is medium, and class C indicates that the quality of the text is low. The embodiment of the present application does not limit the specific division manner.
Illustratively, a probability threshold is set for the quality level prediction probability, the text with the quality level prediction probability higher than the probability threshold is classified as high quality text, and the text with the quality level prediction probability lower than the probability threshold is classified as low quality text.
In summary, in the method provided in this embodiment, the graph vector is obtained by using the keywords in the text, and the association relationships between the keywords in the text are represented by the graph vector. The central idea of the text can be deeply understood by using the association relationships between the keywords, so that the text content is accurately judged; combined with the text vector of the text (such as features of multiple dimensions like the heading text vector and the body text vector), the text quality recognition model can accurately recognize the overall quality of the text, so that subsequent platforms such as application programs can recommend high-quality text to the user.
Fig. 3 shows a flowchart of a method for recognizing text quality according to another exemplary embodiment of the present application. This embodiment is described taking the method for the server 120 in the computer system 100 shown in fig. 1 as an example, the method includes the following steps:
in step 301, a text vector of a text is obtained, where the text vector of the text at least includes one of a heading text vector and a body text vector, the heading text vector is a vector corresponding to a heading of the text, and the body text vector is a vector corresponding to a body of the text.
The text includes at least one of articles, news stories, poems, novels and sharing information in a social sharing application program, and the embodiment of the application does not limit the type of the text.
When the text quality is identified, the quality of the text can be more accurately determined by comprehensively determining a plurality of dimensions of the text. Illustratively, the server includes a corpus in which various types of text are stored, and it is understood that the text includes at least one of a title and a body, and the text vector obtained from the text also includes at least one of a title text vector and a body text vector. The embodiments of the present application are described by taking text including a title and a body as examples.
Illustratively, the server extracts a body text vector and a heading text vector from the body and heading of the text, respectively. Illustratively, the heading and the body are processed using a fast text classification model (FastText model) to obtain the heading text vector and the body text vector. The FastText model is an open-source word vector and text classification tool that can generate word vector representations of text in an unsupervised learning manner. For example, the FastText model can learn that words such as "boy" and "girl" refer to a particular gender and can associate these vectors with related documents; then, when a user enters a query in an application built on the FastText model (for example, "where is my schoolbag now"), documents related to the query can be matched. Compared with neural-network-based classification algorithms, the FastText model accelerates training and testing while keeping high precision, and word vectors can be trained by the FastText model without pre-trained word vectors.
It can be understood that the text may also include content such as labels and comments, so that when the text vector of the text is obtained, a vector corresponding to the labels, a vector corresponding to the comments, and the like may also be obtained. For example, if an article is published on an official account and the article includes comment content from comments made by multiple users, text vectors corresponding to the comment content can be extracted from that comment content.
In some embodiments, heading text vectors and body text vectors may also be extracted from the text by other text vector extraction models; in other embodiments, the headings and the bodies in the text are mapped into heading text vectors and body text vectors, respectively, by one-hot encoding; in still other embodiments, the headings and the bodies in the text are mapped into heading text vectors and body text vectors, respectively, by word embedding (Word Embedding).
Step 302, extracting keywords from the text.
Illustratively, extracting keywords from text includes at least one of: a keyword extraction algorithm based on statistical characteristics, a keyword extraction algorithm based on a word graph model and a keyword extraction algorithm based on a topic model.
1. The idea of the keyword extraction algorithm based on statistical features is to extract keywords of the text by using statistical information of the words in the text. Usually, the text is preprocessed to obtain a set of candidate words, and the keywords are then obtained from the candidate set by quantizing feature values (a sketch of this approach is given after this list).
2. Keyword extraction based on the word graph model first builds a language network graph of the text, then analyzes the language network graph and searches it for words or phrases that play an important role; these words or phrases are the keywords of the text. Graph nodes in the language network graph are usually words, and according to the different connection relations between words, the main forms of the language network graph are divided into four types: co-occurrence network graphs (co-occurrence relationship structure graphs), grammar network graphs, semantic network graphs, and other network graphs.
3. The keyword extraction algorithm based on the topic model mainly uses the distribution properties of topics in the topic model to extract keywords. Candidate keywords are first obtained from the text, the topic distribution of the text and the distribution of the candidate keywords are calculated, the topic similarity between the text and the candidate keywords is calculated, and the top n words are selected as keywords.
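By way of illustration of the statistical-feature approach (item 1 above), the following is a minimal sketch that quantizes feature values with TF-IDF; scikit-learn, the placeholder documents and the number of keywords k are assumptions made for the sketch rather than requirements of the embodiment.

```python
# Minimal sketch of statistical-feature keyword extraction using TF-IDF weights.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "graph embedding maps graph nodes to vectors",
    "random walk samples node sequences from the graph",
]  # preprocessed candidate texts (placeholder)

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)
vocab = vectorizer.get_feature_names_out()

def top_keywords(doc_index, k=3):
    """Return the k candidate words with the highest TF-IDF weight in one document."""
    row = tfidf[doc_index].toarray().ravel()
    ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
    return [word for word, score in ranked[:k] if score > 0]

print(top_keywords(0))  # e.g. ['graph', 'embedding', 'maps']
```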
In step 303, the keywords are determined as graph nodes.
In the construction process of the language network graph, the preprocessed words are usually used as graph nodes, and the relationships between words are used as edges. In the language network graph, edge weights are generally represented by the degree of association between words. When keywords are obtained using the language network graph, the importance of each graph node needs to be evaluated, the graph nodes are then sorted by importance, and the words represented by the TopK graph nodes are selected as keywords (K is a positive integer).
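By way of illustration only, the following is a minimal sketch of evaluating the importance of graph nodes and selecting the TopK nodes as keywords; PageRank (computed with networkx), the toy graph and K are assumptions made for the sketch, since the embodiment does not prescribe a particular importance measure.

```python
# Minimal sketch: rank graph nodes by an importance score and keep the TopK words.
import networkx as nx

# a tiny language network graph standing in for the one built from the text
G = nx.Graph([("graph", "walk"), ("walk", "vector"), ("graph", "embedding"), ("graph", "vector")])

scores = nx.pagerank(G)                                 # assumed importance measure per graph node
ranked = sorted(scores, key=scores.get, reverse=True)   # order graph nodes by importance
K = 3
keywords = ranked[:K]                                   # words represented by the TopK graph nodes
```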
Step 304, constructing a co-occurrence relationship structure diagram according to the graph nodes, wherein the co-occurrence relationship structure diagram is used for representing the co-occurrence relationships among the keywords in the text.
The embodiments of the present application will be described by taking a co-occurrence relationship structure diagram as an example. Generating a relationship edge in the co-occurrence relationship structure diagram according to the co-occurrence relationship of the keywords in the text, and constructing the co-occurrence relationship structure diagram according to the graph nodes and the relationship edge.
A co-occurrence relationship refers to a relationship between target words that is determined by analyzing whether at least two target words occur simultaneously in the text. The relationship between target words is typically determined based on the number of times the target words co-occur in the same article.
Fig. 4 shows a co-occurrence relationship structure diagram provided in an exemplary embodiment of the present application, in which circles containing sequence numbers represent graph nodes (nodes) 21 constructed with keywords in text, and the graph nodes are connected by relationship edges 22, and the relationship edges 22 are used to represent co-occurrence relationships between the keywords. In some embodiments, the length of a relationship edge indicates the importance of a relationship, e.g., a shorter relationship edge between two graph nodes indicates a stronger association between two graph nodes.
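By way of illustration of steps 302 to 304, the following is a minimal sketch that turns extracted keywords into graph nodes and adds a relationship edge whenever two keywords co-occur in the same text; networkx, the placeholder keyword lists and the co-occurrence counting rule (same document) are assumptions made for the sketch.

```python
# Minimal sketch: keywords become graph nodes (step 303); an edge weighted by the
# co-occurrence count links keywords appearing in the same text (step 304).
from itertools import combinations
import networkx as nx

texts_keywords = [
    ["graph", "embedding", "walk"],
    ["graph", "walk", "vector"],
]  # keywords already extracted from each text (placeholder)

G = nx.Graph()
for keywords in texts_keywords:
    G.add_nodes_from(keywords)
    for a, b in combinations(set(keywords), 2):
        weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=weight)   # relationship edge with co-occurrence count

print(G.edges(data=True))
```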
Step 305, calling the deep walk model to perform graph embedding processing on the co-occurrence relationship structure graph, and outputting a graph vector corresponding to the keyword.
An ith graph node in the co-occurrence relationship structure diagram is determined as a root node, where the ith graph node corresponds to the ith keyword in the text, and i is a positive integer; n relationship paths corresponding to the root node serving as an origin are acquired, where the relationship paths correspond one-to-one to relationship edges in the co-occurrence relationship structure diagram, and n is a positive integer; a target relationship path is selected from the n relationship paths for random walk processing, obtaining a group of word sequences corresponding to the target relationship path; the above three steps are repeated until all graph nodes in the co-occurrence relationship structure diagram are traversed, graph embedding processing is performed on the word sequences corresponding to the relationship paths, and the graph vectors corresponding to the keywords are output.
The graph embedding process uses the idea of word embedding (word vectors). The basic processing element of word embedding is a word, while the corresponding element in graph embedding of the co-occurrence relationship structure graph is a graph node; word embedding analyzes word sequences in sentences, while graph embedding analyzes, in a random walk manner, the word sequences formed by the graph nodes in the co-occurrence relationship structure graph. Random Walk refers to repeatedly and randomly selecting walk paths in the co-occurrence relationship structure graph, finally forming a path that passes through the co-occurrence relationship structure graph. Each step of a walk starting from a certain graph node randomly selects one of the relationship edges connected to the current graph node and moves along the selected relationship edge to the next graph node, and this process is repeated.
As shown in fig. 4, the walk starts from the graph node numbered 3: a path pointing to the graph node numbered 4 is selected, then a path pointing to the graph node numbered 6 is selected, then a path pointing to the graph node numbered 7 is selected, so that the whole walk path is graph node 3 → graph node 4 → graph node 6 → graph node 7. It will be appreciated that any graph node may be designated as the initial graph node of the walk.
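By way of illustration only, the following is a minimal sketch of the truncated random walk described above: each graph node in turn serves as the root node, and at every step a random neighboring graph node is selected along a connected relationship edge. The walk length, the number of walks per root node and the toy graph are assumptions made for the sketch.

```python
# Minimal sketch of truncated random walks over the co-occurrence relationship graph.
import random
import networkx as nx

# a tiny co-occurrence graph standing in for the one built in step 304
G = nx.Graph([("graph", "walk"), ("walk", "vector"), ("graph", "embedding")])

def random_walk(graph, root, walk_length=5):
    """Return one word sequence obtained by walking from the root node."""
    walk = [root]
    while len(walk) < walk_length:
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:              # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

def build_walks(graph, walks_per_node=10, walk_length=5):
    """Traverse all graph nodes, using each one as a root node several times."""
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph.nodes())
        random.shuffle(nodes)
        for root in nodes:
            walks.append(random_walk(graph, root, walk_length))
    return walks

walks = build_walks(G)
```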
The co-occurrence relationship structure diagram is defined as G = (V, E), where V represents the set of graph nodes of the co-occurrence relationship structure diagram, E represents the set of relationship edges of the co-occurrence relationship structure diagram, and E ⊆ (V × V).
Illustratively, taking the ith graph node corresponding to the ith keyword as the root node, a path obtained by the random walk is (v0, v1, ..., vi); when the ith word is embedded, the probability P(vi | (v0, v1, ..., vi-1)) of the ith target word needs to be calculated.
In the random walk process, the window size needs to be determined, and the window size refers to the unit step length moving along the target relation path in the random walk process. Obtaining a target relation path and a window size in a random walk process, wherein the window size corresponds to the number of intercepted graph nodes; and intercepting the first m graph nodes and the last m graph nodes which are connected with the ith graph node according to the window size along a target relationship path to obtain a group of word sequences corresponding to the target relationship path, wherein m is a positive integer. As shown in fig. 5, when the window size is 2 with the t-th word as the center, two words (w (t-2), w (t-1)) before the t-th word and two words (w (t+1), w (t+2)) after the t-th word are intercepted.
Illustratively, the probability of the ith target word is calculated using a word vector model (Word2vec model). Common Word2vec models include the continuous bag-of-words model (CBOW model) and the skip-gram model; the CBOW model predicts the center word from the context, the Skip-gram model predicts the context from the center word, and the two models are similar. The structure of the Skip-gram model is shown in fig. 5. The probability is calculated by the Skip-gram model as follows:
P(w(i-m+j) | wi) = exp(u(i-m+j)^T · vi) / Σ(k∈V) exp(uk^T · vi)

where wi represents the ith keyword, m represents the window size of the random walk, j represents the order of the keywords in the word sequence, vi represents the word vector of the ith graph node corresponding to the ith keyword, V represents the set formed by all word sequences, uk^T represents the transpose of the word vector (matrix) corresponding to the kth word, and u(i-m+j)^T represents the transpose of the word vector (matrix) corresponding to the (i-m+j)th word.
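By way of illustration only, the following is a minimal sketch of the graph embedding step: the word sequences produced by the random walks are fed to a Skip-gram Word2Vec model, and the learned vectors serve as the graph vectors of the keywords. gensim and the parameter values (vector size, window size m, number of epochs) are assumptions made for the sketch.

```python
# Minimal sketch: Skip-gram training over the random-walk word sequences.
from gensim.models import Word2Vec

walks = [
    ["graph", "walk", "vector", "walk", "graph"],
    ["graph", "embedding", "graph", "walk"],
]  # word sequences from the truncated random walks (placeholder)

model = Word2Vec(
    sentences=walks,
    vector_size=128,   # dimension of the graph vector
    window=2,          # window size m used when intercepting graph nodes
    sg=1,              # sg=1 selects the Skip-gram model
    min_count=1,
    epochs=20,
)

graph_vector = model.wv["graph"]   # graph vector corresponding to the keyword "graph"
```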
Step 306, calling a full connection layer in the text quality recognition model to splice the text vector and the graph vector of the text, and outputting a spliced first vector.
As shown in fig. 6, the text vector 16 of the text and the graph vector 15 corresponding to the keyword are subjected to concatenation processing 31 (Concat), and the concatenated vector is input into the text quality recognition model 10, which includes a plurality of fully connected layers 101 (Dense layers); the concatenated vector is processed by the plurality of fully connected layers 101, and the first vector is output. The text vector 16 includes at least one of a heading text vector and a body text vector, the heading text vector being a vector corresponding to the heading of the text, and the body text vector being a vector corresponding to the body of the text.
Step 307, calling the discarding layer in the text quality recognition model to process the first vector and output a second vector, where the discarding layer is a neural network layer used for filtering useless feature vectors out of the first vector.
The text quality recognition model 10 includes a discarding layer (Dropout layer) 102, and the discarding layer 102 is called to process the first vector and output the second vector. During forward propagation, the activation value of a neuron stops working with a certain probability p, which makes the model generalize better because the model no longer depends too much on certain local features. The proportion of neurons that stop working is controlled by the discard rate, which is the ratio of the number of discarded neuron nodes in the layer to the total number of neuron nodes.
And 308, calling a logistic regression layer in the text quality recognition model to process the second vector, and outputting the quality grade prediction probability corresponding to the text.
The text quality recognition model 10 includes a logistic regression layer (sigmoid layer) 103, and the logistic regression layer 103 is called to process the second vector and output the quality level prediction probability corresponding to the text. The output value of the sigmoid layer is between 0 and 1; in a binary classification task, the output of the sigmoid layer is the prediction probability, and when the output prediction probability meets a certain condition, the input corresponding to the prediction probability is classified into the positive class. For example, if the probability threshold for the quality level prediction probability is set to 0.7 and the quality level prediction probability output by the logistic regression layer 103 is 0.8, the quality level of the text corresponding to that quality level prediction probability is high (high-quality text).
The quality class prediction probability is calculated by the following formula:
where ci represents the ith keyword, v is a graph node in the co-occurrence relationship structure diagram, Z represents a normalization constant, and E represents the set of relationship edges in the co-occurrence relationship structure diagram.
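By way of illustration of steps 306 to 308, the following is a minimal sketch of the classification part of the text quality recognition model, assuming PyTorch as the framework; the vector dimensions, the number of fully connected layers, the dropout rate and the 0.7 probability threshold reuse the examples above or are placeholders and are not prescribed by the embodiment.

```python
# Minimal sketch: concatenation + fully connected layers (step 306) + discarding
# layer (step 307) + logistic regression layer (step 308).
import torch
import torch.nn as nn

class TextQualityModel(nn.Module):
    def __init__(self, text_dim=200, graph_dim=128, hidden_dim=64, drop_rate=0.5):
        super().__init__()
        self.dense = nn.Sequential(              # fully connected (Dense) layers
            nn.Linear(text_dim + graph_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.dropout = nn.Dropout(p=drop_rate)   # discarding layer
        self.out = nn.Linear(hidden_dim, 1)      # logistic regression (sigmoid) layer

    def forward(self, text_vec, graph_vec):
        x = torch.cat([text_vec, graph_vec], dim=-1)   # concatenation (Concat)
        first_vec = self.dense(x)
        second_vec = self.dropout(first_vec)
        return torch.sigmoid(self.out(second_vec))     # quality level prediction probability

model = TextQualityModel()
prob = model(torch.randn(1, 200), torch.randn(1, 128))
is_high_quality = prob.item() > 0.7   # probability threshold from the example above
```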
Step 309, classifying the quality grade of the text according to the quality grade prediction probability.
The quality class prediction probability serves as a hierarchical label for classifying the quality class of the text. Illustratively, the quality of the text is divided into two classes, high quality and low quality, respectively, according to the quality class prediction probability of the text. In some embodiments, the quality of text is divided into good and non-good; in other embodiments, the quality of the text is divided into a class A, a class B, and a class C, where class A indicates that the quality of the text is best, class B indicates that the quality of the text is medium, and class C indicates that the quality of the text is low. The embodiment of the present application does not limit the specific division manner.
Illustratively, a probability threshold is set for the quality level prediction probability, the text with the quality level prediction probability higher than the probability threshold is classified as high quality text, and the text with the quality level prediction probability lower than the probability threshold is classified as low quality text.
In summary, in the method of this embodiment, the graph vector is obtained by using the keywords in the text, and the association relationships between the keywords in the text are represented by the graph vector. The central idea of the text can be deeply understood by using the association relationships between the keywords, so that the text content is accurately judged; combined with the text vector of the text (such as features of multiple dimensions like the heading text vector and the body text vector), the text quality recognition model can accurately recognize the overall quality of the text, so that subsequent platforms such as application programs can recommend high-quality text to the user.
The method comprises the steps of extracting keywords from a text, constructing a co-occurrence relation structure diagram according to the keywords and co-occurrence relations among the keywords, converting the text into the co-occurrence relation structure diagram, representing the content of the text by the co-occurrence relation structure diagram, and generating word vectors of the text according to the co-occurrence relation structure diagram by using a deep walk model.
And constructing a co-occurrence relation structure diagram by combining relation edges corresponding to the co-occurrence relation among the keywords and nodes corresponding to the keywords, and determining the relation among the keywords more intuitively through the co-occurrence relation structure diagram.
By taking any graph node as a root node for random walk processing, graph nodes in the co-occurrence relation structure diagram can be traversed to form a keyword sequence corresponding to the graph nodes, and keywords can be accurately mapped into word vectors through subsequent word embedding processing.
The graph nodes are intercepted by utilizing the window size to form a word sequence, so that the word sequence extracted from the co-occurrence relation structure diagram is ensured to be correct, and the word sequence is also ensured to be correct when being mapped into word vectors subsequently.
By arranging the full connection layer, the discarding layer and the logistic regression layer in the text quality recognition model, the text quality recognition model can accurately recognize the text quality.
In an alternative embodiment based on fig. 3, the deep walk model is trained by the following steps:
in step 320, sample text is obtained, where the sample text corresponds to a real sample map vector.
Illustratively, the sample text is stored in the corpus of the server, or the sample text is text stored in a terminal, and the terminal sends the sample text to the server.
Illustratively, step 320 may be replaced by the steps of:
In step 3201, text rating data is acquired, wherein the text rating data comprises at least one of a reading amount, a comment amount, a forwarding amount and a praise amount of the text.
The server stores the sample text and the corresponding text evaluation data. Illustratively, the text evaluation data of the sample text includes the reading amount and the comment amount of the text; for example, the comment amount of an article posted on one official account is 10,350 and its reading amount is 100,000, while the comment amount of an article posted on another official account is 2,300 and its reading amount is 30,000.
In step 3202, positive sample text and negative sample text are selected from the sample text according to the text evaluation data.
Illustratively, corresponding preset conditions are set for the text evaluation data; for example, articles with a comment amount exceeding 10,000 are high-quality articles, or articles with a reading amount exceeding 50,000 are high-quality articles, and articles whose comment amount or reading amount does not satisfy the preset condition are non-high-quality articles. For another example, articles with a comment amount exceeding 10,000 but not exceeding 50,000 are medium-quality articles. The embodiment of the present application does not limit the manner of setting the evaluation criteria. The positive sample text is sample text with a high quality grade, and the negative sample text is sample text with a low quality grade.
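By way of illustration of step 3202, the following is a minimal sketch that screens positive and negative sample texts with the example thresholds above (a comment amount over 10,000 or a reading amount over 50,000 marks a positive sample); the dictionary field names are placeholders rather than a prescribed data format.

```python
# Minimal sketch: split sample texts into positive and negative samples by
# thresholds on the text evaluation data.
samples = [
    {"text": "article A", "comments": 10350, "reads": 100000},
    {"text": "article B", "comments": 2300, "reads": 30000},
]

def is_positive(sample, comment_threshold=10000, read_threshold=50000):
    return sample["comments"] > comment_threshold or sample["reads"] > read_threshold

positive_samples = [s for s in samples if is_positive(s)]
negative_samples = [s for s in samples if not is_positive(s)]
```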
The sample text used for training corresponds to a truly generated sample graph vector. The sample graph vector is generated by another trained deep walk model.
And 321, extracting sample keywords from the sample text, wherein the sample keywords correspond to sample graph nodes in the sample co-occurrence relation structure chart one by one.
Keywords of the sample text may be extracted by the method in the above embodiment, or by manual reading, which is not limited in this embodiment of the present application. And taking the sample keywords in the sample text as sample graph nodes in the sample co-occurrence relation structure chart.
Step 322, constructing a sample co-occurrence relationship structure diagram according to the sample graph nodes, wherein the sample co-occurrence relationship structure diagram is used for representing co-occurrence relationships among sample keywords in the sample text.
And generating a relationship edge in the sample co-occurrence relationship structure diagram according to the co-occurrence relationship among the sample keywords in the sample text, and constructing the sample co-occurrence relationship structure diagram according to the sample graph nodes and the relationship edge determined according to the sample keywords in the step 321.
Step 323, inputting the sample co-occurrence relationship structure diagram into the deep walk model, and outputting the predicted sample graph vector corresponding to the sample keyword.
The sample co-occurrence relationship structure diagram is spliced (Concat) with the heading text vector extracted from the heading of the sample text and the body text vector extracted from the body of the sample text, the spliced vector is input into the deep walk model, and the predicted sample graph vector corresponding to the sample keyword is output through a plurality of fully connected layers, a discarding layer and a logistic regression layer.
Step 324, training the deep walk model according to the predicted sample graph vector and the real sample graph vector to obtain a trained deep walk model.
An error result between the predicted sample graph vector and the real sample graph vector is calculated using an error function, and the deep walk model is trained using a back propagation algorithm according to the error result.
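By way of illustration of step 324, the following is a minimal sketch of one training iteration: an error between the predicted sample graph vector and the real sample graph vector is computed and back-propagated. The embodiment does not name the error function, so mean squared error and the Adam optimizer are assumptions, and the linear layer merely stands in for the model being trained.

```python
# Minimal sketch: compute an error result between predicted and real sample graph
# vectors (assumed MSE) and update the model with back propagation.
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                 # placeholder for the model being trained
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # assumed error function

sample_input = torch.randn(16, 128)         # vectors derived from the sample co-occurrence graph
real_graph_vectors = torch.randn(16, 128)   # truly generated sample graph vectors

predicted_graph_vectors = model(sample_input)
loss = loss_fn(predicted_graph_vectors, real_graph_vectors)

optimizer.zero_grad()
loss.backward()                             # back propagation of the error result
optimizer.step()
```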
In summary, according to the method of the embodiment, the positive sample and the negative sample which are more suitable for training are selected from the sample text through the text evaluation data, so that the text quality recognition model can be comprehensively trained, and the trained text quality recognition model can accurately recognize the quality of the text.
In one example, a user reads an article in a sharing type application, a terminal used by the user sends a reading request to a server, and the server recommends a good article to the user according to the method provided in the above embodiment.
The quality determination method for an article generally considers two aspects: the objective prior experience of the article (including at least one of the typesetting of the article, the clarity of the article's accompanying images, and the attractiveness of the images and their fit with the article content) and the quality of the text content of the article. The text content quality can be divided into two measurement dimensions: the basic content quality of the article and the attractiveness of the article content. The embodiments of the present application are described with respect to the dimension of article content attractiveness.
First, the server extracts the keywords of all articles from the corpus to construct Nodes (graph nodes), constructs a co-occurrence relationship graph through the co-occurrence relationships of the keywords in the same article, then samples the co-occurrence relationship structure graph using Random Walk, and trains the Graph Embedding of the graph nodes using the Skip-gram model. After graph training based on the article keywords and FastText pre-training of word vectors based on the article headings and bodies, attractive positive and negative samples are screened according to the posterior consumption data (text evaluation data) of the image-text content by users; the text vectors generated by FastText pre-training of the headings and bodies are spliced (Concat) with the graph vectors generated from the article keywords and used as the model input, and finally a plurality of Dense layers are connected for classification, thereby completing the recognition scheme for high-quality attractive image-text content. As shown in fig. 6, the quality level prediction probability (quality level label) of the article is output by the logistic regression layer, and the quality level prediction probability is input into the attractiveness model 32, so that whether the article is attractive to the user can be judged subsequently. In some embodiments, the quality level prediction probability of the article may also be input into other models so that information of interest to the user can be accurately recommended subsequently.
Through experiments, compared with methods used in the related art, with the method provided by the embodiments of the present application the recognition accuracy of text quality reaches 95.86%, the coverage rate of high-quality image-text content reaches 16.8%, the overall high-quality exposure ratio is improved by 0.5%, the total efficiency is improved by 0.4%, the total duration is improved by 0.32%, and the pushing efficiency is improved by 0.28%.
The method provided by the embodiments of the present application can be applied to the content processing link of a content center. After the server scores the content quality of all image-text content, the server distributes the image-text content to the terminal side, and the terminal side performs hierarchical recommendation weighting according to the content quality score, for example, increasing the recommendation weight of identified high-quality content and reducing the recommendation weight of low-quality content.
Fig. 7 is a block diagram of a text quality recognition apparatus according to an exemplary embodiment of the present application, the apparatus including:
The obtaining module 710 is configured to obtain a text vector of a text, where the text vector of the text includes at least one of a heading text vector and a body text vector, the heading text vector is a vector corresponding to a heading of the text, and the body text vector is a vector corresponding to a body of the text;
the obtaining module 710 is configured to obtain a graph vector corresponding to a keyword in a text, where the graph vector is obtained by performing graph embedding processing on the keyword;
the classification module 720 is configured to classify a text vector and a graph vector of the text, so as to obtain a quality level prediction probability corresponding to the text;
and a quality classification module 730, configured to classify the quality level of the text quality according to the quality level prediction probability.
In an alternative embodiment, the apparatus includes a processing module 740;
the obtaining module 710 is configured to extract keywords from text; determining the keywords as graph nodes;
the processing module 740 is configured to construct a co-occurrence relationship structure diagram according to the graph nodes, where the co-occurrence relationship structure diagram is used to characterize the co-occurrence relationships between keywords in the text; and call the deep walk model to perform graph embedding processing on the co-occurrence relationship structure graph and output the graph vector corresponding to the keyword.
In an optional embodiment, the processing module 740 is configured to determine an ith graph node in the co-occurrence relationship structure chart as a root node, where the ith graph node corresponds to an ith keyword in the text, and i is a positive integer; the acquiring module 710 is configured to acquire n relationship paths corresponding to the root node as an origin, where the relationship paths correspond to relationship edges in the co-occurrence relationship structure map one to one, and n is a positive integer; the processing module 740 is configured to select a target relationship path from the n relationship paths to perform random walk processing, so as to obtain a group of word sequences corresponding to the target relationship path; and repeatedly executing the three steps until all graph nodes in the co-occurrence relation structure diagram are traversed, performing graph embedding processing on the word sequence corresponding to the relation path, and outputting the graph vector corresponding to the keyword.
In an alternative embodiment, the processing module 740 is configured to obtain a target relationship path and a window size during the random walk, where the window size corresponds to the number of the intercepted graph nodes; and intercepting the first m graph nodes and the last m graph nodes which are connected with the ith graph node according to the window size along the target relationship path to obtain a group of word sequences corresponding to the target relationship path, wherein m is a positive integer.
In an alternative embodiment, the processing module 740 is configured to generate a relationship edge in the co-occurrence relationship structure diagram according to the co-occurrence relationship of the keyword in the text; and constructing a co-occurrence relation structure diagram according to the graph nodes and the relation edges.
In an optional embodiment, the processing module 740 is configured to call a full connection layer in the text quality recognition model to perform a stitching process on the text vector and the graphics vector of the text, and output a stitched first vector; invoking a discarding layer in the text quality recognition model to process the first vector and output a second vector, wherein the discarding layer is a neural network layer used for filtering useless feature vectors from the first vector; and calling a logistic regression layer in the text quality recognition model to process the second vector, and outputting the quality grade prediction probability corresponding to the text.
In an alternative embodiment, the apparatus includes a training module 750;
the training module 750 is configured to obtain sample text, where the sample text corresponds to a real sample graph vector; extract sample keywords from the sample text, where the sample keywords correspond one-to-one to sample graph nodes in a sample co-occurrence relationship structure diagram; construct the sample co-occurrence relationship structure diagram according to the sample graph nodes, where the sample co-occurrence relationship structure diagram is used to represent the co-occurrence relationships among the sample keywords in the sample text; input the sample co-occurrence relationship structure diagram into the deep walk model and output a predicted sample graph vector corresponding to the sample keyword; and train the deep walk model according to the predicted sample graph vector and the real sample graph vector to obtain a trained deep walk model.
In an alternative embodiment, the training module 750 is configured to obtain text evaluation data, where the text evaluation data includes at least one of a reading amount, a comment amount, a forwarding amount, and a praise amount of the text; and selecting positive sample text and negative sample text from the sample text according to the text evaluation data.
In summary, in the device provided in this embodiment, the graph vector is obtained by using the keywords in the text, and the association relationships between the keywords in the text are represented by the graph vector. The central idea of the text can be deeply understood by using the association relationships between the keywords, so that the text content is accurately judged; combined with the text vector of the text (such as features of multiple dimensions like the heading text vector and the body text vector), the text quality recognition model can accurately recognize the overall quality of the text, so that subsequent platforms such as application programs can recommend high-quality text to the user.
The method comprises the steps of extracting keywords from a text, constructing a co-occurrence relation structure diagram according to the keywords and co-occurrence relations among the keywords, converting the text into the co-occurrence relation structure diagram, representing the content of the text by the co-occurrence relation structure diagram, and generating word vectors of the text according to the co-occurrence relation structure diagram by using a deep walk model.
And constructing a co-occurrence relation structure diagram by combining relation edges corresponding to the co-occurrence relation among the keywords and nodes corresponding to the keywords, and determining the relation among the keywords more intuitively through the co-occurrence relation structure diagram.
By taking any graph node as a root node for random walk processing, graph nodes in the co-occurrence relation structure diagram can be traversed to form a keyword sequence corresponding to the graph nodes, and keywords can be accurately mapped into word vectors through subsequent word embedding processing.
The graph nodes are intercepted by utilizing the window size to form a word sequence, so that the word sequence extracted from the co-occurrence relation structure diagram is ensured to be correct, and the word sequence is also ensured to be correct when being mapped into word vectors subsequently.
By arranging the full connection layer, the discarding layer and the logistic regression layer in the text quality recognition model, the text quality recognition model can accurately recognize the text quality.
Positive samples and negative samples which are more suitable for training are selected from sample texts through text evaluation data, so that a text quality recognition model can be comprehensively trained, and the trained text quality recognition model can accurately recognize the quality of the texts.
It should be noted that: the text quality recognition device provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the functions described above. In addition, the text quality recognition device provided in the above embodiment and the text quality recognition method embodiment belong to the same concept, and detailed implementation processes of the text quality recognition device are shown in the method embodiment, and are not repeated here.
Fig. 8 shows a schematic structural diagram of a server according to an exemplary embodiment of the present application. The server may be such as server 120 in computer system 100 shown in fig. 1.
The server 800 includes a central processing unit (CPU, Central Processing Unit) 801, a system memory 804 including a random access memory (RAM, Random Access Memory) 802 and a read only memory (ROM, Read Only Memory) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806, which facilitates the transfer of information between the various devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse, keyboard, or the like, for user input of information. Wherein both the display 808 and the input device 809 are connected to the central processing unit 801 via an input output controller 810 connected to the system bus 805. The basic input/output system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer readable medium (not shown) such as a hard disk or compact disc read only memory (CD-ROM, compact Disc Read Only Memory) drive.
Computer readable media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (EPROM, erasable Programmable Read Only Memory), electrically erasable programmable read-only memory (EEPROM, electrically Erasable Programmable Read Only Memory), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (DVD, digital Versatile Disc) or solid state disks (SSD, solid State Drives), other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. The random access memory may include resistive random access memory (ReRAM, resistance Random Access Memory) and dynamic random access memory (DRAM, dynamic Random Access Memory), among others. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 804 and mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 800 may also be operated by a remote computer connected through a network, such as the Internet. That is, the server 800 may be connected to the network 812 through the network interface unit 811 connected to the system bus 805, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 811.
The memory also includes one or more programs, one or more programs stored in the memory and configured to be executed by the CPU.
In an alternative embodiment, a computer device is provided that includes a processor and a memory having at least one instruction, at least one program, code set, or instruction set stored therein, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by the processor to implement a method of identifying text quality as described above.
In an alternative embodiment, a computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set loaded and executed by a processor to implement a method of identifying text quality as described above is provided.
Alternatively, the computer-readable storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), solid state disk (SSD, solid State Drives), or optical disk, etc. The random access memory may include resistive random access memory (ReRAM, resistance Random Access Memory) and dynamic random access memory (DRAM, dynamic Random Access Memory), among others. The foregoing embodiment numbers are merely for describing, and do not represent advantages or disadvantages of the embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform the method of identifying text quality as described in the above aspect.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments is merely exemplary and is not intended to limit the present application; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.