CN103064987B

CN103064987B - Method for identifying false transaction information

Info

Publication number: CN103064987B
Application number: CN201310037691.8A
Authority: CN
Inventors: 王永康; 张爱华
Original assignee: Beijing 58 Information Technology Co Ltd
Current assignee: Beijing 58 Information Technology Co Ltd
Priority date: 2013-01-31
Filing date: 2013-01-31
Publication date: 2016-09-21
Anticipated expiration: 2033-01-31
Also published as: CN103064987A

Abstract

The invention discloses a method for identifying false transaction information, including: step S101, obtaining the information features, information content and/or picture information of information published by a user; step S201, identifying false transaction information in the information published by the user according to those information features, information content and/or picture information. The invention can greatly reduce the amount of false transaction information, improve the authenticity of transaction information, improve the user experience, and at the same time greatly reduce labor costs.

Description

Method for identifying false transaction information

Technical field

The present invention relates to the field of Internet technology, and in particular to a method for identifying false transaction information.

Background technology

With the development of the Internet, online information has proliferated and it has become harder and harder to tell what is true and what is false. For websites such as e-commerce or classified-information sites, providing users with safe, genuine merchandise information has become an important and basic requirement. How to determine whether information published by a user is genuine has therefore become the key to guaranteeing information security, and it is a problem that many websites face.

At present, false transaction information is identified mainly by manual review, supplemented by some technical means, such as identifying blacklisted IP (Internet Protocol) addresses, or determining that the content or format of the published information is illegal or that the price range is illegal, and then deleting the information that has been determined with certainty to be illegal.

The shortcomings of the existing approach are as follows: manual review consumes a great deal of manpower, and the auxiliary technical means can delete only a small part of the false transaction information, so a large amount of false transaction information escapes. Information that is determined to be false with 100% certainty can be deleted, but nothing can be done about information that is, for example, 85% likely to be false, because the existing means cannot judge the degree to which information is false.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for identifying false transaction information, in order to solve the problems of high manpower consumption and a low identification rate when false transaction information is identified in the prior art.

To solve the above technical problem, in one aspect, the present invention provides a method for identifying false transaction information, including:

Step S101, obtaining the information features, information content and/or picture information of information published by a user;

Step S201, identifying false transaction information in the information published by the user according to the information features, information content and/or picture information of that information.

Further, before the information features of the information published by the user are obtained, the method includes the following steps:

Step S1011, obtaining the basic data of information previously published by users;

Step S1012, extracting training data from the obtained basic data of the previously published information, and determining positive and negative samples;

Step S1013, performing feature conversion on the data in the positive and negative samples to obtain data in a set data format;

Step S1014, building a regression model from the data in the set data format.

Further, step S1013 specifically includes:

Defining each feature of every record in the positive and negative samples as one of two types, numeric or enumerated;

For a numeric feature, the dimension value is unchanged: the value of the numeric data is placed at the position that the numeric data occupies in the sample;

For an enumerated feature, the md5 value of the dimension value is first calculated, and the md5 value is then taken modulo W to obtain a modulo result; the value at the position in the sample given by the modulo result is set to 1.

Further, step S1014 specifically includes:

Converting the data in the set data format obtained in step S1013 into a sparse matrix;

Inputting the generated sparse matrix (x₁, x₂, x₃, x₄, x₅, ..., x_p) into a model training program, where p is the number of dimensions of the data in the set data format, and obtaining the parameters (β₀, β₁, β₂, β₃, β₄, β₅, ..., β_p) corresponding to each record;

Building the regression model, which is: P(Y=1|x) = 1 / (1 + e^(-g(x))), where g(x) = β₀ + β₁x₁ + β₂x₂ + … + β_px_p.

Further, after the regression model has been built, when information published by a user is received, step S101 is specifically:

Step S1015, obtaining the basic data of the information published by the user, including extracting the basic features of the published information and deriving meta-features; the basic features and the meta-features together serve as the basic data to be mined.

Further, after the basic data of the information published by the user has been obtained, step S201 specifically includes the following steps:

Step S2011, performing feature conversion on the obtained basic data of the information published by the user to obtain data in a format that the model can process;

Step S2012, converting the data obtained in step S2011 into sparse-matrix form and identifying false information with the regression model; where, if P > M, then Y=1, indicating that the information published by the user is genuine transaction information; otherwise, if P ≤ M, then Y=0, indicating that the information published by the user is false transaction information; M is a preset threshold.

Further, before the information content of the information published by the user is obtained, the method includes the following steps:

Step S1021, obtaining the information content of information previously published by users and reviewing it, and dividing the information into two classes, information that passed review and information that did not, as the sample data for classification;

Step S1022, performing word segmentation on the information content in the samples;

Step S1023, extracting feature words by calculation;

Step S1024, calculating the feature value of each feature word in every document of each class;

Step S1025, obtaining an identification model by training on the feature values of each word in every document of each class.

Further, step S1023 specifically includes:

Calculating the CHI value of each word; the chi-square test formula is: χ²(t, c) = N × (A×D − B×C)² / ((A+C) × (B+D) × (A+B) × (C+D)), where A is the number of documents in this class that contain the word; B is the number of documents not in this class that contain the word; C is the number of documents in this class that do not contain the word; D is the number of documents not in this class that do not contain the word; N is the total number of documents; t is the word whose CHI value is being calculated; c is the class; χ² is the chi-square test (CHI) value;

Then taking the P words with the largest CHI values among all words as the feature words;

Step S1024 specifically includes:

Calculating the feature values with the TFIDF algorithm or a variant of the TFIDF algorithm, where the TFIDF approach is to count the occurrences of each feature word in every document of each class and the number of documents containing the word, and to use the TFIDF value as the feature value; every document is converted into the format: class ID \t feature index \t feature value; the TFIDF formula is TFIDF = TF × IDF, where TF is the frequency with which a feature word occurs in the document and IDF is the inverse document frequency, i.e. the total number of documents divided by the number of documents containing the word.

Further, after the information content of the information published by the user has been obtained, step S201 specifically includes the following steps:

Step S2021, performing word segmentation on the information content of the information published by the user;

Step S2022, extracting feature words by calculation;

Step S2023, calculating the feature value of each word in the information content of the information published by the user;

Step S2024, identifying false transaction information in the information content published by the user according to the identification model obtained.

Further, identifying false transaction information in the information published by the user according to its picture information specifically includes the following steps:

Step S2031, querying the historical picture library to judge whether the current picture has appeared in the library; if it has, further judging whether the post content is the same and whether the location is the same; if both are different, judging that the user-published information containing this picture is false transaction information; otherwise, judging that the user-published information containing this picture is genuine transaction information;

Or, judging whether the picture carries a watermark; if it does, further judging whether the watermark on the picture is legitimate; if it is not legitimate, judging that the user-published information containing this picture is false transaction information; otherwise, judging that the user-published information containing this picture is genuine transaction information.

The beneficial effects of the present invention are as follows:

The present invention can greatly reduce the amount of false transaction information, improve the authenticity of transaction information, improve the user experience, and at the same time greatly reduce labor costs.

Brief description of the drawings

Fig. 1 is a flow chart of a method for identifying false transaction information in an embodiment of the present invention.

Detailed description of the invention

The present invention is described in further detail below with reference to the accompanying drawing and the embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

As shown in Fig. 1, an embodiment of the present invention relates to a method for identifying false transaction information, including:

Step S101 covers three cases. The first identifies false transaction information from the information features of the information published by the user, that is, on the basis of user features and behavior. The second identifies false transaction information from the information content of the information published by the user, that is, on the basis of the text content of the post. The third identifies false transaction information from picture information.

First, identification of false transaction information based on the information features of user-published information is described. Before the information features of the information published by the user are obtained, the method includes the following steps:

Step S1011, obtaining the basic data of information previously published by users. In this step, the basic features of the published information are extracted by joining data and analyzing users' posting logs. Basic features are data that can be extracted directly from previously published information, for example the user's identity (user ID), posting IP, cookie ID, telephone number, time information (including day of week, month and date), posting duration, page views, refresh count, posting city and posting category. Meta-features are then derived from the basic features. Meta-features are data obtained from the basic features by statistics or calculation, for example the number of posts from the same IP, the number of posting cities for the same IP, the number of posts by the same user, the number of posting cities for the same user, the number of posts with the same cookie, and the number of posting cities for the same cookie. The basic features and the meta-features together serve as the basic data to be mined. For example, a record such as R1 (123123, 192.168.11.11, DFOKIEBNGIDH1232, 18311067654, ...) is produced.
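The derivation of meta-features from basic features can be illustrated with a minimal Python sketch. The field names (`user_id`, `ip`, `city`), the sample records and the helper `meta_features` are hypothetical and chosen only for illustration; the patent does not specify a data layout.

```python
from collections import defaultdict

def meta_features(posts, key):
    """Return {key value: (number of posts, number of distinct posting cities)}."""
    post_count = defaultdict(int)
    cities = defaultdict(set)
    for post in posts:
        post_count[post[key]] += 1
        cities[post[key]].add(post["city"])
    return {k: (post_count[k], len(cities[k])) for k in post_count}

# Hypothetical posting-log records (basic features only).
posts = [
    {"user_id": "123123", "ip": "192.168.11.11", "city": "Beijing"},
    {"user_id": "123123", "ip": "192.168.11.11", "city": "Shanghai"},
    {"user_id": "456456", "ip": "192.168.11.11", "city": "Beijing"},
]

by_ip = meta_features(posts, "ip")         # posts and cities per IP
by_user = meta_features(posts, "user_id")  # posts and cities per user
```

The same helper could be applied to a cookie field to obtain the per-cookie counts mentioned in the step.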

Step S1012, extracting training data from the obtained basic data of the previously published information. In this step, based on the result of step S1011, data that manual review has verified as genuine or false are used as the positive and negative samples: genuine data are positive samples and false data are negative samples. For example, R1 is labeled as a positive sample or a negative sample. The manual review may rely on human judgment based on common knowledge about the information, or the information may be checked and confirmed by means such as a telephone call.

Step S1013, performing feature conversion on the data in the positive and negative samples to obtain data in a set data format. In this step, each feature of every record in the positive and negative samples is defined as one of two types, numeric or enumerated. Numeric means that the data is itself a number; enumerated means that the data is not itself a number, and an enumerated feature is mapped according to its original dimension and value. For example, the user ID and the posting IP are enumerated data, while the posting duration and the number of posting cities for the same user are numeric data. For a numeric feature, the dimension value is unchanged; for example, if a feature value is 20 and its position in the sample is the 10th, then 20 is placed at the 10th position. For an enumerated feature, the md5 (Message Digest Algorithm 5) value of the dimension value is first calculated, and the md5 value is then taken modulo W (for example W=300000), that is, the md5 value is divided by 300000 and the remainder is taken; the value of the enumerated feature thus falls between 1 and 300000. For example, suppose there are two features (telephone number, number of posts with the same phone) with the values (18211078765, 100). The number of posts with the same phone is numeric and the telephone number is enumerated, so the position of the number of posts in the sample is unchanged, while the md5 value of the telephone number is calculated and taken modulo 300000, giving for example 180834. The vector produced for this record is then (0, 100, 0, ..., 1), where the 1 is placed at the 180834th position in the sample, indicating that this position holds a value and that the value is 1.
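The conversion in this step can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `md5_bucket` and `convert_record` are hypothetical, the md5 digest is interpreted as a hexadecimal integer, the remainder is used directly as the position (so it lies in 0 to W−1), and collisions between hashed positions and numeric positions are not resolved beyond letting the numeric value take precedence.

```python
import hashlib

W = 300000  # modulus used in the example of the text

def md5_bucket(value, w=W):
    """Hash an enumerated value with md5 and reduce it modulo w."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % w

def convert_record(numeric, enumerated, w=W):
    """Build a sparse vector {position: value} for one record.

    numeric:    {position: number}, values are kept unchanged
    enumerated: list of strings, each hashed to a position that is set to 1
    """
    vector = {}
    for value in enumerated:
        vector[md5_bucket(value, w)] = 1
    # Numeric values are written last so that, in this sketch, a hashed
    # position that happens to coincide does not overwrite them.
    for position, value in numeric.items():
        vector[position] = value
    return vector

# Telephone number is enumerated; "posts with the same phone" is numeric.
record = convert_record({2: 100}, ["18211078765"])
```

Storing only the non-zero positions avoids materializing a vector with hundreds of thousands of dimensions for every record.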

Step S1014, building a regression model from the data in the set data format. The requirement on the regression model is that the range of the regression result lies in [0,1], or can be mapped into this range by calculation; logistic regression is used as the example below. Step S1013 yields a series of vectors, for example (0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 32, 43, ..., 1, 0, 0, ..., 1, 0, 0, ...). Because these vectors may have 300000 dimensions, representing the data in this form would consume a great deal of memory, so each vector is converted into sparse-matrix form. For example, if the vector above is the first record, its row index is 1 and the corresponding sparse-matrix entries are: 1 10 (the column index) 12, 1 11 32, 1 12 43, and so on. After every record has been converted in this way, the sparse matrix produced above is input to the model training program, and the output is the set of parameters corresponding to each record. This can be understood simply as follows: if a record is (x₁, x₂, x₃, x₄, x₅, ..., x_p), where p is the number of dimensions of the data in the set data format, solving with the model training program produces the corresponding parameters (β₀, β₁, β₂, β₃, β₄, β₅, ..., β_p). The regression model is now built and can be expressed as: P(Y=1|x) = 1 / (1 + e^(-g(x))), where g(x) = β₀ + β₁x₁ + β₂x₂ + … + β_px_p.
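The sparse representation and the evaluation of the model can be sketched in Python as follows. The sketch assumes the standard logistic form P(Y=1|x) = 1 / (1 + e^(-g(x))), which is consistent with the text naming logistic regression as its example; the function names and the parameter values are hypothetical, and the training of β itself is left to whatever model training program is used.

```python
import math

def to_triples(row_index, vector):
    """List the non-zero entries of a dense vector as (row, column, value),
    with 1-based column indices."""
    return [(row_index, col, val)
            for col, val in enumerate(vector, start=1) if val != 0]

def g(beta, x):
    """Linear part g(x) = beta[0] + sum(beta[j] * x[j]); x is {column: value}."""
    return beta[0] + sum(beta[j] * value for j, value in x.items())

def probability_genuine(beta, x):
    """P(Y=1|x) under the assumed logistic form, computed in a stable way."""
    z = g(beta, x)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# First record of the example: zeros in positions 1-9, then 12, 32, 43.
triples = to_triples(1, [0] * 9 + [12, 32, 43])

# Hypothetical parameters: beta[0] is the intercept, beta[j] belongs to x_j.
beta = [0.0] * 13
x = {col: val for _, col, val in triples}
p = probability_genuine(beta, x)
```

With all parameters set to zero, g(x) is 0 and the probability is 0.5, which is a quick sanity check for the logistic form.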

After the regression model has been built, when information published by a user is subsequently received, step S101 is specifically:

Step S1015, obtaining the basic data of the information published by the user, including extracting the basic features of the published information and deriving meta-features; the basic features and the meta-features together serve as the basic data to be mined. The details are the same as in step S1011 and are not repeated here.

After the basic data of the information published by the user has been obtained, step S201 specifically includes the following steps:

Step S2011, performing feature conversion on the obtained basic data of the information published by the user to obtain data in the set data format. This step uses the same method as step S1013 and is not described again.

Step S2012, converting the data in the set data format obtained in step S2011 into sparse-matrix form and identifying false information with the regression model. In this step, once the sparse matrix has been obtained, g(x) can be computed from the (x₁, x₂, x₃, x₄, x₅, ..., x_p) corresponding to the information published by the user, and from it the result P(Y=1|x), i.e. the probability that Y=1. If P > M, then Y=1, indicating that the information published by the user is genuine transaction information; otherwise, if P ≤ M, then Y=0, indicating that the information published by the user is false transaction information. M is a preset threshold.
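The threshold rule of this step can be written as a very small Python sketch. The function name `label_from_probability` and the threshold value 0.5 are assumptions made for illustration; the text only states that M is set in advance.

```python
def label_from_probability(p_genuine, m):
    """Return Y: 1 (genuine) if P > M, otherwise 0 (false).

    Note that P == M falls on the "false" side, as in the rule P <= M.
    """
    return 1 if p_genuine > m else 0

M = 0.5  # hypothetical preset threshold
labels = [label_from_probability(p, M) for p in (0.93, 0.5, 0.15)]
```

Because the model outputs a probability rather than a hard yes or no, the threshold M can be tuned to trade missed false posts against wrongly rejected genuine ones.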

Second, identification of false transaction information based on the information content of user-published information is described. Before the information content of the information published by the user is obtained, the method includes the following steps:

Step S1021, obtaining the information content of information previously published by users and reviewing that content (by manual review or automatic review), and taking the transaction posts that passed review and those that did not as two classes, which serve as the sample data for classification. The positive and negative training samples may be extracted through manual labeling by experts and automatically by algorithms whose accuracy is high (above a set threshold);

Step S1022, performing word segmentation on the information content in the samples; the segmentation quality can be improved by using a custom dictionary. Any existing segmentation method may be used, such as the ICT segmentation method or another segmentation method.

Step S1023, extracting feature words. In this step, stop words, rare words and common words are filtered out of the segmentation result of step S1022, and a method such as CHI (the chi-square test) is then used to select the feature words that are strongly correlated with the class. Specifically, the CHI value of every word is calculated, and the 1000 words with the largest CHI values are taken as the feature words. The chi-square test formula is: χ²(t, c) = N × (A×D − B×C)² / ((A+C) × (B+D) × (A+B) × (C+D)), where A is the number of documents in this class that contain the word; B is the number of documents not in this class that contain the word; C is the number of documents in this class that do not contain the word; D is the number of documents not in this class that do not contain the word; N is the total number of documents; t is the word whose CHI value is being calculated; c is the class; χ² is the chi-square test (CHI) value.
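Feature-word selection by CHI value can be sketched in Python as follows. The sketch assumes the standard chi-square statistic for a word and a class, χ² = N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)), with A, B, C and D counted as documents in or outside the class that do or do not contain the word. The function names and the toy documents are hypothetical; real input would be the segmented and filtered posts.

```python
def chi_square(a, b, c, d):
    """Chi-square (CHI) value from the four document counts A, B, C, D."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

def select_feature_words(docs, labels, target_class, top_p):
    """Return the top_p words with the largest CHI value for target_class.

    docs:   list of sets of words (one set per document)
    labels: class label of each document
    """
    vocabulary = set().union(*docs)
    scores = {}
    for word in vocabulary:
        pairs = list(zip(docs, labels))
        a = sum(1 for doc, y in pairs if y == target_class and word in doc)
        b = sum(1 for doc, y in pairs if y != target_class and word in doc)
        c = sum(1 for doc, y in pairs if y == target_class and word not in doc)
        d = sum(1 for doc, y in pairs if y != target_class and word not in doc)
        scores[word] = chi_square(a, b, c, d)
    # Sort by descending score; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_p]

docs = [{"cheap", "phone"}, {"cheap", "car"}, {"phone", "house"}, {"car", "house"}]
labels = [0, 0, 1, 1]
features = select_feature_words(docs, labels, target_class=0, top_p=2)
```

In this toy data "cheap" occurs only in class 0 and "house" only in class 1, so both are strongly class-dependent, while "phone" and "car" are spread evenly and score zero.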

Step S1024, performing vectorization to obtain the feature value of each feature word in every document of each class. This step uses the TFIDF algorithm: the number of occurrences of each feature word in every document of each class and the number of documents containing the word are counted, and the TFIDF value is used as the feature value. Every document is converted into the format: class ID \t feature index \t feature value. The TFIDF formula is TFIDF = TF × IDF, where TF is the frequency with which a feature word occurs in the document and IDF is the inverse document frequency, i.e. the total number of documents divided by the number of documents containing the word.
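The vectorization can be sketched in Python as follows, using the definition given in the text (IDF as the plain ratio of total documents to documents containing the word, without a logarithm). Taking TF as the word count divided by the document length is an assumption, as are the function names and the toy documents.

```python
from collections import Counter

def tfidf_rows(docs, class_ids, feature_words):
    """Return (class ID, feature index, feature value) for each feature word
    present in each document; feature indices are 1-based."""
    index = {word: i for i, word in enumerate(feature_words, start=1)}
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        for word in set(tokens):
            if word in index:
                doc_freq[word] += 1
    rows = []
    for class_id, tokens in zip(class_ids, docs):
        counts = Counter(tokens)
        for word in feature_words:
            if counts[word] == 0:
                continue
            tf = counts[word] / len(tokens)  # assumed: relative frequency
            idf = n_docs / doc_freq[word]    # as defined in the text
            rows.append((class_id, index[word], tf * idf))
    return rows

def format_row(row):
    """Render one row as: class ID <tab> feature index <tab> feature value."""
    return "\t".join(str(field) for field in row)

docs = [["cheap", "cheap", "phone"], ["phone", "car"]]
rows = tfidf_rows(docs, class_ids=[0, 1], feature_words=["cheap", "phone"])
lines = [format_row(row) for row in rows]
```

Only feature words that actually occur in a document produce a row, which keeps the output sparse.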

Step S1025, obtaining an identification model by training on the feature values of each word in every document of each class. In this step, the feature values are used for training with methods such as SVM (support vector machine), decision trees or Bayesian classification. Step S1024 has already converted every document into vector form; these vectors are trained with the classification program Weka (Waikato Environment for Knowledge Analysis), in which different classification methods such as SVM, decision trees or Bayesian classification can be selected, to produce an identification model. SVM, decision trees and Bayesian classification are existing, mature training methods and are not described in detail here.

After the identification model has been obtained, when information published by a user is subsequently received, step S101 is specifically:

Step S1026, obtaining the information content of the information published by the user; for example, when the user publishes a post, the specific content of the post is obtained.

After the information content of the information published by the user has been obtained, step S201 specifically includes the following steps:

Step S2021, performing word segmentation on the information content of the information published by the user.

Step S2022, extracting feature words. This step uses the same method as step S1023 and is therefore not described in detail.

Step S2023, performing vectorization to obtain the feature value of each word in the information content of the information published by the user. This step uses the same method as step S1024 and is therefore not described in detail.

Step S2024, identifying false transaction information in the information content published by the user according to the identification model obtained. In this step, the identification models obtained by methods such as SVM, decision trees or Bayesian classification are existing, mature models, and their use for identification is likewise existing, mature technology, so this step is not described in detail.

Finally, identification of false transaction information based on picture information is described. After the picture information of the information published by the user has been obtained, identifying false transaction information in the published information according to that picture information (step S201) includes the following steps:

Step S2031, querying the picture library to judge whether the current picture has appeared in the library; if it has, further judging whether the post content is the same and whether the location is the same; if both are different, judging that the user-published information containing this picture is false transaction information; otherwise, judging that the user-published information containing this picture is genuine transaction information. Or, judging whether the picture carries a watermark; if it does, further judging whether the watermark on the picture is legitimate; if it is not legitimate, judging that the user-published information containing this picture is false transaction information; otherwise, judging that the user-published information containing this picture is genuine transaction information.
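The two picture rules can be sketched as decision logic in Python. How pictures are matched against the library and how watermarks are detected and validated is not specified in the text; the sketch assumes that each picture is represented by a precomputed key (for example a hash), and the `Post` type, its field names and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Post:
    picture_key: str   # assumed: an identifier or hash of the picture
    content: str
    location: str

def false_by_library(post: Post, library: Dict[str, Post]) -> bool:
    """Rule 1: the picture was seen before AND both the post content and the
    location differ from the earlier post -> false transaction information."""
    earlier = library.get(post.picture_key)
    if earlier is None:
        return False
    return post.content != earlier.content and post.location != earlier.location

def false_by_watermark(watermark: Optional[str], legitimate: set) -> bool:
    """Rule 2: the picture carries a watermark that is not legitimate."""
    return watermark is not None and watermark not in legitimate

library = {"abc123": Post("abc123", "two-bedroom flat for rent", "Beijing")}
reused = Post("abc123", "used car for sale", "Shanghai")
same_city = Post("abc123", "used car for sale", "Beijing")
```

A post is treated as genuine by a rule whenever that rule does not fire, which mirrors the "otherwise" branches of the step.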

In addition, the three strategies above can also be combined and the judgment made jointly, for example by combining two of the cases or all three. When any one or two of the three cases judges the information published by the user to be false transaction information, the information published by the user is judged to be false transaction information.

As can be seen from the embodiments above, the present invention can greatly reduce the amount of false transaction information, improve the authenticity of transaction information, improve the user experience, and at the same time greatly reduce labor costs.

Although preferred embodiments of the present invention have been disclosed for the purpose of example, those skilled in the art will recognize that various improvements, additions and substitutions are also possible; the scope of the present invention should therefore not be limited to the embodiments above.

Claims

1. A method for identifying false transaction information, characterized in that it includes:

Step S101, obtaining the information features of information published by a user, or obtaining said information features, the information content and the picture information;

Step S201, identifying false transaction information in the information published by the user according to the information features of the published information, or according to said information features, the information content and the picture information;

Before the information features of the information published by the user are obtained, or before said information features, the information content and the picture information are obtained, the method includes the following steps:

Step S1012, extracting training data from the obtained basic data of information previously published by users, and determining positive and negative samples;

Step S1014, building a regression model from the data in the set data format;

Step S1013 specifically includes:

For a numeric feature, the dimension value is unchanged: the value of the numeric data is placed at the position that the numeric data occupies in the sample;

For an enumerated feature, the md5 value of the dimension value is first calculated, and the md5 value is then taken modulo W to obtain a modulo result; the value at the position in the sample given by the modulo result is set to 1.

2. The method for identifying false transaction information as claimed in claim 1, characterized in that step S1014 specifically includes:

Converting the data obtained in step S1013 into a sparse matrix;

Inputting the generated sparse matrix (x₁, x₂, x₃, x₄, x₅, ..., x_p) into a model training program, where p is the number of dimensions of the data in the set data format, and obtaining the parameters (β₀, β₁, β₂, β₃, β₄, β₅, ..., β_p) corresponding to each record;

Building the regression model, which is: P(Y=1|x) = 1 / (1 + e^(-g(x))), where g(x) = β₀ + β₁x₁ + β₂x₂ + ... + β_px_p;

P(Y=1|x) is the probability that Y=1, and Y=1 indicates that the information published by the user is genuine transaction information.

3. The method for identifying false transaction information as claimed in claim 2, characterized in that, after the regression model has been built, when information published by a user is received, obtaining the information features of the information published by the user in step S101 is specifically:

Step S1015, obtaining the basic data of the information published by the user, including extracting the basic features of the published information and deriving meta-features; the basic features and the meta-features together serve as the basic data to be mined.

4. The method for identifying false transaction information as claimed in claim 3, characterized in that, after the basic data of the information published by the user has been obtained, step S201 specifically includes the following steps:

Step S2011, performing feature conversion on the obtained basic data of the information published by the user to obtain data in the set data format;

Step S2012, converting the data in the set data format obtained in step S2011 into sparse-matrix form and identifying false information with the regression model; where, if P(Y=1|x) > M, then Y=1, indicating that the information published by the user is genuine transaction information; otherwise, if P(Y=1|x) ≤ M, then Y=0, indicating that the information published by the user is false transaction information; M is a preset threshold.

5. The method for identifying false transaction information as claimed in claim 1 or 4, characterized in that, before the information content of the information published by the user is obtained, the method includes the following steps:

Step S1021, obtaining the information content of information previously published by users and reviewing it, and dividing the information into two classes, information that passed review and information that did not, as the sample data for classification;

Step S1022, performing word segmentation on the information content in the samples;

Step S1023, extracting feature words by calculation;

Step S1025, obtaining an identification model by training on the feature values of each feature word in every document of each class.

6. The method for identifying false transaction information as claimed in claim 5, characterized in that step S1023 specifically includes:

Calculating the chi-square test (CHI) value of each word; the chi-square test formula is: χ²(t, c) = N × (A×D − B×C)² / ((A+C) × (B+D) × (A+B) × (C+D))

Where A is the number of documents in this class that contain the word; B is the number of documents not in this class that contain the word; C is the number of documents in this class that do not contain the word; D is the number of documents not in this class that do not contain the word; N is the total number of documents; t is the word whose CHI value is being calculated; c is the class; χ² is the chi-square test (CHI) value;

Then taking the P words with the largest CHI values among all words as the feature words;

Step S1024 specifically includes:

Using the TFIDF algorithm to count the occurrences of each feature word in every document of each class and the number of documents containing the word, and using the TFIDF value as the feature value; every document is converted into the format: class ID \t feature index \t feature value; the TFIDF formula is TFIDF = TF × IDF, where TF is the frequency with which a feature word occurs in the document and IDF is the inverse document frequency, i.e. the total number of documents divided by the number of documents containing the word.

7. The method for identifying false transaction information as claimed in claim 6, characterized in that, after the information content of the information published by the user has been obtained, step S201 specifically includes the following steps:

Step S2022, extracting feature words by calculation;

Step S2023, calculating the feature value of each feature word in the information content of the information published by the user;

Step S2024, identifying false transaction information in the information content published by the user according to the identification model obtained.

8. The method for identifying false transaction information as claimed in claim 1, 4 or 7, characterized in that identifying false transaction information in the information published by the user according to its picture information specifically includes the following steps:

Step S2031, querying the picture library to judge whether the current picture has appeared in the library; if it has, further judging whether the post content is the same and whether the location is the same; if both are different, judging that the user-published information containing this picture is false transaction information; otherwise, judging that the user-published information containing this picture is genuine transaction information;

Or, judging whether the picture carries a watermark; if it does, further judging whether the watermark on the picture is legitimate; if it is not legitimate, judging that the user-published information containing this picture is false transaction information; otherwise, judging that the user-published information containing this picture is genuine transaction information.