Disclosure of Invention
The invention provides an internet monitoring anti-cheating method and device, which aim to more accurately identify the abnormality of the click rate or browsing number of network activities and detect the cheating behavior of the internet network activities.
In order to solve the technical problem, the invention provides an internet monitoring anti-cheating method, which comprises the following steps:
A. carrying out data collection on single network activity by using various monitoring schemes to obtain multiple groups of independent monitoring data;
B. matching the multiple groups of independent monitoring data, judging whether the multiple groups of independent monitoring data belong to the same network behavior, and summarizing all the monitoring data judged to belong to the same network behavior into a log record of the network behavior;
C. performing cheating traffic analysis on the log record of each network behavior to obtain an analysis result.
Further, the monitoring schemes include directly embedding codes in a webpage frame where the network behavior occurs, embedding codes in Flash animation or JavaScript scripts in an access page, and installing browser plug-ins or client software on a user machine.
Further, the step of matching the multiple independent sets of monitoring data and determining whether the multiple independent sets of monitoring data belong to the same network behavior includes:
b1, dividing the field of the monitoring data into one or more exact match fields, or dividing the field of the monitoring data into one or more exact match fields and one or more fuzzy match fields;
b2, comparing the independent monitoring data in pairs according to fields;
when the exact match fields are compared, if one or more exact match fields of two groups of independent monitoring data are different, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when the fuzzy match fields are compared, if the difference in one or more fuzzy match fields of two groups of independent monitoring data is larger than the fuzzy threshold of that field, judging that the two groups of independent monitoring data do not belong to the same network behavior;
and when all the accurate matching fields are the same and the difference of all the fuzzy matching fields is smaller than the fuzzy threshold, judging that the two groups of independent monitoring data belong to the same network behavior.
Or
b1, dividing the field of the monitoring data into one or more exact match fields, or dividing the field of the monitoring data into one or more exact match fields and one or more fuzzy match fields;
b2, comparing the independent monitoring data in pairs;
when comparing the accurate matching fields, setting the matching degree of the fields to be 1 when the accurate matching fields of the two groups of independent monitoring data are the same, and setting the matching degree of the fields to be 0 when the accurate matching fields of the two groups of independent monitoring data are different;
when fuzzy matching fields are compared, setting the matching degree of the fields to be a numerical value from 0 to 1 according to the difference of the fuzzy matching fields; adding the matching degrees of all the fuzzy matching fields to obtain a total matching degree;
when one or more matching degrees of the accurate matching fields of the two groups of independent monitoring data are 0, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when the total matching degree of the fuzzy matching field is smaller than the matching threshold, judging that the two groups of independent monitoring data do not belong to the same network behavior;
and when the matching degrees of the exact match fields of the two groups of independent monitoring data are all 1 and the total matching degree of the fuzzy match fields is greater than the matching threshold, judging that the two groups of independent monitoring data belong to the same network behavior.
Further, the exact match field includes an identity ID of the user machine where the network behavior occurs; the fuzzy match field includes a uniform resource locator (URL), the time (Time) at which the network behavior occurs, the IP address (IP) of the user machine from which the network behavior is sent, the browser (Browser) of the user machine where the network behavior occurs, and the operating system (OS) of the user machine where the network behavior occurs.
Further, the cheating traffic analysis comprises: monitoring the degree of mismatch of the same monitoring parameter across the multiple groups of monitoring data in the log record of a network behavior, so as to identify forged data.
Further, the analysis result of step C includes the percentage of the cheating traffic in all log records and the data source of the cheating traffic.
In order to solve the above technical problem, the present invention further provides an internet monitoring anti-cheating device, comprising: a plurality of data acquisition modules, a matching module and an analysis module,
the data acquisition module is used for collecting data of single network activity by using a monitoring scheme to obtain monitoring data;
the matching module is used for matching a plurality of groups of independent monitoring data, judging whether the plurality of groups of independent monitoring data belong to the same network behavior or not, and summarizing all the monitoring data judged to belong to the same network behavior into a log record;
and the analysis module is used for carrying out cheating flow analysis on the log records to obtain an analysis result.
Further, the matching module comprises an accurate matching module, a fuzzy matching module and a judging module;
the accurate matching module is used for comparing accurate matching fields of two groups of independent monitoring data and obtaining an accurate comparison result;
the fuzzy matching module is used for comparing fuzzy matching fields of two groups of independent monitoring data and obtaining a fuzzy comparison result;
and the judging module is used for judging whether the multiple groups of independent monitoring data belong to the same network behavior according to the accurate comparison result and the fuzzy comparison result.
Further, the judgment basis of the judgment module is as follows:
when one or more accurate comparison results are different, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when the difference of one or more fuzzy matching fields is larger than the fuzzy threshold of the field, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when all the accurate matching fields are the same and the difference of all the fuzzy matching fields is smaller than a fuzzy threshold, judging that the two groups of independent monitoring data belong to the same network behavior;
or,
when one or more accurate comparison results are different, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when the total matching degree of the fuzzy comparison result is smaller than a matching threshold value, judging that the two groups of independent monitoring data do not belong to the same network behavior;
and when all the accurate matching fields are the same and the total matching degree of the fuzzy comparison result is greater than the matching threshold, judging that the two groups of independent monitoring data belong to the same network behavior.
Compared with the prior art, the invention obtains a plurality of groups of independent monitoring data by performing several data collections on a single network activity at the same time. The independent monitoring data from the plurality of data sources are matched and compared to obtain a group of log records for each single internet network activity, and abnormalities in network activity behavior are identified by comparing these records, so that network activities involving cheating are identified more accurately.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The embodiment of the invention provides an internet monitoring anti-cheating method, which comprises the following steps:
A. carrying out data collection on single network activity by using various monitoring schemes to obtain multiple groups of independent monitoring data;
B. matching the multiple groups of independent monitoring data, judging whether the multiple groups of independent monitoring data belong to the same network behavior, and summarizing all the monitoring data judged to belong to the same network behavior into a log record;
C. performing cheating traffic analysis on the log records to obtain an analysis result.
The embodiment of the invention further provides an internet monitoring anti-cheating device, comprising: a plurality of data acquisition modules, a matching module and an analysis module,
the data acquisition module is used for collecting data of single network activity by using a monitoring scheme to obtain monitoring data; the monitoring schemes of the data acquisition modules can be the same or different;
the matching module is used for matching a plurality of groups of independent monitoring data, judging whether the plurality of groups of independent monitoring data belong to the same network behavior or not, and summarizing all the monitoring data judged to belong to the same network behavior into a log record;
and the analysis module is used for carrying out cheating flow analysis on the log records to obtain an analysis result.
For each network behavior (such as a browse or a click) of an internet user, a plurality of data acquisition modules are used to collect information about the network behavior. At present, there are many monitoring schemes for monitoring network behavior, including embedding code in the web page frame where the network behavior occurs, embedding code in a Flash animation or JavaScript script in the accessed page, installing a browser plug-in or client software on the user machine, and the like. Different monitoring schemes differ in access rights and in role, and therefore the monitoring data that different data acquisition modules can collect also differ. Each monitoring operation generates a group of monitoring data recording the information related to the current network behavior, and the monitoring data comprises one or more information fields, such as the unified user machine ID, the behavior time, the visited URL, and so on. After the monitoring data are obtained, they are transmitted to a server through the network for storage.
The invention can be applied to the anti-cheating monitoring process of the internet network activities, such as the anti-cheating monitoring of internet advertisement putting and the anti-cheating monitoring of network research, and can also be the anti-cheating monitoring of other types of network activities.
The anti-cheating method and the anti-cheating device adopt various different monitoring schemes to collect data simultaneously, and can also obtain more data from other monitoring data providers. It should be noted that the fields that can be obtained by different monitoring schemes are not exactly the same, for example, the URL address of the web page browsed by the user may not be obtained in the monitoring scheme with lower authority (for example, the scheme of embedding codes in Flash). In addition, there may be differences between the same field of different data sources. For example: when a user accesses a webpage, the running time of different monitoring codes at different positions of the webpage may be different, so that the behavior time recorded in different data acquisition modules may also be different.
Typically, a data provider maintains a unique user machine ID for each internet user in order to identify multiple different network behaviors from the same user. In order to identify the association between the data of different providers, in the multi-data-source anti-cheating system a unified user machine ID needs to be made available to every provider in addition to the provider's own user machine ID. This unified user machine ID can be obtained by having each data provider read the user information from a fixed location in the Cookie (browser Cookie or Flash Cookie). In order to ensure consistency between the Cookie IDs acquired by all the providers, the unified Cookie IDs are assigned and managed centrally by a third-party server.
The uniform Cookie ID enables different data providers to store data without adopting the same batch of servers, and each provider can store own monitoring data by adopting an independent technical scheme.
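As an illustration only, the reading and central assignment of the unified Cookie ID might be sketched as follows. The cookie key name `unified_uid`, the use of a random UUID as the ID format, and both function names are assumptions made for this sketch; the embodiment does not specify them.

```python
import uuid

# Hypothetical fixed location (cookie key) that every provider reads.
UNIFIED_ID_KEY = "unified_uid"


def get_or_assign_unified_id(cookies: dict) -> str:
    """Third-party server side: return the unified ID, assigning one if absent."""
    uid = cookies.get(UNIFIED_ID_KEY)
    if not uid:
        # The ID format is an assumption; any unique value would serve.
        uid = uuid.uuid4().hex
        cookies[UNIFIED_ID_KEY] = uid
    return uid


def read_unified_id(cookies: dict):
    """Provider side: read the unified ID from the fixed location, if present."""
    return cookies.get(UNIFIED_ID_KEY)
```

In this sketch only the third-party server writes the ID, while every data provider merely reads it, which is what keeps the IDs consistent across providers.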
After each data acquisition module collects respective monitoring data, the monitoring data stored in each data acquisition module is summarized to a server to carry out data summarization of multiple data sources:
The data transmission between the data acquisition modules and the summary server can be realized by various technical schemes. In one mode, after each data acquisition module has collected a certain amount of monitoring data, the monitoring data are transmitted together to the summary server through the internet or by other means; in the other mode, whenever a data acquisition module acquires a piece of monitoring data, it directly and synchronously pushes that piece of monitoring data to the summary server.
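The two transmission modes described above can be sketched with a single buffer, under the assumption that a batch size of 1 stands for the synchronous push mode. The class name `Uploader`, its `send` callback and the `batch_size` parameter are hypothetical and serve only to illustrate the idea.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class Uploader:
    """Buffers monitoring data and forwards it to the summary server.

    batch_size == 1 corresponds to pushing every record immediately;
    a larger batch_size corresponds to transmitting accumulated records together.
    """

    def __init__(self, send: Callable[[List[Record]], None], batch_size: int = 1):
        self.send = send              # hypothetical transport to the summary server
        self.batch_size = batch_size
        self.buffer: List[Record] = []

    def collect(self, record: Record) -> None:
        """Store one piece of monitoring data and transmit when the batch is full."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Transmit whatever is currently buffered, if anything."""
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
```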
In consideration of the huge data volume brought by a plurality of data sources, the data on the summarizing server can be stored in a distributed mode to solve the problem of mass data storage. One possible solution is to perform distributed storage according to the monitoring data time: transmitting the monitoring data of all data sources in the same time period (for example, in the same day) to the same server for storage; and transmitting the monitoring data of different time periods to different servers for storage.
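A minimal sketch of time-based distributed storage, assuming one-day periods and a simple round-robin assignment of days to servers. The function name, the server names used in any call, and the choice of UTC days are assumptions of this sketch, not part of the embodiment.

```python
from datetime import datetime, timezone


def storage_server_for(timestamp: float, servers: list) -> str:
    """Pick the storage server for a record from the day on which it was recorded.

    All records of the same (UTC) day map to the same server; the modulo
    assignment of days to servers is an illustrative assumption.
    """
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
    return servers[day.toordinal() % len(servers)]
```

Because the server is derived from the record time alone, monitoring data of all data sources for the same period end up together, which keeps the subsequent matching local to one server.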
The monitoring data generated by the same network behavior in different data acquisition modules are matched, so as to restore as much of the relevant information of the network behavior as possible. Considering that differences in the monitoring means may cause the monitoring data of different data acquisition modules to differ in the same field, in one embodiment of the present invention the fields of the monitoring data are divided into exact match fields and fuzzy match fields. In other embodiments, the exact match field is a required field and the fuzzy match field is optional; that is, the monitoring data must contain the exact match field but need not contain a fuzzy match field.
The exact match field refers to a field such that, if two pieces of monitoring data differ in this field, the two pieces of monitoring data are not considered to describe the same network behavior. One example is the unified user machine ID: since all the data acquisition modules read the same unique unified user machine ID, when the IDs do not match it can be directly concluded that the two pieces of monitoring data cannot have been generated by the same network behavior. In addition to the unified user machine ID, the exact match field may be other fields in other embodiments.
The fuzzy match field refers to a field in which two matching pieces of monitoring data need not be exactly the same, for example the time at which the network behavior occurs. Due to the time consumed by the webpage loading process and the delay of network transmission, the time recorded for the same network behavior in different data acquisition modules may not be completely consistent. This is because different codes, scripts and clients may be triggered at different moments between the opening of the web page and the completion of loading, so the recorded times of the network behavior are not necessarily identical. For this situation, when the monitoring data are matched, the times recorded in two matching pieces of monitoring data are not required to be completely consistent; it is only required that the difference between the two times be within a certain range. In addition to the time at which the network behavior occurs, the fuzzy match field may be other fields in other embodiments.
When the exact match fields are compared, if one or more exact match fields of two groups of independent monitoring data are different, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when the fuzzy match fields are compared, if the difference in one or more fuzzy match fields of two groups of independent monitoring data is larger than the fuzzy threshold, judging that the two groups of independent monitoring data do not belong to the same network behavior;
and when all the accurate matching fields are the same and the difference of all the fuzzy matching fields is smaller than the fuzzy threshold, judging that the two groups of independent monitoring data belong to the same network behavior.
Or
When comparing accurate matching fields, when the accurate matching fields of two groups of independent monitoring data are the same, setting the matching degree of the field to be 1, and when the accurate matching fields of the two groups of independent monitoring data are different, setting the matching degree of the field to be 0;
when fuzzy matching fields are compared, setting the matching degree of the fields to be a numerical value from 0 to 1 according to the difference of the fuzzy matching fields; adding the matching degrees of all the fuzzy matching fields to obtain a total matching degree;
when one or more matching degrees of the accurate matching fields of the two groups of independent monitoring data are 0, judging that the two groups of independent monitoring data do not belong to the same network behavior; that is, as long as any one of the exact match fields is 0, the data will be judged to be different network behavior.
When the total matching degree of the fuzzy matching field is smaller than the matching threshold, judging that the two groups of independent monitoring data do not belong to the same network behavior;
and when the matching degrees of the exact match fields of the two groups of independent monitoring data are all 1 and the total matching degree of the fuzzy match fields is greater than the matching threshold, judging that the two groups of independent monitoring data belong to the same network behavior.
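The two alternative judgment procedures above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the record layout (a dictionary per piece of monitoring data), the field name `uid`, and the function and parameter names are all assumptions. The text does not state what happens when a difference or a total matching degree is exactly equal to its threshold; this sketch treats equality as "not the same behavior".

```python
from typing import Callable, Dict

Record = Dict[str, object]

# Hypothetical field name for the unified user machine ID.
EXACT_FIELDS = ["uid"]


def exact_fields_agree(a: Record, b: Record) -> bool:
    """All exact match fields must be identical."""
    return all(a[f] == b[f] for f in EXACT_FIELDS)


def same_behavior_by_threshold(
    a: Record,
    b: Record,
    diff_funcs: Dict[str, Callable[[object, object], float]],
    fuzzy_thresholds: Dict[str, float],
) -> bool:
    """First alternative: every fuzzy field difference must stay below its threshold."""
    if not exact_fields_agree(a, b):
        return False
    return all(
        diff(a[f], b[f]) < fuzzy_thresholds[f] for f, diff in diff_funcs.items()
    )


def same_behavior_by_score(
    a: Record,
    b: Record,
    score_funcs: Dict[str, Callable[[object, object], float]],
    match_threshold: float,
) -> bool:
    """Second alternative: sum per-field matching degrees (each 0 to 1) and compare."""
    if not exact_fields_agree(a, b):  # matching degree 0 on any exact field
        return False
    total = sum(score(a[f], b[f]) for f, score in score_funcs.items())
    return total > match_threshold
```

The first variant rejects a pair as soon as any single fuzzy field deviates too much, whereas the second lets a strong match on one fuzzy field compensate for a weak match on another.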
In the actual application process, due to limitations such as network reasons and access rights of a specific monitoring scheme, not all data acquisition modules can acquire all relevant information of network behaviors. When the related information is not obtained, a situation that part of fields are empty exists in the monitoring data. For such a case that the field is empty, the fuzzy matching can be adopted to process.
The exact match field comprises an identity ID of the user machine where the network behavior occurs; the fuzzy match field comprises a uniform resource locator (URL), the time (Time) at which the network behavior occurs, the IP address (IP) of the user machine from which the network behavior is sent, the browser (Browser) of the user machine where the network behavior occurs, and the operating system (OS) of the user machine where the network behavior occurs.
The fields listed above are examples only; the exact match fields and the fuzzy match fields may include further parameters and metrics.
And carrying out matching degree calculation on a plurality of different fields, and judging whether the matching is successful by using the final total matching degree. In the scheme of this embodiment, the fields shown in table 1 are used for matching:
TABLE 1
The threshold used by the above features, i.e. the specific numerical value of each score of the matching degree, can be modified according to actual conditions.
For any two pieces of monitoring data, the matching degrees of the two pieces of monitoring data under each rule are added to obtain a total matching degree score. If the total score exceeds a preset matching threshold, the two pieces of monitoring data are considered a successfully matched pair. Specifically, if the unified user machine IDs of the two pieces of monitoring data are different, the total matching degree score is directly set to 0. The unified user machine IDs of two matched pieces of monitoring data must therefore be completely consistent, and in this embodiment only logs having the same unified user machine ID need to be compared with each other.
After the monitoring data are matched, all the monitoring data judged to belong to the same network behavior in all the data acquisition modules are combined into a log record. The log records may be stored in the server, wherein only one field (e.g., uniform user ID) that needs to be completely consistent among all data acquisition modules is reserved, and the other fields are stored by adding tags corresponding to the data acquisition modules.
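A minimal sketch of this merging step, assuming that each piece of monitoring data is a dictionary and that a tag is attached by prefixing the field name with the module label. The `module.field` key format, the field name `uid`, and the function name are hypothetical.

```python
def merge_into_log_record(matched: dict, shared_field: str = "uid") -> dict:
    """Merge matched monitoring data into one log record.

    `matched` maps a data acquisition module label to that module's record.
    The shared field is stored once; every other field is stored under a
    key tagged with the module it came from.
    """
    ids = {record[shared_field] for record in matched.values()}
    if len(ids) != 1:
        raise ValueError("matched records must agree on the shared field")
    log_record = {shared_field: ids.pop()}
    for module, record in matched.items():
        for field, value in record.items():
            if field != shared_field:
                log_record[f"{module}.{field}"] = value
    return log_record
```

Keeping the per-module values side by side, rather than collapsing them into one value, is what later allows the same field to be compared across data sources.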
The matched log records of a plurality of data acquisition modules have two advantages: firstly, various monitoring schemes can better restore the related information of the network behavior; secondly, the same field in the monitoring data of the multi-data acquisition module has a plurality of monitoring results for comparison. Based on the characteristics, the detection of cheating can be carried out by adopting richer rules than a single data source log.
The conventional cheating analysis rule mainly identifies cheating traffic by analyzing the frequency or periodicity of the network behavior of the same user, and the anti-cheating method of multiple data sources can also monitor unmatched fields to identify forged data.
The analysis results include the percentage of the cheating traffic in all log records and the data source of the cheating traffic.
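As a sketch of how such an analysis result might be assembled, assume that each log record has already been labeled either as clean or with the name of the data source in which the cheating traffic was observed. The labeling itself, the function name and the input format are assumptions of this sketch.

```python
from collections import Counter


def summarize_cheating(flags: list):
    """Summarize per-record verdicts.

    `flags` holds one entry per log record: None for a clean record,
    otherwise the name of the data source involved in the cheating traffic.
    Returns the percentage of cheating records and a count per data source.
    """
    cheating = [source for source in flags if source is not None]
    percentage = 100.0 * len(cheating) / len(flags) if flags else 0.0
    return percentage, Counter(cheating)
```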
The cheating traffic analysis includes: monitoring the degree of mismatch of the same monitoring parameter across the multiple groups of monitoring data of the same network behavior in the log record to identify forged data, and/or monitoring the frequency or period of the same network behavior to identify forged data.
a. Cheating by forged data is identified by checking fields that do not match. For example, when an advertisement publisher delivers an advertisement, its own monitoring system records a description of the advertisement space on which the advertisement is delivered. One cheating method is to forge browsing behavior for that advertisement space on other, cheaper advertisement spaces and to shield or modify, by technical means, the URL address acquired by the monitoring system of the advertisement publisher. With a log built from multiple data sources, it is only necessary to obtain from other data sources the monitoring data that can be matched with that of the advertisement publisher's monitoring system, and to compare the real URL in that monitoring data with the advertisement space description, in order to determine whether the advertisement publisher is being deceived by a forged URL.
b. A large number of different rules are designed and combined using rich fields of multiple data sources to overcome the rule limitations of a single data source. For example, if a user has only a few network behaviors in the data of most data sources, and a large number of network behaviors in the data of a specific data source, it indicates that the monitoring code of the data source is being used by a cheater to perform traffic cheating.
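The example rule in item b can be sketched as follows, assuming the number of network behaviors of one user has been counted per data source. Comparing each count with the median over all sources, and the `ratio` and `min_count` parameters, are illustrative choices of this sketch and not values given in the embodiment.

```python
from statistics import median


def suspicious_sources(counts: dict, ratio: float = 10.0, min_count: int = 20) -> list:
    """Flag data sources recording far more behaviors for a user than is typical.

    `counts` maps a data source name to the number of network behaviors
    recorded for one user. A source is flagged when its count is at least
    `min_count` and exceeds `ratio` times the median count.
    """
    typical = max(median(counts.values()), 1)
    return [
        source
        for source, n in counts.items()
        if n >= min_count and n > ratio * typical
    ]
```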
By using the analysis result of the cheating traffic, the monitoring data of each data acquisition module can be analyzed to find the data sources that may have been targeted by cheaters. For example, when a cheater uses disguising techniques that cause a large number of erroneous fields to appear in a data source, or blocks access so that a large number of fields are left blank, the abnormal situation can be effectively identified by comparison with the same fields of other data sources. A corresponding technical solution can then be adopted to overcome the means used to deceive the anti-cheating system, or the cheater can be contacted directly and required to stop cheating and to stop restricting the monitoring system.
Based on the above, the matching module in the anti-cheating device of the embodiment of the invention comprises an accurate matching module, a fuzzy matching module and a judging module;
the accurate matching module is used for comparing accurate matching fields of two groups of independent monitoring data and obtaining an accurate comparison result;
the fuzzy matching module is used for comparing fuzzy matching fields of two groups of independent monitoring data and obtaining a fuzzy comparison result;
and the judging module is used for judging whether the multiple groups of independent monitoring data belong to the same network behavior according to the accurate comparison result and the fuzzy comparison result.
The judgment basis of the judgment module is as follows:
when one or more accurate comparison results are different, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when the difference of one or more fuzzy matching fields is larger than a fuzzy threshold, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when all the accurate matching fields are the same and the difference of all the fuzzy matching fields is smaller than a fuzzy threshold, judging that the two groups of independent monitoring data belong to the same network behavior;
or,
when one or more accurate comparison results are different, judging that the two groups of independent monitoring data do not belong to the same network behavior;
when the total matching degree of the fuzzy comparison result is smaller than the matching threshold, judging that the two groups of independent monitoring data do not belong to the same network behavior;
and when all the exact match fields are the same and the total matching degree of the fuzzy comparison result is greater than the matching threshold, judging that the two groups of independent monitoring data belong to the same network behavior.
Compared with the existing anti-cheating method and device, the method has the advantages that:
(1) through the technical scheme of multiple data sources, more related information of network behaviors can be collected, so that the network behaviors related to cheating can be identified more accurately.
(2) The use of multiple data sources can effectively avoid being targeted by cheaters using means such as counterfeit data, and therefore a more stable effect can be obtained.
(3) By analyzing historical data of each data source, the anti-cheating method and system based on multiple data sources can detect the data sources which are possibly targeted by cheaters, and improve the anti-jamming capability of the data sources by improving a monitoring scheme.
The following description will be made by taking anti-cheating monitoring of internet advertisement delivery as an example with reference to fig. 1, 2 and 3:
A plurality of internet advertisement monitoring systems are used for storing, recording and extracting the relevant information of each network behavior of the user machine represented by each visiting user object (namely, each Cookie);
for each network behavior of each visiting Cookie, the monitoring system records one or more items of information such as the unified unique identifier (ID), the visit time, the visited URL and the browsing behavior of the visiting Cookie. In actual operation, the fields recorded by different monitoring systems may differ.
The information and/or browsing behavior of the Cookie client recorded by the single data collection module is shown in table 2.
TABLE 2
And summarizing the monitoring data of the plurality of internet advertisement data acquisition modules. Examples of the summarized data are shown in table 3.
TABLE 3
And the matching module is used for matching any two pieces of monitoring data in the data to be matched and finding out a plurality of logs belonging to the same network behavior. The matching results in each case are illustrated separately below.
TABLE 4
As shown in Table 4, the unified user machine IDs in the two pieces of monitoring data from different data acquisition modules are not consistent, so the two pieces of monitoring data fail to match and belong to different network behaviors.
TABLE 5
As shown in Table 5, the unified user machine IDs in the two pieces of monitoring data from different data acquisition modules are consistent, so the fuzzy field matching calculation needs to be continued.
According to the formula shown in Table 1, the matching degree of the occurrence time of the network behavior is equal to
1/(number of seconds between the two times + 1) = 1/15993 ≈ 0;
the matching degree of the webpage URL of the network behavior is equal to 0.2;
the matching degree of the browsing behavior is equal to 0.
The total matching degree of the three fuzzy fields is therefore equal to 0.2.
In this embodiment, the preset threshold of the total matching degree is set to be 1, and then the total matching degree between the two pieces of monitoring data is smaller than the threshold, so that the two pieces of monitoring data do not belong to the same network behavior.
TABLE 6
As shown in Table 6, the unified user machine IDs in the two pieces of monitoring data from different data acquisition modules are consistent, so the total matching degree of the fuzzy fields is further calculated.
According to the formula shown in Table 1, the matching degree of the occurrence time of the network behavior is equal to
1/(number of seconds between the two times + 1) = 1/3 ≈ 0.33;
the matching degree of the webpage URL of the network behavior is equal to 0.5;
the matching degree of the browsing behavior is equal to 0.2.
The total matching degree of the fuzzy fields is therefore equal to 1.03.
As described above, in this embodiment, the preset threshold of the total matching degree is set to be 1, and then the total matching degree between the two pieces of monitoring data is greater than the threshold, so that the two pieces of monitoring data belong to the same network behavior.
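The arithmetic of the two examples can be reproduced with a short sketch. The time differences of 15992 seconds and 2 seconds are inferred from the quotients 1/15993 and 1/3 given above; the URL and browsing behavior matching degrees are taken as stated, since the rules of Table 1 that produce them are not reproduced here. The function names are hypothetical.

```python
MATCH_THRESHOLD = 1.0  # preset threshold of the total matching degree in this embodiment


def time_matching_degree(seconds_apart: float) -> float:
    """Matching degree of the occurrence time: 1 / (seconds between the two times + 1)."""
    return 1.0 / (seconds_apart + 1)


def total_matching_degree(seconds_apart: float, url_degree: float, behavior_degree: float) -> float:
    """Sum of the matching degrees of the three fuzzy fields."""
    return time_matching_degree(seconds_apart) + url_degree + behavior_degree


# Example of Table 5: times 15992 s apart, URL degree 0.2, browsing behavior degree 0.
table5_total = total_matching_degree(15992, 0.2, 0.0)

# Example of Table 6: times 2 s apart, URL degree 0.5, browsing behavior degree 0.2.
table6_total = total_matching_degree(2, 0.5, 0.2)

table5_matches = table5_total > MATCH_THRESHOLD
table6_matches = table6_total > MATCH_THRESHOLD
```

With these inputs the first pair falls below the threshold and the second pair exceeds it, in agreement with the conclusions drawn for Table 5 and Table 6.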
Through the matching process, a plurality of pieces of monitoring data of the same network behavior can be matched, and the monitoring data can be used as the log record of the network behavior.
Cheating analysis: the matched logs are analyzed to identify cheating traffic. In the present embodiment, the method of analyzing the log is described by taking as an example the method of checking unmatched fields to identify forged data.
TABLE 7
Table 7 shows two matched pieces of monitoring data from different data acquisition modules. In this example, the customer has set advertisement activity 1 to be placed on a female channel.
Judging only from the monitoring data of data acquisition module 1, which does not monitor the URL, the network behavior shows no abnormality. However, by analyzing the matched monitoring data of data acquisition module 2, the actual URL of the network behavior is found to be of the form sports.bbb.com. From the characteristics of this URL it can be judged that the webpage accessed by the network behavior is a sports channel, whereas advertisement activity 1 in data acquisition module 1 should be placed on a female channel. The network behavior can therefore be suspected of involving the cheating means of disguising exposure of advertisement space A as exposure of advertisement space B.
In this embodiment, on the data collection module 1 with advertisement activity information, the advertisement publisher hides the URL of the exposure page so that the monitoring system cannot check the cheating means. The anti-cheating method of multiple data sources finds the exposed URL from another data acquisition module 2 and successfully identifies the cheating action.
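A minimal sketch of this cross-check, assuming the channel can be inferred from the first label of the host name. The mapping table, the `women` subdomain and the function names are hypothetical; only the `sports.bbb.com` form comes from the example above.

```python
from urllib.parse import urlparse

# Hypothetical mapping from the first host label to a channel name.
CHANNEL_BY_SUBDOMAIN = {"sports": "sports", "women": "female"}


def channel_of(url: str):
    """Infer the channel from the URL's host name; None if it cannot be inferred."""
    host = urlparse(url).hostname or ""
    return CHANNEL_BY_SUBDOMAIN.get(host.split(".")[0])


def is_placement_suspicious(expected_channel: str, observed_url: str) -> bool:
    """True when the URL seen by another data source contradicts the planned channel."""
    actual = channel_of(observed_url)
    return actual is not None and actual != expected_channel
```

For the case of Table 7, the planned channel comes from data acquisition module 1 and the observed URL from data acquisition module 2; a URL whose channel cannot be inferred is left unflagged rather than treated as cheating.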
In practice, the cheating traffic analysis has various modes, and the above example illustrates that a method in a single data source can be adopted on data of multiple data sources, and a method of cross-validation of multiple data sources can be used for more accurately identifying cheating means.
Cheating result feedback: the results of all the data acquisition modules are analyzed to find the data sources targeted by cheaters, so that they can be improved.
In this embodiment, data acquisition module 1 is restricted by the advertisement publisher from obtaining URLs. If this situation is common on data acquisition module 1, feedback can be given for data acquisition module 1, and the party placing the advertisement can be required to remove restrictions of this kind on data acquisition module 1, so that the monitoring quality of data acquisition module 1 is improved.
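Whether such a restriction is common on a data acquisition module could be estimated, for instance, from the share of its records in which the field is blank. The field name `url`, the limit of 0.5 and the function names are assumptions of this sketch.

```python
def blank_rate(records: list, field: str) -> float:
    """Fraction of records in which `field` is missing or empty."""
    if not records:
        return 0.0
    blank = sum(1 for record in records if not record.get(field))
    return blank / len(records)


def modules_needing_feedback(by_module: dict, field: str = "url", max_blank_rate: float = 0.5) -> list:
    """Modules whose blank rate for `field` exceeds the (illustrative) limit.

    `by_module` maps a data acquisition module label to its list of records.
    """
    return [
        module
        for module, records in by_module.items()
        if blank_rate(records, field) > max_blank_rate
    ]
```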
The above embodiments are merely to illustrate the technical solutions of the present invention and not to limit the present invention, and the present invention has been described in detail with reference to the preferred embodiments. It will be understood by those skilled in the art that various modifications and equivalent arrangements may be made without departing from the spirit and scope of the present invention and it should be understood that the present invention is to be covered by the appended claims.