Background
Improving the service quality and ensuring the safe operation of the network in the network management is an important research subject. Network management needs the support of network log data, and a log data acquisition technology becomes an important research content.
The system log is a very critical component, and can record information of hardware, software and system problems in the system, including the system log, the application log and the security log. Originally, the main use-oriented object of logs was software engineers, who examined problems by reading log information, because system log information was critical to determine the root cause of a failure or to narrow down the scope of system attacks. The system log can enable engineers to quickly know all events before a fault or attack occurs, and can be used for checking the reason of the error or searching traces left by an attacker when the attack occurs. Of course, it is also critical to develop a good set of system logging policies for a virtualized environment, as system logs need to be associated with many different external components.
Aiming at the problems of large-batch log monitoring and log data acquisition, at present, several schemes are mainly popular, including fluent, Logstash, Flume, scriber and the like, wherein LogAgent is adopted in the interior of the Alibara, and LogTail is adopted in the Aliskiu cloud. Fluent in these products takes absolute advantage and successfully resides in CNCF (Cloud Native Computing Foundation), and the proposed Unified Logging Layer (Unified Logging Layer) greatly reduces the complexity of the whole log collection and analysis. Fluentd considers that most existing log formats are poorly structured, which benefits from the excellent ability of humans to parse log data, since log data is initially human-oriented, humans being their primary log data consumers. Therefore, the fluent hopes to reduce the complexity of the whole log collection access by unifying the log storage format, supposing that the log data input under the assumption is in M formats, and N kinds of storage are accessed at the rear end of the log collection Agent (Agent), so that each storage system needs to realize the function of analyzing the M kinds of log formats, the total complexity is M × N, and if the log collection Agent unifies the log formats, the total complexity is M + N. This is the core idea of fluent, and its plug-in mechanism is a favorable place. Logstack and fluent are similar to the ELK technology stack and are widely used in the industry.
In consideration of the characteristic of large log data magnitude, an effective system log strategy can effectively help technicians to better organize a log structure and accurately find log information. The system logging strategy can send warning information to the user just when the fault occurs, and help find the problem in the shortest time. Today, a large number of machines process log data day and night for use by offline and online analysis systems to generate readable reports to assist humans in making decisions.
The invention provides an abnormal log monitoring method aiming at monitoring abnormal conditions and acquiring abnormal information data in the running process of an e-government affair system, and aims to accurately find abnormal information generated in the running process of the system when the system generates large service data volume and log information at TB level and is distributed at different nodes.
Disclosure of Invention
In order to make up for the defects of the prior art, the invention provides a simple and efficient abnormal log monitoring method.
The invention is realized by the following technical scheme:
an abnormal log monitoring method is characterized in that: the method for optimizing and improving the network monitoring point deployment algorithm and the log data acquisition method in the log data acquisition specifically comprises the following steps:
firstly, storing all known elements to form a set R, and judging whether an element x exists in the set R by adopting a Bloom Filter (BF) as a probability algorithm;
secondly, separating data into different areas by adopting a double-layer bucket algorithm design idea, and then respectively sequencing elements in each area by adopting a Bit-map method;
and thirdly, searching and storing a set R corresponding to the element x in the document set and the position of the element x in the set R by adopting an inverted index searching method.
In the first step, a Bloom Filter (BF) maps URLs (Uniform Resource locators) of elements in the set R to a certain bit in a binary bit array (bitmap array) by using a Hash table data structure, and if a bit of an element x corresponding to the binary bit array is already set to 1, it indicates that the element x exists in the set R.
Hash has a collision problem, and the values of two URLs obtained by using the same Hash are probably the same. In order to reduce the conflict, in the first step, a plurality of different Hash functions are used to obtain different Hash table data structures to judge whether the element x exists in the set R;
when the Hash table data structure obtained by any Hash function judges that the element x does not exist in the set R, the element x can be determined not to exist in the set R;
when the Hash table data structures obtained by all Hash functions judge that the element x exists in the set R, the element x is determined to exist in the set R.
In the first step, the implementation steps of judging whether the element x exists in the set R by a Bloom Filter (BF) are as follows:
1) a Bloom Filter (BF) uses a binary digit array with m bits to store information, and in an initial state, the binary digit array comprises the bit array with m bits, each bit is 0, namely the elements of the whole binary digit array are all set to be 0;
2) mapping each element in the set R ═ { x1, x2, …, xn } into a range of {1, …, m } using k mutually independent Hash functions (Hash functions), respectively;
3) when judging whether the element x belongs to the set R, k hash values are obtained by using k hash functions for the element x, if the positions of all hashi (x) are all 1(i, k are natural numbers, i is more than or equal to 1 and less than or equal to k), namely k positions are all set to be 1, the element x is considered to be an element in the set R, otherwise, the element x is considered not to be an element in the set R.
In the step 2), when any element y is added to the Bloom Filter (BF), k hash functions are used to obtain k hash values, and then the corresponding bit in the binary bit array is set to 1, that is, the position hashi (y) mapped by the ith hash function is set to 1(i is greater than or equal to 1 and is less than or equal to k).
In the step 2), when one position is set to be 1 for multiple times, only the first setting is valid, and the later settings are all invalid.
In the second step, a large amount of data to be processed is divided for multiple times by adopting a design idea of a double-layer barrel algorithm, the range is determined step by step, and finally data units which can be processed independently are formed; and when the elements need to be sequenced, processing each data unit by adopting a Bit-map method respectively.
In the second step, when n (m, n are natural numbers, m > n) elements in the m elements need to be sorted, the implementation steps of sorting by using a Bit Map algorithm are as follows:
1) opening up the space of m Bit positions by adopting a Bit-map method, and setting the m Bit positions as 0;
2) and traversing n elements needing to be sorted in sequence, and setting the corresponding bit positions of the elements to be 1.
The invention has the beneficial effects that: the abnormal log monitoring method can timely monitor abnormal log information in a log environment with high concurrency and large data volume, effectively collect and classify the log information, assist workers to find problems in the shortest time and ensure normal and stable operation of the system.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Monitoring (Monitoring) and Logging (Logging) are among the most critical infrastructures in large distributed systems. Because of no monitoring, there is no way to know the operation of the service, and there is no Down machine in the cluster, whether the CPU usage and load of the machine are normal, whether the Traffic of the website is normal, and whether the error rate of the service is within a tolerable range. In short, monitoring allows us to know the operation and availability of a website in real time. Therefore, for the detection of the network and the system operation condition, the collection and processing condition of the log are already one of the standards for measuring the stable operation condition of the system.
Bloom Filter (BF) is a space-efficient random data structure that uses bit arrays to represent a set very compactly and to determine whether an element belongs to the set. It is a fast probabilistic algorithm that determines whether a set of elements exists. The Bloom Filter may make a false determination, but does not miss the determination. That is, the Bloom Filter decision element is no longer aggregated, and that is certainly not. If the judgment element exists in the set, the judgment is wrong with a certain probability. Thus, Bloom filters are not suitable for "zero error" applications. And in the application occasion that can tolerate low error rate, the Bloom Filter greatly saves space compared with other common algorithms (such as hash, half-searching). The method has the advantages that the space efficiency and the query time far exceed those of a common algorithm, and the defects of certain misrecognition rate and difficulty in deletion are overcome.
For the problem of network monitoring point deployment, the problem that monitoring points arranged in an original network are not easy to change after the topological structure of a distributed network is expanded is considered, and an increment selection method of the increment network monitoring points is optimized.
The abnormal log monitoring method optimizes and improves a network monitoring point deployment algorithm and a log data acquisition method in log data acquisition, and specifically comprises the following steps:
firstly, storing all known elements to form a set R, and judging whether an element x exists in the set R by adopting a Bloom Filter (BF) as a probability algorithm;
secondly, separating data into different areas by adopting a double-layer bucket algorithm design idea, and then respectively sequencing elements in each area by adopting a Bit-map method;
and thirdly, searching and storing a set R corresponding to the element x in the document set and the position of the element x in the set R by adopting an inverted index searching method.
Calculating whether a certain element x is in a set, firstly, the conceivable method is to store all known elements to form a set R, and then, comparing the element x with the elements in the set R one by one to judge whether the element x exists in the set R; the method can be realized by adopting a data structure such as a linked list. However, as the number of elements in the set R increases, the memory occupied by the elements increases. If tens of millions of different web pages need to be downloaded, the required memory can occupy the memory address space of the whole process. Even if the methods of MD5 and UUID are used to convert URLs into fixed short strings, the memory usage is quite large.
In the first step, a Bloom Filter (BF) maps URLs (Uniform Resource locators) of elements in the set R to a certain bit in a binary bit array (bitmap array) by using a Hash table data structure, and if a bit of an element x corresponding to the binary bit array is already set to 1, it indicates that the element x exists in the set R.
Hash has a collision problem, and the values of two URLs obtained by using the same Hash are probably the same. In order to reduce the conflict, in the first step, a plurality of different Hash functions are used to obtain different Hash table data structures to judge whether the element x exists in the set R;
when the Hash table data structure obtained by any Hash function judges that the element x does not exist in the set R, the element x can be determined not to exist in the set R;
when the Hash table data structures obtained by all Hash functions judge that the element x exists in the set R, the element x is determined to exist in the set R.
In the first step, the implementation steps of judging whether the element x exists in the set R by a Bloom Filter (BF) are as follows:
1) a Bloom Filter (BF) uses a binary digit array with m bits to store information, and in an initial state, the binary digit array comprises the bit array with m bits, each bit is 0, namely the elements of the whole binary digit array are all set to be 0;
2) mapping each element in the set R ═ { x1, x2, …, xn } into a range of {1, …, m } using k mutually independent Hash functions (Hash functions), respectively;
when any element y is added in Bloom Filter (BF), k hash functions are used to obtain k hash values, and then corresponding bits in the binary bit array are set to be 1, namely the position hashi (y) mapped by the ith hash function is set to be 1(i, k are natural numbers, i is more than or equal to 1 and less than or equal to k);
in the step 2), when one position is set to be 1 for multiple times, only the first setting is valid, and the later settings are all invalid.
3) When judging whether the element x belongs to the set R, k hash values are obtained by using k hash functions for the element x, if the positions of all hashi (x) are all 1(i is more than or equal to 1 and less than or equal to k), namely k positions are all set to be 1, the element x is considered to be an element in the set R, otherwise, the element x is not considered to be an element in the set R.
Double-layer buckets are an algorithm design idea. When a pile of large amount of data cannot be processed directly without using a direct addressing table, the data can be divided into small units, and then the small units are processed according to a certain strategy, thereby achieving the purpose. In the second step, a large amount of data to be processed is divided for multiple times by adopting a design idea of a double-layer barrel algorithm, the range is determined step by step, and finally data units which can be processed independently are formed; and when the elements need to be sequenced, processing each data unit by adopting a Bit-map method respectively.
By reducing multiple times, a double layer is only an example, and divide and conquer is the root (only "divide and conquer"). This idea can also be used when it is sometimes necessary to construct a large data with a small range of data, in contrast to the inverse of this.
For example, to find the number of non-repeating integers out of 2.5 million integers, the memory space is not sufficient to accommodate the 2.5 million integers. Just like the pigeon nest principle, when the integer number is 2^32, the 2^32 number can be divided into 2^8 ^ 256 areas (for example, a single file represents an area), then the data is separated into different areas, and then the different areas are processed by using the Bit Map algorithm. The solution can be conveniently realized as long as enough disk space is available.
In the second step, when n (m, n are natural numbers, m > n) elements in the m elements need to be sorted, the implementation steps of sorting by using a Bit Map algorithm are as follows:
1) opening up the space of m Bit positions by adopting a Bit-map method, and setting the m Bit positions as 0;
2) and traversing n elements needing to be sorted in sequence, and setting the corresponding bit positions of the elements to be 1.
For example, 5 elements (4, 7, 2, 5, 3) within 0-7 are to be sorted (assuming there is no repetition of these elements). To represent 8 numbers, only 8 bits (1Bytes) are needed, and first a space of 1Byte is created, and all Bit positions of the space are set to 0.
Then go through these 5 elements, first the first element is 4, then the corresponding position of 4 is 1 (p + (i/8) | (0x01< (i% 8) can be operated in this way), where the operation relates to the case of Big-ending and Little-ending, here the Big-ending is defaulted), because it is from zero, the fifth position is 1 (as shown in fig. 1):
then the second element 7 is processed again, the eighth Bit is set to 1, then the third element is processed again until all elements are processed finally, the corresponding position is 1, and the state of the Bit of the memory at this time is as shown in fig. 1.
The index is a data structure for fast data search, and a hash table, a binary search and a block search can also be regarded as an index, and the index has the value of obtaining the most relevant, the most complete and the deepest data set in a shorter time. Most commonly used indexes are those based on a sequence table, a hash table, or a B + tree. The inverted index (InvertedIndex) is mainly used in the field of information retrieval, and is a most commonly used data index storage structure, and is used to store a document set corresponding to a word in the document set and a position of the word existing in a file, and therefore is also called a reverse index, and corresponding to the reverse index, is a forward index, and the forward index is used to store a word list of each file and record the position thereof, and the following files and corresponding sentence contents are exemplified as follows.
And searching and storing the file corresponding to the element (algorithm, data and mathematics) in the document set and the position of the element x in the file by using an inverted index searching method.
The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.