CN108733664B - File classification method and device - Google Patents
- Publication number
- CN108733664B (application CN201710240448.4A)
- Authority
- CN
- China
- Prior art keywords
- file
- virus
- sample
- killed
- bit array
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention discloses a file classification method and device, which are applied to the technical field of information processing. The file classifying device converts the characteristic information of a file to be classified into a first integer, and then determines a corresponding first storage position according to the first integer and a preset position calculation function. If the numerical value corresponding to the first storage position in a storage space is a first indication value, the device determines that the file to be classified is a file of a certain type, where the first indication value indicates that the storage space stores the characteristic information of the sample file represented by the first storage position. In this way, each storage position in the storage space can represent the characteristic information of one sample file of the certain type, and the indication value at each storage position indicates whether the characteristic information of the corresponding sample file is stored in the storage space. The storage space required for the characteristic information of sample files of the certain type is thereby greatly reduced, and the efficiency of determining the type of the file to be classified is improved.
Description
Technical Field
The invention relates to the technical field of information processing, in particular to a file classification method and device.
Background
In the prior art, when a virus library is stored, some characteristics of the virus samples are usually extracted and stored; specifically, the Message Digest Algorithm 5 (MD5) value of each virus sample is stored on a hard disk of a local device. When a file in the local device needs to be checked for viruses against the virus library, the information of the virus samples on the hard disk needs to be loaded into the memory, and it is then judged whether the file to be checked and killed matches the information of a virus sample.
Generally, when loading the information of the virus samples into the memory, the local device may directly load the MD5 values of the virus library into the memory. However, the MD5 features of a virus library are generally large, so the virus samples cannot be loaded into the memory at one time and the hard disk of the local device has to be read frequently, resulting in a low virus detection speed. For example, the MD5 feature of a virus sample generally requires 32 bytes of storage, so storing 100 million virus samples requires about 2.98 GB of space for the virus library. Alternatively, the local device may build a dictionary tree from the MD5 character strings and then load the dictionary tree into the memory, but initializing the dictionary tree takes a long time, and if the number of MD5 values on the hard disk is large, the memory is likely to be insufficient.
Disclosure of Invention
The embodiment of the invention provides a file classification method and device, which can be used for determining whether a file to be classified is a file of a certain type according to the numerical value of a first storage position corresponding to the characteristic information of the file to be classified in a storage space for storing the characteristic information of the file of the certain type.
The embodiment of the invention provides a file classification method, which comprises the following steps:
acquiring characteristic information of a file to be classified, and converting the characteristic information of the file to be classified into a first integer;
determining a corresponding first storage position according to the first integer and a preset position calculation function;
if the numerical value corresponding to the first storage position in a storage space for storing the characteristic information of a certain type of sample file is a first indication value, determining that the file to be classified is a file of the certain type, wherein the first indication value is used for indicating that the storage space stores the characteristic information of the sample file represented by the first storage position.
An embodiment of the present invention provides a file classifying device, including:
the integer unit is used for acquiring the characteristic information of the file to be classified and converting the characteristic information of the file to be classified into a first integer;
the position determining unit is used for determining a corresponding first storage position according to the first integer and a preset position calculation function;
a first type determining unit, configured to determine that the file to be classified is a file of a certain type if a value corresponding to the first storage position in a storage space in which feature information of sample files of the certain type is stored is a first indication value, where the first indication value is used to indicate that the storage space stores the feature information of the sample file represented by the first storage position.
It can be seen that, in the method of this embodiment, the file classifying device obtains the feature information of the file to be classified, converts the feature information into a first integer, and then determines a corresponding first storage position according to the first integer and a preset position calculation function. If the value corresponding to the first storage position in a storage space storing the feature information of sample files of a certain type is a first indication value, the device determines that the file to be classified is a file of the certain type, where the first indication value indicates that the storage space stores the feature information of the sample file represented by the first storage position. In this way, each storage position in the storage space can represent the feature information of one sample file of the certain type, and the indication value at each storage position indicates whether the feature information of the corresponding sample file is stored in the storage space. The storage space required for the feature information of sample files of the certain type is thereby greatly reduced, the time and space needed to load the feature information of the sample files into the memory when determining the type of the file to be classified are saved, and the efficiency of determining the type of the file to be classified is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a method for classifying documents according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for setting values of storage locations in a storage space according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for storing MD5 characteristics of a virus sample file according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining whether a file to be checked and killed is a virus file in an embodiment of the present invention;
FIG. 5 is a schematic diagram of determining whether a file to be checked and killed is a virus file in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a document classifying device according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another document classifying device according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the invention provides a file classification method, which is mainly applied to scenarios in which a file of unknown type (that is, a file to be classified) is classified, such as determining whether a certain file is a virus file. Specifically, the file classifying device obtains the characteristic information of the file to be classified, converts the characteristic information into a first integer, and determines a corresponding first storage position according to the first integer and a preset position calculation function. If the numerical value corresponding to the first storage position in a storage space storing the characteristic information of sample files of a certain type is a first indication value, the device determines that the file to be classified is a file of the certain type, where the first indication value indicates that the storage space stores the characteristic information of the sample file represented by the first storage position. In this way, each storage position in the storage space can represent the characteristic information of one sample file of the certain type, and the numerical value at each storage position indicates whether the characteristic information of the corresponding sample file is stored in the storage space. The storage space required for the characteristic information of sample files of the certain type is thereby greatly reduced, the time and space needed to load the characteristic information of the sample files into the memory when determining the type of the file to be classified are saved, and the efficiency of determining the type of the file to be classified is improved.
An embodiment of the present invention provides a file classification method, which is mainly executed by a file classifying device. The flowchart is shown in FIG. 1, and the method includes:
Step 101, acquiring characteristic information of a file to be classified, and converting the characteristic information of the file to be classified into a first integer.
Specifically, when the file classifying device acquires the characteristic information of the file to be classified, the MD5 value of the file to be classified may be calculated; when the feature information is converted into the first integer, the Hash value obtained by performing Hash calculation on the feature information is mainly used as the first integer, where the Hash calculation may be any Hash Algorithm, such as a Secure Hash Algorithm (SHA).
The MD5 value of the file to be classified can be used to verify that information has been transmitted completely and consistently, and has the characteristics of compressibility, ease of calculation, and modification resistance (any change to the file changes its MD5 value). MD5 was also designed to be collision resistant, although practical collision attacks against it are now known. MD5 allows large volumes of information to be "compressed" into a fixed-size digest before being signed with a private key by digital signature software; specifically, it transforms a byte string of arbitrary length into a fixed-length hexadecimal digit string.
A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length, and this shorter binary value is called a hash value. A hash value is an extremely compact and, in practice, nearly unique representation of a piece of data: if a piece of plaintext is hashed and even one letter of it is altered, the hash calculation yields a different value. In this embodiment, hash calculation is adopted so that the characteristic information of different files to be classified corresponds to different integers.
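The conversion from a file to its first integer can be sketched as follows. This is only an illustrative sketch: the function names `md5_feature` and `feature_to_integer` are hypothetical, SHA-256 is chosen here merely as one instance of "any hash algorithm", and the final modulo reduction (so that the integer fits a storage space of `num_bits` positions) is an assumption that the text does not spell out.

```python
import hashlib


def md5_feature(data: bytes) -> str:
    """Return the MD5 feature of a file's contents as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def feature_to_integer(feature: str, num_bits: int) -> int:
    """Hash the feature string and reduce it to an integer in [0, num_bits).

    SHA-256 and the modulo reduction are assumptions of this sketch; the
    description only requires that some hash algorithm is used.
    """
    digest = hashlib.sha256(feature.encode("ascii")).digest()
    return int.from_bytes(digest, "big") % num_bits


feature = md5_feature(b"example file contents")
first_integer = feature_to_integer(feature, num_bits=4 * 16)
```

Because the integer is reduced to a bounded range, two different files can in principle map to the same integer, so a real implementation would have to decide how to treat such collisions.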
Step 102, determining a corresponding first storage position according to the first integer and a preset position calculation function.
The preset position calculation function is any function for obtaining a certain storage position through calculation of a certain integer, in this embodiment, the first storage position can be obtained through the first integer and the position calculation function, and the position calculation function is preset in the file classifying device. And the first storage location determined in this embodiment is used to indicate a certain location in the storage space where the feature information of a certain type of sample file is stored.
The specific form of the first storage position and of the preset position calculation function depends mainly on the form in which the characteristic information of the sample files is stored in the storage space. The first storage position may be a first position location coordinate, and the position calculation function may be: using the quotient and the remainder of an integer divided by n as the position location coordinates of that integer in the storage space, where n is an integer greater than 1. Specifically, in this embodiment, the file classifying device may use the quotient and the remainder of the first integer divided by n as the first position location coordinate of the first integer in the storage space.
For example, the storage space stores the characteristic information of the sample files in a 4 × 16 bit array, as shown in Table 1 below. The position calculation function is then: the quotient and the remainder of an integer divided by 16 (n = 16) are used as the position location coordinates of the integer in the storage space. For example, 29 has position location coordinates (1, 13), representing the position of a[1] and bit13.
| bit0 | bit2 | bit3 | bit4 | … | bit12 | bit13 | bit14 | bit15 |
a[0] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a[1] | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
a[2] | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a[3] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
TABLE 1
If an object has only two possible values, it can be represented by a single binary bit. A bit array is an array used to store such objects; internally it is an array of integers, and each bit of each integer represents one object. The bit array shown in Table 1 can represent 4 × 16 objects, each of which represents the feature information of one sample file of a certain type. The two values of an object, 0 and 1, indicate respectively that the feature information of the sample file is not stored in the storage space and that it is stored in the storage space. For example, the value 1 of the object at a[2] and bit2 in Table 1 indicates that the feature information of the corresponding sample file is stored in the storage space, and, according to the position calculation function, the integer corresponding to that feature information is 34.
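The position calculation function and its inverse for the 4 × 16 example can be sketched in a few lines. The helper names `position` and `integer_at` are hypothetical and are introduced only for illustration.

```python
def position(integer: int, n: int) -> tuple[int, int]:
    """Return (row, bit): the quotient and remainder of integer divided by n."""
    return divmod(integer, n)


def integer_at(row: int, bit: int, n: int) -> int:
    """Inverse mapping: recover the integer represented by a position."""
    return row * n + bit


N = 16  # bits per row, as in Table 1
print(position(29, N))      # (1, 13): a[1], bit13
print(integer_at(2, 2, N))  # 34: the integer represented by a[2], bit2
```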
Step 103, judging whether the numerical value corresponding to the first storage position in a storage space for storing the characteristic information of sample files of a certain type is a first indication value. If it is the first indication value, step 104 is executed; if it is not the first indication value but a second indication value, step 105 is executed.
The first indication value is used for indicating that the characteristic information of the sample file represented by the first storage location is stored in the storage space, and specifically may be 1; the second indication value is used to indicate that the characteristic information of the sample file represented by the first storage location is not stored in the storage space, and may be specifically 0.
In other embodiments, the first indication value may be 0 and the second indication value may be 1.
Step 104, determining that the file to be classified is a file of the certain type.
Step 105, determining that the file to be classified is not a file of the certain type.
Therefore, in the method of this embodiment, the feature information of one sample file of a certain type can be represented by each storage location in the storage space, and whether the feature information of the corresponding sample file is stored in the storage space is indicated by a numerical value corresponding to each storage location, so that the storage space of the feature information of the sample file of a certain type is greatly reduced, time and space for loading the feature information of the sample file into the memory in the process of determining the type of the file to be classified are also saved, and the efficiency for determining the type of the file to be classified is improved.
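Steps 102 to 105 can be sketched as a single bit test. This is a minimal sketch, assuming the bit array is held as a Python list of integers (one per row), that the first indication value is 1, and that the first integer already lies within the array; the name `is_of_type` is hypothetical.

```python
def is_of_type(bit_array: list[int], first_integer: int, n: int = 16) -> bool:
    """Return True if the bit at the first storage position holds the value 1."""
    row, bit = divmod(first_integer, n)          # step 102: position calculation
    return ((bit_array[row] >> bit) & 1) == 1    # steps 103 to 105: test the bit


# Table 1 as four 16-bit rows: bits set at (1, 13), (2, 2) and (3, 14).
table_1 = [0, 1 << 13, 1 << 2, 1 << 14]

print(is_of_type(table_1, 29))  # True, 29 maps to a[1], bit13
print(is_of_type(table_1, 30))  # False, a[1], bit14 is 0
```

An integer outside the array would raise an `IndexError` here, so a caller would need to keep the integer within the range of the storage space.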
In a specific embodiment, the file classifying device may store the characteristic information of sample files of a certain type in the storage space through the following steps 201 to 203. The flowchart is shown in FIG. 2 and includes:
Step 201, acquiring characteristic information of a plurality of sample files of the certain type, and converting the characteristic information of the plurality of sample files into a plurality of sample integers, respectively.
Specifically, when the file classifying device acquires the characteristic information of the sample files, it may calculate the MD5 values of a plurality of sample files of a known type respectively; when converting the sample characteristic information into sample integers, it mainly uses the hash value obtained by performing hash calculation on the sample characteristic information as the sample integer, and the hash calculation may use any hash algorithm.
Step 202, determining, according to the position calculation function, the sample storage positions respectively corresponding to the plurality of sample integers.
The position calculation function may use the quotient and the remainder of an integer divided by n as position location coordinates. In this embodiment, the file classifying device may use the quotient and the remainder of each sample integer divided by n as the corresponding sample position location coordinates, and the sample storage position includes the sample position location coordinates.
Step 203, setting the value corresponding to each sample storage position in the storage space to the first indication value, and setting the values corresponding to the other positions to the second indication value.
Therefore, the characteristic information of a sample file of a certain type can be represented by each storage position in the storage space, and whether the characteristic information of the corresponding sample file is stored in the storage space is indicated by the corresponding numerical value of each storage position.
Further, the file classifying device may further continuously update the feature information of a certain type of sample file stored in the storage space, for example, the feature information of the type of sample file is newly added, in this case, it is necessary to determine the latest storage location corresponding to the feature information of the type of newly added sample file according to the location calculation function, and then set the value of the latest storage location in the storage space as the first indication value. The method for determining the first storage location is similar to the method for determining the latest storage location, except that the latest storage location is determined according to the feature information of the newly added sample file, and the first storage location is determined according to the feature information of the file to be classified.
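Populating the storage space and later adding a new sample can be sketched as follows, assuming the sample integers have already been computed and that the indication values are 1 and 0. The function names `build_bit_array` and `add_sample` are hypothetical.

```python
def add_sample(bit_array: list[int], sample_integer: int, n: int) -> None:
    """Set the bit at the storage position of one sample integer to 1."""
    row, bit = divmod(sample_integer, n)
    bit_array[row] |= 1 << bit


def build_bit_array(sample_integers: list[int], rows: int, n: int) -> list[int]:
    """Start with every position at 0, then set each sample position to 1."""
    bit_array = [0] * rows
    for value in sample_integers:
        add_sample(bit_array, value, n)
    return bit_array


samples = build_bit_array([29, 34, 62], rows=4, n=16)
add_sample(samples, 5, n=16)  # a newly added sample file of the same type
```

Updating the library only touches the single bit of the new sample, so existing entries do not need to be rebuilt.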
In the following, a specific embodiment is described as a file classifying method of the present invention, in this embodiment, a file classifying device is a virus searching and killing device, and a file to be classified is a file to be searched and killed, in this embodiment:
(1) The virus searching and killing device or the cloud server stores feature information of the virus sample files, specifically their MD5 features. The flowchart is shown in FIG. 3 and includes:
Specifically, assume that the number of integers obtained in step 301 is SUM. To prevent the integers corresponding to the MD5 features of different virus sample files from colliding, so that the integer corresponding to the MD5 feature of each virus sample file maps to its own position in the bit array, the number of bits in the bit array may be set to m times SUM; in this embodiment, m may be 8.
In a specific embodiment, the bit array may be set as a t × n bit array in which each bit can represent the MD5 feature of one virus sample file and takes the value 1 or 0, where n is 32 and t = (SUM × 8)/32 + 1.
For example, a 10 x 32 bit array may be provided as shown in table 2 below:
| bit0 | bit2 | bit3 | bit4 | … | bit28 | bit29 | bit30 | bit31 |
a[0] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a[1] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a[2] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a[3] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
… | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a[9] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
TABLE 2
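The sizing rule for the bit array can be sketched as below. The sketch assumes that the division in the formula for t is integer division and picks SUM = 36 only because it yields the 10 rows of Table 2; the names `rows_needed` and `empty_bit_array` are hypothetical.

```python
def rows_needed(sample_count: int, m: int = 8, n: int = 32) -> int:
    """t = (SUM * m) / n + 1, using integer division."""
    return (sample_count * m) // n + 1


def empty_bit_array(sample_count: int) -> list[int]:
    """Allocate t rows, each standing for one 32-bit integer, all bits 0."""
    return [0] * rows_needed(sample_count)


print(rows_needed(36))           # 10
print(len(empty_bit_array(36)))  # 10
```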
Specifically, in this embodiment, the position calculation function may be: taking the quotient and the remainder of the integer (NUM) divided by 32 (n = 32) as the position location coordinates of the integer. Specifically, the following formula (1) may be used, where NUM/32 is the integer quotient and NUM%32 is the remainder:
a[NUM/32] | (1 << (NUM % 32))    (1)
In this embodiment, the position location coordinates determined by the virus searching and killing device or the server are a[i] and bit j, where i is an integer from 0 to 9 and j is an integer from 0 to 31.
In step 304, the virus searching and killing device or the server sets to 1 the value at the position in the bit array corresponding to each position location coordinate determined in step 303, to indicate that the MD5 feature of the virus sample file corresponding to that position is stored in the bit array.
For example, the MD5 feature of a certain virus sample file is "25e41a91a6a83f9b400e2ff1fc28a1f9" and the hash value corresponding to the MD5 feature is 45. Through the above step 303, the position location coordinates corresponding to 45 are determined to be a[1] and bit13, and the value of bit13 of a[1] is then set to 1.
For another example, for the hash values 32, 95, 126, 31, and 288 corresponding to the MD5 features of other virus sample files, the virus searching and killing device or the server determines the corresponding position location coordinates to be a[1] and bit0, a[2] and bit31, a[3] and bit30, a[0] and bit31, and a[9] and bit0, respectively, and sets the value at each of these positions to 1, as shown in Table 3 below:
| bit0 | … | bit12 | bit13 | bit14 | … | bit29 | bit30 | bit31 |
a[0] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
a[1] | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
a[2] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
a[3] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
… | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
a[9] | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
TABLE 3
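The contents of Table 3 can be reproduced with a short sketch of formula (1). It assumes the bit array is a list `a` of ten integers used as 32-bit rows and that the result of the OR is written back to `a[NUM/32]`; the helper name `set_bit` is hypothetical.

```python
N = 32        # bits per row
a = [0] * 10  # the 10 x 32 bit array, initially all zeros


def set_bit(a: list[int], num: int) -> None:
    """Apply formula (1) and store the result back into the row."""
    a[num // N] |= 1 << (num % N)


for num in (45, 32, 95, 126, 31, 288):
    set_bit(a, num)

# List every (row, bit) position whose value is 1.
set_positions = [(i, j) for i in range(10) for j in range(N) if (a[i] >> j) & 1]
print(set_positions)
# [(0, 31), (1, 0), (1, 13), (2, 31), (3, 30), (9, 0)]
```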
(2) Determining whether the local file is a virus file, the flowchart is shown in fig. 4, and includes:
Step 401, as shown in FIG. 5, the virus searching and killing device loads the bit array stored on the local hard disk, or the bit array downloaded from the server, into the memory of the virus searching and killing device.
It can be understood that the virus checking and killing device may initiate the process of this embodiment for the local file periodically according to a preset period; or according to the operation of the user on the virus searching and killing device, the flow of the embodiment is initiated for a specific file.
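Loading a stored bit array and checking one file against it might look like the sketch below. Several details are assumptions made only for illustration: the on-disk layout (32-bit little-endian words), the file name `virus_bits.bin`, the use of the MD5 value itself as the integer with a modulo reduction, and all function names.

```python
import hashlib
import os
import tempfile

BITS_PER_WORD = 32


def feature_integer(file_bytes: bytes, total_bits: int) -> int:
    """MD5 feature reduced to a bit index (the modulo step is an assumption)."""
    return int(hashlib.md5(file_bytes).hexdigest(), 16) % total_bits


def save_bit_array(words: list[int], path: str) -> None:
    """Write each row as a 32-bit little-endian word."""
    with open(path, "wb") as f:
        f.write(b"".join(w.to_bytes(4, "little") for w in words))


def load_bit_array(path: str) -> list[int]:
    """Read the whole bit array into memory in one pass."""
    with open(path, "rb") as f:
        raw = f.read()
    return [int.from_bytes(raw[i:i + 4], "little") for i in range(0, len(raw), 4)]


def is_virus(words: list[int], file_bytes: bytes) -> bool:
    """Test the single bit that corresponds to the file's MD5 feature."""
    num = feature_integer(file_bytes, len(words) * BITS_PER_WORD)
    return ((words[num // BITS_PER_WORD] >> (num % BITS_PER_WORD)) & 1) == 1


# Build a tiny library with one known sample, store it, then reload and check.
words = [0] * 10
sample = b"known virus sample"
num = feature_integer(sample, len(words) * BITS_PER_WORD)
words[num // BITS_PER_WORD] |= 1 << (num % BITS_PER_WORD)

path = os.path.join(tempfile.mkdtemp(), "virus_bits.bin")
save_bit_array(words, path)
loaded = load_bit_array(path)
print(is_virus(loaded, sample))  # True
```

With such a small array, an unrelated file can map to the same bit as the known sample, so a positive result in this sketch only means that the corresponding bit is set.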
By the method of the embodiment, the following effects can be achieved:
(1) The storage space for the feature information of the virus sample files is reduced. Compared with directly storing the MD5 feature of a virus file (a 32-byte character string) as in the prior art, in this embodiment a single bit can record an MD5 feature, which is equivalent to reducing the storage space by a factor of 256 (32 × 8).
(2) For a virus sample library with a large data volume, the feature information of the virus sample files can be loaded into the memory at one time. Storing 100 million MD5 features requires 3.2 billion bytes, that is, about 2.98 GB of memory, and many virus searching and killing devices cannot allocate that much memory at one time. With the method of this embodiment, the memory to be allocated is greatly reduced, and the feature information of the virus sample files can be loaded into the memory at one time.
(3) The matching efficiency is improved. In the prior art, the time complexity of matching the MD5 feature of the file to be checked and killed against the MD5 features of the virus sample files is O(n), whereas with the method of this embodiment it is O(1).
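The storage figures above can be checked with a few lines of arithmetic. The sketch assumes 100 million samples, 32 bytes per MD5 string, and a bit array sized with m = 8 bits per sample as in the embodiment; the variable names are illustrative only.

```python
samples = 100_000_000
md5_bytes = samples * 32            # 32-byte MD5 strings
bit_array_bytes = samples * 8 // 8  # m = 8 bits per sample, 8 bits per byte

GIB = 1024 ** 3
MIB = 1024 ** 2
print(f"{md5_bytes / GIB:.2f} GiB")        # 2.98 GiB
print(f"{bit_array_bytes / MIB:.1f} MiB")  # 95.4 MiB
print(md5_bytes // bit_array_bytes)        # 32
```

Under these assumptions the bit array as a whole is 32 times smaller than the MD5 strings; the factor of 256 corresponds to the single bit that records each stored feature.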
An embodiment of the present invention further provides a file classifying device, a schematic structural diagram of which is shown in fig. 6, and the file classifying device may specifically include:
the integer unit 10 is used for acquiring the characteristic information of the files to be classified and converting the characteristic information of the files to be classified into a first integer;
a position determining unit 11, configured to determine a corresponding first storage position according to the first integer obtained by the integer unit 10 and a preset position calculation function;
a first type determining unit 12, configured to determine that the file to be categorized is the file of the certain type if a value corresponding to a first storage location determined by the location determining unit 11 is a first indication value in a storage space in which feature information of a certain type of sample file is stored, where the first indication value is used to indicate that the storage space stores the feature information of the sample file represented by the first storage location.
In this embodiment, the integer unit 10 is specifically configured to use a hash value obtained by performing hash calculation on the feature information of the file to be classified as the first integer. The position determining unit 11 is specifically configured to use the quotient and the remainder of the first integer divided by n as a first position location coordinate of the first integer in the storage space, where the first storage position is the first position location coordinate and n is an integer greater than 1.
Wherein the sample file of the certain type is a virus sample file, the first indication value is 1, and the second indication value is 0.
It can be seen that, in the file classifying device of this embodiment, the integer unit 10 obtains the feature information of the file to be classified and converts the feature information into the first integer, and the position determining unit 11 then determines the corresponding first storage position according to the first integer and the preset position calculation function. If the value corresponding to the first storage position in the storage space storing the feature information of sample files of a certain type is the first indication value, the first type determining unit 12 determines that the file to be classified is a file of the certain type, where the first indication value indicates that the storage space stores the feature information of the sample file represented by the first storage position. In this way, each storage position in the storage space can represent the feature information of one sample file of the certain type, and the numerical value at each storage position indicates whether the feature information of the corresponding sample file is stored in the storage space. The storage space required for the feature information of sample files of the certain type is thereby greatly reduced, the time and space needed to load the feature information of the sample files into the memory when determining the type of the file to be classified are saved, and the efficiency of determining the type of the file to be classified is improved.
Referring to fig. 7, in a specific embodiment, the document classifying apparatus may further include a setting unit 13 and a second type determining unit 14 in addition to the structure shown in fig. 6, wherein:
the second type determining unit 14 is further configured to determine that the file to be categorized is not the file of the certain type if the value corresponding to the first storage location determined by the location determining unit 11 is a second indication value in a storage space in which the feature information of the sample file of the certain type is stored, where the second indication value is used to indicate that the storage space does not store the feature information of the sample file represented by the first storage location.
The integer unit 10 is further configured to obtain feature information of the multiple sample files of the certain type, and convert the feature information of the multiple sample files into multiple sample integers, respectively; the position determining unit 11 is further configured to determine, according to the position calculating function, sample storage positions corresponding to the plurality of sample integers determined by the integer unit 10, respectively; the setting unit 13 is configured to set a value corresponding to the sample storage location determined by the location determining unit 11 in the storage space as the first indication value, and set values corresponding to other locations as the second indication value.
Further, the position determining unit 11 is further configured to determine, according to the position calculation function, the latest storage position corresponding to the feature information of a newly added sample file of the certain type; in this case, the setting unit 13 is further configured to set the value at the latest storage position determined by the position determining unit 11 in the storage space to the first indication value.
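A minimal sketch of how the storage space might be populated and later updated is given below. It assumes that the feature information is a 32-character MD5 hex string, that the position calculation function is a simple remainder modulo the storage size, and that 1 and 0 serve as the first and second indication values. The function names, the storage size of 64, and the hex strings are illustrative only.

```python
def sample_integer(md5_hex: str) -> int:
    """Convert an MD5 hex string (the assumed feature information) to a sample integer."""
    return int(md5_hex, 16)


def build_storage(sample_md5s, size: int) -> bytearray:
    """Mark the sample storage positions with 1; all other positions stay 0."""
    storage = bytearray(size)  # second indication value (0) everywhere
    for md5_hex in sample_md5s:
        storage[sample_integer(md5_hex) % size] = 1  # first indication value
    return storage


def add_sample(storage: bytearray, md5_hex: str) -> int:
    """Mark the latest storage position for a newly added sample and return it."""
    latest = sample_integer(md5_hex) % len(storage)
    storage[latest] = 1
    return latest


# Illustrative feature values for two existing samples and one new sample.
known = ["0cc175b9c0f1b6a831c399e269772661",
         "92eb5ffee6ae2fec3ad71c777531578f"]
storage = build_storage(known, size=64)
latest = add_sample(storage, "4a8a08f09d37b73795649038408b5f33")
print(latest, storage[latest])
```

Adding a sample only sets one more position, so the storage space does not need to be rebuilt when the sample set grows, provided its size stays the same.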
The present invention further provides a terminal device, a schematic structural diagram of which is shown in fig. 8. The terminal device may vary considerably depending on its configuration or performance, and may include one or more Central Processing Units (CPUs) 20 (e.g., one or more processors), a memory 21, and one or more storage media 22 (e.g., one or more mass storage devices) storing application programs 221 or data 222. The memory 21 and the storage medium 22 may be transient storage or persistent storage. The program stored in the storage medium 22 may include one or more modules (not shown), each of which may include a series of instruction operations for the terminal device. Still further, the central processor 20 may be arranged to communicate with the storage medium 22 and to execute, on the terminal device, the series of instruction operations in the storage medium 22.
Specifically, the application program 221 stored in the storage medium 22 includes an application program for classifying files, and the program may include the integer unit 10, the position determining unit 11, the first type determining unit 12, the setting unit 13, and the second type determining unit 14 in the file classifying device, which will not be described in detail herein. Further, the central processor 20 may be configured to communicate with the storage medium 22, and execute a series of operations corresponding to the application program for classifying the files stored in the storage medium 22 on the terminal device.
The terminal device may also include one or more power supplies 23, one or more wired or wireless network interfaces 24, one or more input-output interfaces 25, and/or one or more operating systems 223, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the file classifying means in the above-described method embodiment may be based on the structure of the terminal device shown in fig. 8.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The file classification method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and the core idea of the invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make variations in the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (8)
1. A method for determining whether a file to be checked and killed is a virus file, comprising:
the virus searching and killing equipment or the server obtains MD5 characteristics of a plurality of virus sample files, and carries out hash calculation on the MD5 characteristics of the plurality of virus sample files respectively to obtain corresponding hash values; the hash value corresponding to each virus sample file is an integer; the virus killing device or the server stores the MD5 characteristics of the virus sample file;
the virus searching and killing equipment or the server sets a bit array of t x n bits, the numerical value of each position in the bit array is initialized to 0, and n is m times the number of integers corresponding to the plurality of virus sample files;
the virus searching and killing equipment or the server determines the position positioning coordinates of the hash values corresponding to the MD5 features of the virus sample files according to a preset position calculation function;
the virus searching and killing equipment or the server sets the numerical value of the position corresponding to each determined position location coordinate in the bit array to be 1, and the numerical value is used for representing the MD5 characteristic of the virus sample file corresponding to the position stored in the bit array;
the virus searching and killing equipment judges whether the numerical value of the position, in the bit array loaded in the memory, corresponding to the position positioning coordinate corresponding to the hash value of the MD5 characteristic of the file to be searched and killed is 1; if the numerical value is 1, the file to be searched and killed is determined to be a virus file, and if the numerical value is 0, the file to be searched and killed is determined not to be a virus file.
2. The method of claim 1, wherein the position calculation function is: the position location coordinates comprise quotient values and remainder values of hash values corresponding to the MD5 features of the virus sample file to the n.
3. The method of claim 1, wherein the virus searching and killing device determines whether the value of the position corresponding to the corresponding position positioning coordinate of the file to be searched and killed in the bit array loaded in the memory is 1, further comprising:
the virus searching and killing equipment stores the bit array in a local hard disk of the virus searching and killing equipment or downloads the bit array from the server;
and the virus searching and killing equipment loads the bit array into a memory.
4. An apparatus for determining whether a file to be checked and killed is a virus file, comprising:
the integer unit is used for acquiring MD5 characteristics of a plurality of virus sample files and respectively carrying out hash calculation on the MD5 characteristics of the plurality of virus sample files to obtain corresponding hash values; the hash value corresponding to each virus sample file is an integer; the virus killing device or the server stores the MD5 characteristics of the virus sample file;
the setting unit is used for setting a bit array of t x n bits and initializing the numerical value of each position in the bit array to 0, wherein n is m times the number of integers corresponding to the plurality of virus sample files;
the position determining unit is used for determining the position positioning coordinates of the hash values corresponding to the MD5 features of the virus sample files according to a preset position calculation function;
the setting unit is further configured to set a numerical value of a position in the bit array corresponding to the determined each position location coordinate to 1, so as to indicate that the MD5 feature of the virus sample file corresponding to the position is stored in the bit array;
the first type determining unit is used for judging whether a numerical value of a position corresponding to a position positioning coordinate corresponding to a hash value of MD5 characteristics of a file to be checked and killed in a bit array loaded in a memory is 1 or not, and if the numerical value is 1, determining that the file to be checked and killed is a virus file;
and the second type determining unit is used for determining that the file to be searched and killed is not a virus file if the numerical value is 0.
5. The apparatus of claim 4, wherein the position calculation function is: the position location coordinates comprise quotient values and remainder values of hash values corresponding to the MD5 features of the virus sample file to the n.
6. The apparatus of claim 4, wherein the first type determining unit is further configured to store the bit array in a local hard disk of a virus killing device or download the bit array from a server; and loading the bit array into a memory.
7. A computer-readable storage medium, characterized in that it stores a plurality of computer programs adapted to be loaded by a processor and to carry out the method of determining whether a file to be killed is a virus file according to any one of claims 1 to 3.
8. A terminal device comprising a processor and a memory;
the memory is used for storing a plurality of computer programs, and the computer programs are loaded by the processor and used for executing the method for determining whether the file to be checked and killed is a virus file according to any one of claims 1 to 3; the processor is configured to implement each of the plurality of computer programs.
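The steps recited in claims 1 to 3 can be sketched as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: the hash value is assumed to be the MD5 integer reduced into the range [0, t x n) so that the quotient is always a valid row, and the `BitArray` class, the `coordinates` and `is_virus` helpers, the file name, the values of m and t, and the sample hex strings are all hypothetical.

```python
import os
import tempfile


class BitArray:
    """A t x n bit array packed into bytes, with every bit initialized to 0."""

    def __init__(self, t: int, n: int, data: bytes = b""):
        self.t, self.n = t, n
        size = (t * n + 7) // 8
        self.bits = bytearray(data) if data else bytearray(size)

    def set(self, row: int, col: int) -> None:
        i = row * self.n + col
        self.bits[i // 8] |= 1 << (i % 8)

    def get(self, row: int, col: int) -> int:
        i = row * self.n + col
        return (self.bits[i // 8] >> (i % 8)) & 1


def coordinates(md5_hex: str, t: int, n: int):
    """Return (quotient, remainder) of the hash value with respect to n.

    Assumption: the hash value is the MD5 integer reduced into [0, t * n),
    so that the quotient is always a valid row index.
    """
    hash_value = int(md5_hex, 16) % (t * n)
    return divmod(hash_value, n)


# Illustrative MD5 features of two virus sample files.
samples = ["0cc175b9c0f1b6a831c399e269772661",
           "92eb5ffee6ae2fec3ad71c777531578f"]
m, t = 8, 4           # hypothetical parameters
n = m * len(samples)  # n is m times the number of sample integers

array = BitArray(t, n)
for md5_hex in samples:
    array.set(*coordinates(md5_hex, t, n))

# Store the bit array on the local disk, then load it into memory.
path = os.path.join(tempfile.mkdtemp(), "virus_bits.bin")
with open(path, "wb") as f:
    f.write(array.bits)
with open(path, "rb") as f:
    loaded = BitArray(t, n, f.read())


def is_virus(md5_hex: str) -> bool:
    """Check the bit at the position coordinates of the file's MD5 feature."""
    return loaded.get(*coordinates(md5_hex, t, n)) == 1


print([is_virus(s) for s in samples])
```

With these parameters the whole array occupies t x n / 8 = 8 bytes on disk, compared with 16 bytes for each raw MD5 value, which illustrates the kind of saving the description attributes to the bit array.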
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710240448.4A CN108733664B (en) | 2017-04-13 | 2017-04-13 | File classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710240448.4A CN108733664B (en) | 2017-04-13 | 2017-04-13 | File classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108733664A (en) | 2018-11-02 |
CN108733664B (en) | 2022-05-03 |
Family
ID=63923800
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710240448.4A Active CN108733664B (en) | 2017-04-13 | 2017-04-13 | File classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108733664B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7809752B1 (en) * | 2005-04-14 | 2010-10-05 | AudienceScience Inc. | Representing user behavior information |
CN103067364A (en) * | 2012-12-21 | 2013-04-24 | 华为技术有限公司 | Virus detection method and equipment |
CN103164408A (en) * | 2011-12-09 | 2013-06-19 | 阿里巴巴集团控股有限公司 | Information storage and query method based on vertical search engine and device thereof |
CN103164651A (en) * | 2011-12-15 | 2013-06-19 | 西门子公司 | Device and method for extracting virus file feature code and virus detection system |
WO2014081727A1 (en) * | 2012-11-20 | 2014-05-30 | Denninghoff Karl L | Search and navigation to specific document content |
CN103970722A (en) * | 2014-05-07 | 2014-08-06 | 江苏金智教育信息技术有限公司 | Text content duplicate removal method |
CN104657451A (en) * | 2015-02-05 | 2015-05-27 | 百度在线网络技术(北京)有限公司 | Processing method and processing device for page |
CN105306063A (en) * | 2015-10-12 | 2016-02-03 | 浙江大学 | Optimization and recovery methods for record type data storage space |
CN106487833A (en) * | 2015-08-26 | 2017-03-08 | 北京国双科技有限公司 | The statistical method of isolated user number and device in network monitor |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090012984A1 (en) * | 2007-07-02 | 2009-01-08 | Equivio Ltd. | Method for Organizing Large Numbers of Documents |
CN101777056B (en) * | 2009-12-31 | 2012-01-04 | 成都市华为赛门铁克科技有限公司 | Data storage method and device |
CN103037344B (en) * | 2012-12-06 | 2016-04-20 | 亚信科技(中国)有限公司 | A kind of ticket De-weight method and device |
CN104090895B (en) * | 2013-12-18 | 2015-11-18 | 深圳市腾讯计算机系统有限公司 | Obtain the method for radix, device, server and system |
CN104751055B (en) * | 2013-12-31 | 2017-11-03 | 北京启明星辰信息安全技术有限公司 | A kind of distributed malicious code detecting method, apparatus and system based on texture |
CN105069020B (en) * | 2015-07-14 | 2018-09-21 | 国家信息中心 | Three-dimensional visualization method and system for natural resource data |
CN105183855A (en) * | 2015-09-08 | 2015-12-23 | 浪潮(北京)电子信息产业有限公司 | Information classification method and system |
Non-Patent Citations (3)
Title |
---|
Hash-AV: fast virus signature scanning by cache-resident filters;O. Erdogan 等;《 GLOBECOM "05. IEEE Global Telecommunications Conference, 2005.》;20060123;6-12 * |
一种新型网页篡改检测技术;刘鹏程;《绍兴文理学院学报(自然科学)》;20140928;第34卷(第3期);15-19 * |
基于PE文件结构异常的未知病毒检测;樊震 等;《计算机技术与发展》;20091010;第19卷(第10期);160-163 * |
Also Published As
Publication number | Publication date |
---|---|
CN108733664A (en) | 2018-11-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||