CN112256838B

CN112256838B - Similar domain name searching method and device and electronic equipment

Info

Publication number: CN112256838B
Application number: CN202011232693.9A
Authority: CN
Inventors: 李晓东; 王伟; 彭博韬; 张宁; 杨国强
Original assignee: Fuxi Technology Heze Co ltd; Shandong Fuxi Think Tank Internet Research Institute
Current assignee: Fuxi Technology Heze Co ltd; Shandong Fuxi Think Tank Internet Research Institute
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2024-06-28
Anticipated expiration: 2040-11-06
Also published as: CN112256838A

Abstract

The embodiment of the invention provides a similar domain name searching method, a similar domain name searching device and electronic equipment, wherein the method comprises the following steps: acquiring a domain name to be checked; extracting text features of the domain name to be checked, and vectorizing the text features to obtain feature vectors of the domain name to be checked; encoding the domain name feature vector to be checked, and matching a target domain name feature vector group from a preset full domain name database according to the encoding result, wherein the full domain name database comprises a plurality of domain name feature vector groups classified according to the domain name feature vector encoding result; and calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target feature vector group, and obtaining a similar domain name of the domain name to be checked according to the distance. According to the method, similarity calculation between the domain names is converted into similarity comparison between the feature vectors, and particularly into calculation of the distance between the feature vector of the domain name to be checked and each domain name feature vector in the target domain name feature vector group, so that the calculation difficulty is reduced, and the calculation speed is improved.

Description

Similar domain name searching method and device and electronic equipment

Technical Field

The present invention relates to the field of domain name resolution services, and in particular, to a method and apparatus for searching similar domain names, and an electronic device.

Background

In the current network information age, domain name servers (DNS for short) must resolve billions to hundreds of billions of domain name requests per day, among which there is no shortage of network threats and network attacks. Therefore, to improve the security of network access, large volumes, even massive volumes, of domain name data must be analyzed on the basis of the domain name resolution history. However, a domain name is a character string; domain names vary in length and often contain meaningless words, the IP addresses they resolve to are not fixed, and the similarity between domain names is difficult to define directly. Consequently, directly calculating the similarity between domain names is particularly difficult when analyzing domain name data.

In recent years, neural networks and deep learning techniques have become increasingly widely used, from computer vision to natural language processing to time series prediction. Some researchers have applied them to similarity calculation between domain names. However, because domain name data is often massive, complex and changeable, the usual approaches are very inefficient when calculating similarity over massive domain name data. For example:

Existing scheme I:

DNS query data is collected and preprocessed, and a domain name vocabulary and the sequences of domain names accessed by users are constructed; the preprocessed data is fed into the unsupervised Skip-gram model, the relevant parameters are set, and domain name vectors are trained with the Skip-gram model; the similarity between domain names is then calculated from the domain name vectors, and user behavior preferences are analyzed.

The Skip-gram model used in existing scheme I is a relatively old model with weak feature expressiveness, and is not suited to similarity calculation over today's massive domain name data. Moreover, a Skip-gram model must be trained from scratch before each application; training is time-consuming and labor-intensive on large data sets, so the scheme is not universally applicable. When searching for the most similar domain name vector, the scheme must calculate and compare the similarity against all vectors in the full data set, that is, it must sequentially traverse and scan every vector in the full data set to locate the vector with the minimum distance.

Existing scheme II:

According to DNS server logs, a large amount of raw descriptive information about Internet websites is collected as a website data set, which is preprocessed and manually labeled; a high-dimensional feature vector representation of each website is extracted as input to a deep learning model, a website category label is added to each website, and the labels are converted into category vectors; with the high-dimensional feature vector representation as the model input and the category vector as the model output, an LSTM-based recurrent neural network deep learning model is trained under supervision using the Adam gradient descent optimizer; and a SoftMax regression layer is added after the trained LSTM recurrent neural network to complete the classification algorithm.

Existing scheme II is an end-to-end deep learning classification algorithm that requires labeled training sets; it is not suitable for analyzing large amounts of unlabeled data, which severely limits its application.

Disclosure of Invention

Aiming at the problems existing in the prior art, the embodiment of the invention provides a similar domain name searching method, a similar domain name searching device and electronic equipment.

In a first aspect, an embodiment of the present invention provides a similar domain name searching method, including:

Acquiring a domain name to be checked;

extracting text features of the domain name to be checked, and vectorizing the text features to obtain feature vectors of the domain name to be checked;

Encoding the domain name feature vector to be checked, and matching a target domain name feature vector group from a preset full domain name database according to the encoding result, wherein the full domain name database comprises a plurality of domain name feature vector groups classified according to the domain name feature vector encoding result;

And calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target feature vector group, and obtaining a similar domain name of the domain name to be checked according to the distance.

Further, the calculating the distance between the feature vector of the domain name to be checked and each feature vector of the domain name in the target feature vector group, and obtaining a similar domain name of the domain name to be checked according to the distance specifically includes:

calculating the Euclidean distance between the domain name feature vector to be checked and each domain name feature vector in the target feature vector group, wherein the Euclidean distance refers to the square root of the sum of squares of the multidimensional coordinate differences of corresponding points between the domain name feature vector to be checked and one domain name feature vector;

and obtaining a domain name corresponding to the domain name feature vector with the shortest Euclidean distance, and taking the domain name as a similar domain name of the domain name to be checked.

Further, before encoding the domain name feature vector to be checked and matching the target domain name feature vector group from a preset full domain name database according to the encoding result, the method further comprises: the step of constructing the full-scale domain name database specifically comprises the following steps:

acquiring all historical domain names in a historical domain name resolution database;

extracting text features of all the historical domain names, and vectorizing the text features of all the historical domain names to obtain a plurality of domain name feature vectors, wherein each domain name feature vector corresponds to each historical domain name one by one;

And encoding the domain name feature vectors, classifying the domain name feature vectors into a plurality of domain name feature vector groups according to different encoding results, and constructing an adaptive tree index for the domain name feature vector groups, wherein each domain name feature vector group comprises at least one domain name feature vector.

Further, extracting text features of all the historical domain names, and vectorizing the text features of all the historical domain names, wherein the text features are performed through a BERT domain name embedding algorithm;

or by WORD2VEC domain name embedding algorithm;

Or by the GloVe domain name embedding algorithm.

Further, the encoding the plurality of domain name feature vectors includes:

Performing abstraction processing and encoding processing on the plurality of domain name feature vectors by an iSAX encoding method.

Further, the tree index adopts a single-threaded vector indexing method;

or, a multithreading parallel vector index method is adopted;

Or, a memory type vector indexing method is adopted.

Further, the similar domain name searching method further comprises the following steps:

And analyzing and judging whether the similar domain name is a malicious domain name or not by a malicious domain name vector reverse query method or an index-based KNN similarity search method.

In a second aspect, an embodiment of the present invention provides a similar domain name searching apparatus, including:

The acquisition module is used for acquiring the domain name to be checked;

The vectorization module is used for extracting text features of the domain name to be checked, vectorizing the text features and obtaining feature vectors of the domain name to be checked;

The encoding module is used for encoding the domain name feature vector to be checked and matching out a target domain name feature vector group from a preset full domain name database according to the encoding result, wherein the full domain name database comprises a plurality of domain name feature vector groups classified according to the domain name feature vector encoding result;

And the similarity searching module is used for calculating the distance between the domain name feature vector to be searched and each domain name feature vector in the target feature vector group, and obtaining a similar domain name of the domain name to be searched according to the distance.

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements a similar domain name searching method as any one of the above when executing the computer program.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a similar domain name lookup method as described in any of the above.

The embodiments of the invention provide a similar domain name searching method, a device and electronic equipment. The method vectorizes and encodes the domain name to be checked, matches a target domain name feature vector group from a preset full domain name database according to the encoding result, calculates the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group, and determines the similar domain name of the domain name to be checked from the distance calculation results. This avoids searching for similar domain names by directly calculating the similarity between domain names: the similarity calculation between domain names is converted into a similarity comparison between feature vectors, specifically into calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group. The calculation difficulty is thereby greatly reduced and the calculation speed markedly improved, and the method can also be applied to similarity search over massive domain name data, effectively improving the efficiency of calculation and search.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will make a brief description of the drawings required for the embodiments, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flow chart of a first similar domain name searching method according to an embodiment of the present invention;

Fig. 2 is a flow chart of a second similar domain name searching method according to an embodiment of the present invention;

Fig. 3 is a flow chart of a third similar domain name searching method according to an embodiment of the present invention;

Fig. 4 is a schematic application diagram of a third similar domain name searching method according to an embodiment of the present invention;

Fig. 5 is a flow chart of a fourth similar domain name searching method according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a similar domain name searching device according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Reference numerals:

601: an acquisition module; 602: a vectorization module; 603: a coding module; 604: a similarity searching module;

701: a processor; 702: a communication interface; 703: a memory; 704: a communication bus.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention provides a similar domain name searching method, a similar domain name searching device and electronic equipment, wherein the method comprises the following steps:

Acquiring a domain name to be checked;

In this similar domain name searching method, the domain name to be checked is vectorized and encoded, a target domain name feature vector group is matched from the preset full domain name database according to the encoding result, the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group is calculated, and the similar domain name of the domain name to be checked is determined from the distance calculation results. This avoids searching for similar domain names by directly calculating the similarity between domain names: the similarity calculation between domain names is converted into a similarity comparison between feature vectors, specifically into calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group. The calculation difficulty is thereby greatly reduced and the calculation speed markedly improved, and the method can be applied to similarity search over massive domain name data, effectively improving the efficiency of calculation and search.

The following describes the similar domain name searching method provided by the embodiment of the invention in detail by referring to the attached drawings.

In a first aspect, an embodiment of the present invention provides a similar domain name searching method, and fig. 1 is a schematic flow chart of a first similar domain name searching method provided in the embodiment of the present invention, as shown in fig. 1, where the method includes:

S101, acquiring a domain name to be checked;

To acquire the domain name to be checked, a DNS server receives a query request entered by a user and determines the domain name to be checked from the user's query request.

The query request may ask for similar domain names of multiple domain names. The multiple domain names may be taken as the domain name to be checked one at a time, in a certain order, and searched in turn. Alternatively, they may all be taken as domain names to be checked simultaneously, and the similar domain name searches performed synchronously and in an interleaved manner using a cross-search method.

The domain name to be queried is typically a new domain name that has not been resolved by the DNS server. Of course, the domain name to be checked may also be a domain name already resolved by the DNS server, i.e. a certain already resolved historical domain name stored in the historical domain name resolution database.

S102, extracting text features of the domain name to be checked, and vectorizing the text features to obtain feature vectors of the domain name to be checked;

The text features of the domain name to be checked obtained in step S101 are extracted by a domain name embedding algorithm, and the extracted text features are vectorized to obtain the domain name feature vector to be checked; the domain name feature vector to be checked and the domain name to be checked are in a one-to-one data mapping relation;

The domain name embedding algorithm (also called a word embedding algorithm, or embedding) is a collective term for language modeling and representation learning techniques in natural language processing (NLP). Specifically, a high-dimensional character space whose dimension equals the number of all words is embedded into a continuous vector space of much lower dimension, and each word or phrase is represented as a vector over the real numbers. In this embodiment, the domain name embedding algorithm may be a BERT domain name embedding algorithm, a WORD2VEC domain name embedding algorithm, or a GloVe domain name embedding algorithm. Domain name embedding approaches generally include artificial neural networks, dimensionality reduction of the word co-occurrence matrix, probabilistic models, and explicit representation of words in terms of the contexts in which they appear. Such embedding algorithms can greatly improve the performance of NLP tasks such as syntactic parsing and text sentiment analysis, so the domain name to be checked can be accurately vectorized to obtain a feature vector that maps fully onto it. The BERT domain name embedding algorithm is used below as an example:

The BERT domain name embedding algorithm is a domain name embedding algorithm based on the BERT model. BERT (Bidirectional Encoder Representations from Transformers) is an NLP pre-training technique that has been applied effectively to, and validated on, many typical natural language processing tasks. The BERT domain name embedding algorithm can therefore be applied directly in this embodiment, as follows:

1) Splitting the domain name to be checked into segments at the dots, treating the segments as text data, and inputting them into the BERT word embedding model;

2) Performing, inside the model, the Transformer-based bidirectionally encoded word embedding operation of the BERT domain name embedding algorithm on the text data, thereby extracting the text features of the domain name to be checked;

3) Vectorizing the extracted text features to obtain the domain name feature vector to be checked, and outputting it; the domain name feature vector to be checked and the domain name to be checked are in a one-to-one data mapping relation.
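Steps 1) to 3) can be sketched as follows. This is a minimal illustration, not the BERT model itself: `split_domain` performs the dot-based splitting of step 1), while `toy_embed` is a hypothetical stand-in that hashes character trigrams into a small fixed-length vector so that the example runs without any model weights. The function names, the 8-dimensional size and the trigram hashing are all assumptions made for illustration; a real deployment would replace `toy_embed` with the output of a pre-trained BERT encoder.

```python
from __future__ import annotations

import hashlib
import math


def split_domain(domain: str) -> list[str]:
    """Step 1): split a domain name into segments at the dots."""
    return [label for label in domain.lower().strip(".").split(".") if label]


def toy_embed(domain: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for steps 2) and 3), NOT a BERT embedding.

    Each segment is cut into character trigrams, every trigram is hashed
    into one of `dim` buckets, and the bucket counts are L2-normalised.
    """
    vector = [0.0] * dim
    for label in split_domain(domain):
        padded = f"^{label}$"  # mark the start and end of the segment
        for i in range(len(padded) - 2):
            trigram = padded[i:i + 3]
            digest = hashlib.md5(trigram.encode("utf-8")).digest()
            vector[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector
```

The same domain name always yields the same vector, which is the one-to-one mapping property that step 3) relies on.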

Through this BERT-based processing, the character space of the domain name to be checked is reduced to a standardized, continuous vector space of much lower dimension: the domain name to be checked is converted into a domain name feature vector that has a data mapping relation with it, and its text features are preserved in the lower-dimensional space. Note that the domain name feature vector to be checked is still a high-dimensional vector; only the dimension of the space in which it lies has been reduced. In other words, the BERT-based processing converts the domain name character string into a high-dimensional vector.

S103, encoding the domain name feature vector to be checked, and matching a target domain name feature vector group from a preset full domain name database according to the encoding result, wherein the full domain name database comprises a plurality of domain name feature vector groups classified according to the domain name feature vector encoding result;

The domain name feature vector to be checked obtained in step S102 is encoded. Specifically, abstraction processing and encoding processing are performed on it by an iSAX encoding method, such as the iSAX representation, to obtain an encoding result for the domain name feature vector to be checked, and a target domain name feature vector group is matched from the preset full domain name database according to that encoding result.

The full domain name database comprises a plurality of domain name feature vector groups classified according to the encoding results of the domain name feature vectors. Specifically, when the full domain name database is constructed in advance, abstraction processing and encoding processing are performed on the domain name feature vectors in the database to obtain their encoding results; the distribution characteristics of all domain name feature vectors currently in the database are then analyzed from these encoding results, and the domain name feature vectors are classified by distribution characteristic into a plurality of domain name feature vector groups, each of which contains one or more domain name feature vectors.
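The encoding and grouping described above can be illustrated with a small sketch. It implements plain SAX at a fixed cardinality (z-normalization, piecewise aggregate approximation, then quantization against standard normal breakpoints), which is the representation that iSAX builds on; the variable per-segment cardinality that distinguishes iSAX is omitted. The function names, the 4 segments and the 4 symbols are assumptions chosen for illustration, and whether the patented method z-normalizes its vectors is not stated in the source.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist, mean, pstdev


def paa(values, segments):
    """Piecewise aggregate approximation: the mean of each of `segments` slices.

    Assumes segments <= len(values) so that no slice is empty.
    """
    n = len(values)
    return [mean(values[i * n // segments:(i + 1) * n // segments])
            for i in range(segments)]


def sax_word(vector, segments=4, cardinality=4):
    """Summarise a vector as a tuple of `segments` symbols in 0..cardinality-1."""
    mu, sigma = mean(vector), pstdev(vector)
    z = [(v - mu) / sigma if sigma else 0.0 for v in vector]
    # Breakpoints split the standard normal curve into equal-probability bands.
    breakpoints = [NormalDist().inv_cdf(k / cardinality)
                   for k in range(1, cardinality)]
    return tuple(bisect_right(breakpoints, s) for s in paa(z, segments))


def group_by_code(named_vectors):
    """Group domain names whose feature vectors share the same code word."""
    groups = defaultdict(list)
    for domain, vector in named_vectors.items():
        groups[sax_word(vector)].append(domain)
    return dict(groups)
```

Vectors with the same shape after normalization receive the same code word and therefore land in the same group, which is what lets a query be compared against one group instead of the whole database.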

Matching the target domain name feature vector group from the preset full domain name database according to the encoding result of the domain name feature vector to be checked relies on an adaptive tree index mechanism established in advance over the domain name feature vector groups. That is, using the encoding result of the domain name feature vector to be checked, the adaptive tree index is searched for the domain name feature vector group with the same encoding result, and that group is taken as the target domain name feature vector group matched to the domain name feature vector to be checked.

In this embodiment, the similarity calculation and comparison between vectors does not require sequentially traversing and scanning all vectors in the full data set. Only the one domain name feature vector group that matches the domain name feature vector to be checked needs to be found; the other groups in the full domain name database are unaffected and take no part in the calculation and comparison, which greatly reduces the overall amount of computation and improves calculation efficiency and speed. Through the adaptive tree index mechanism, the entire search space of the full domain name database is divided into a number of subspaces of different sizes that are linked together by the adaptive tree index, so that an index search request can quickly and efficiently find the matching domain name feature vector group within a subset of the subspaces, which speeds up the search response and shortens the response time.
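The lookup by encoding result can be sketched with a simple tree keyed on the symbols of the code word. This is a deliberately simplified, fixed-depth tree and not the adaptive index of the embodiment: it does not split or rebalance nodes as groups grow. The class name `CodeTree` and its methods are hypothetical and exist only for this illustration.

```python
class CodeTree:
    """Fixed-depth tree: each level branches on one symbol of the code word.

    The node reached after consuming the whole code word holds the
    (domain, vector) pairs of one domain name feature vector group.
    """

    def __init__(self):
        self.children = {}
        self.members = []

    def insert(self, code, domain, vector):
        node = self
        for symbol in code:
            node = node.children.setdefault(symbol, CodeTree())
        node.members.append((domain, vector))

    def lookup(self, code):
        """Return the group stored under `code`, or an empty list if none."""
        node = self
        for symbol in code:
            node = node.children.get(symbol)
            if node is None:
                return []
        return node.members
```

A lookup touches at most one node per symbol of the code word, so its cost depends on the code length rather than on the number of vectors stored in the database.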

S104, calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target feature vector group, and obtaining a similar domain name of the domain name to be checked according to the distance.

And calculating the distance between the feature vector of the domain name to be checked obtained in the step S102 and each domain name feature vector in the target feature vector group matched in the step S103, and obtaining a similar domain name of the domain name to be checked according to the distance. Namely, the similarity between the domain name feature vector to be checked and each domain name feature vector in the group is compared by calculating the distance, one or more domain name feature vectors which are most similar to the domain name feature vector to be checked are judged according to the distance in each calculation result, and one or more similar domain names of the domain name to be checked are determined according to the most similar one or more domain name feature vectors.

In the similar domain name searching method provided by this embodiment of the invention, the domain name to be checked is represented in vectorized form as the domain name feature vector to be checked, a target domain name feature vector group is matched from the preset full domain name database according to the encoding result, the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group is calculated, and the similar domain name of the domain name to be checked is determined from the distances between the vectors. This avoids searching for similar domain names by directly calculating the similarity between domain names: the similarity calculation between domain names is converted into a similarity comparison between feature vectors, specifically into calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group, so that the calculation difficulty is greatly reduced and the calculation speed markedly improved.

Fig. 2 is a flow chart of a second similar domain name searching method provided by the embodiment of the present invention, as shown in fig. 2, based on the embodiment shown in fig. 1, step S104 is to calculate a distance between the feature vector of the domain name to be searched and each domain name feature vector in the target feature vector group, and obtain a similar domain name of the domain name to be searched according to the distance, and specifically includes:

S1041, calculating the Euclidean distance between the domain name feature vector to be checked and each domain name feature vector in the target feature vector group, wherein the Euclidean distance refers to the square root of the sum of squares of the multidimensional coordinate differences of corresponding points between the domain name feature vector to be checked and one domain name feature vector;

The Euclidean distance (also called the Euclidean metric, abbreviated ED) is calculated between the domain name feature vector to be checked obtained in step S102 and each domain name feature vector in the target feature vector group matched in step S103. The Euclidean distance is the square root of the sum of the squared coordinate differences, over all dimensions, between corresponding points of the domain name feature vector to be checked and a domain name feature vector, and is calculated as follows:

$$d = \sqrt{\sum_{i=1}^{n} (x_{i1} - x_{i2})^2}$$

where i indexes the dimensions of the space in which the feature vectors lie, i = 1, 2, …, n; $x_{i1}$ is the i-th coordinate of the first point; and $x_{i2}$ is the i-th coordinate of the second point;
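The formula translates directly into code. The sketch below is a minimal illustration; the function name `euclidean_distance` is an assumption, and the length check is added only to make mismatched inputs fail loudly.

```python
import math


def euclidean_distance(first, second):
    """Square root of the sum of squared per-dimension coordinate differences."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)))
```

For example, the points (0, 0) and (3, 4) differ by 3 and 4 in the two dimensions, so the distance is the square root of 9 + 16, which is 5.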

S1042, obtaining the domain name corresponding to the domain name feature vector with the shortest Euclidean distance, and taking the domain name as the similar domain name of the domain name to be checked.

The calculated Euclidean distances are used to compare the similarity between the domain name feature vector to be checked and each domain name feature vector in the group. The one or more domain name feature vectors with the smallest Euclidean distance are selected as those most similar to the domain name feature vector to be checked, and the one or more similar domain names of the domain name to be checked are determined from them.
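Selecting the one or more closest vectors within the matched group can be sketched as below. The function name `nearest_domains`, the `(domain, vector)` pair layout of the group and the parameter `k` are assumptions made for illustration; `math.dist` from the Python standard library computes the Euclidean distance.

```python
import heapq
import math


def nearest_domains(query_vector, group, k=1):
    """Return the k (domain, distance) pairs closest to the query vector.

    `group` is an iterable of (domain, vector) pairs, i.e. one matched
    domain name feature vector group.
    """
    scored = ((math.dist(query_vector, vector), domain)
              for domain, vector in group)
    return [(domain, distance)
            for distance, domain in heapq.nsmallest(k, scored)]
```

With `k=1` this corresponds to step S1042, which keeps only the domain name at the shortest distance; a larger `k` returns several similar domain names in order of increasing distance.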

The method and the device effectively convert the comparison of the similarity between the domain names into the comparison of the distance between the vectors, so that the comparison calculation process is more accurate, and the similar domain names of the domain names to be searched can be quickly searched accordingly.

On the basis of the embodiment shown in Fig. 1 or the embodiment shown in Fig. 2, before step S103 encodes the domain name feature vector to be checked and matches the target domain name feature vector group from the preset full domain name database according to the encoding result, the method further includes S201, a step of constructing the full domain name database. Taking the embodiment shown in Fig. 2 as the preferred example, Fig. 3 is a schematic flow chart of a third similar domain name searching method provided by an embodiment of the present invention. As shown in Fig. 3, step S201 precedes step S103 and specifically includes:

S2011, acquiring all historical domain names in a historical domain name resolution database;

All the historical domain names stored in the historical domain name resolution database are acquired by traversing and scanning the entire database, that is, by parsing all of the DNS server resolution data held in it. The IP address to which each historical domain name was most recently resolved can be obtained as its latest resolved IP address, together with its historical resolution frequency information. The latest resolved IP address, the historical resolution frequency information and other data of each historical domain name can later be used when judging whether a domain name is malicious, which is not described in detail here.

In addition, the domain name data in the constructed full domain name database needs to be updated in real time; that is, new domain name records need to be inserted as the historical domain name resolution database is traversed and scanned. Specifically, a timed task first reads all of the resolution data files added to the historical domain name resolution database each day; the newly added domain names are then distributed, one resolution record at a time, to a distributed processing platform through a Kafka message queue; finally, the distributed processing platform inserts the new domain names and their resolution data into the full domain name database. Kafka is a large-scale publish/subscribe message queue designed as a distributed transaction log. In this way, all historical domain name data can be kept up to date, which enriches the data on which the similar domain name search is based.
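
The record-by-record update flow described above can be sketched as follows. This is only an illustration of the producer/consumer pattern: the standard-library `queue.Queue` stands in for the Kafka message queue, a plain dictionary stands in for the full domain name database, and the record layout, function names and IP addresses are all hypothetical.

```python
import queue


def publish_records(records, message_queue):
    """Producer: push each resolution record onto the queue, one per message."""
    for record in records:
        message_queue.put(record)
    message_queue.put(None)  # sentinel: no more records in this batch


def consume_into_database(message_queue, database):
    """Consumer: upsert each record; return how many new domains were inserted."""
    inserted = 0
    while True:
        record = message_queue.get()
        if record is None:
            break
        domain, ip = record
        entry = database.get(domain)
        if entry is None:
            # New domain name: insert it with its first resolution record.
            database[domain] = {"latest_ip": ip, "resolutions": 1}
            inserted += 1
        else:
            # Known domain name: refresh latest IP and resolution count.
            entry["latest_ip"] = ip
            entry["resolutions"] += 1
    return inserted


# Invented daily batch; addresses come from the documentation range 192.0.2.0/24.
daily_records = [
    ("example.com", "192.0.2.1"),
    ("new-site.net", "192.0.2.7"),
    ("example.com", "192.0.2.2"),
]
database = {"example.com": {"latest_ip": "192.0.2.0", "resolutions": 5}}
mq = queue.Queue()
publish_records(daily_records, mq)
new_count = consume_into_database(mq, database)
```

In a real deployment the producer and consumer would run as separate processes connected by the message broker; they run sequentially here only to keep the sketch self-contained.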

S2012, extracting text features of all the historical domain names, and vectorizing the text features of all the historical domain names to obtain a plurality of domain name feature vectors, wherein each domain name feature vector corresponds to each historical domain name one by one;

The text features of all the historical domain names obtained in step S2011 are extracted through a domain name embedding algorithm and then vectorized to obtain a plurality of domain name feature vectors. The domain name feature vectors correspond one-to-one with the historical domain names, forming a data mapping between them.

The domain name embedding algorithm (also called a word embedding algorithm, or embedding) belongs to a family of language modeling and representation learning techniques widely used in natural language processing (NLP). Specifically, a high-dimensional space whose dimension equals the number of all words is embedded into a continuous vector space of much lower dimension, and each word or phrase is represented as a vector over the real numbers. Common ways of obtaining such embeddings include artificial neural networks, dimensionality reduction of the word co-occurrence matrix, probabilistic models, and explicit representation of a word in terms of the contexts in which it appears. These embedding algorithms can greatly improve the performance of parsers and text sentiment analysis in NLP; applied here, they give an accurate vectorized representation of the domain name to be checked and a domain name feature vector that maps to each historical domain name.

Through the processing based on the domain name embedding algorithm, the character space of each historical domain name is reduced to a standardized continuous vector space of much lower dimension, and each historical domain name is converted into a domain name feature vector that maps to it. A plurality of domain name feature vectors are thereby obtained, each of which retains the text features of its domain name in the lower-dimensional space. Note that the domain name feature vector itself is still a high-dimensional vector; only the dimension of the space in which it lies has been reduced relative to the original character space. The processing based on the domain name embedding algorithm can thus be understood as converting the domain name string of each historical domain name into a high-dimensional vector.
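
To make the string-to-vector conversion concrete, the following sketch maps each domain name to a fixed-length vector. It is deliberately not one of the learned embedding algorithms named in this embodiment: as a simple stand-in it counts hashed character trigrams, and the dimension `dim`, the n-gram length `n` and the function name are assumptions chosen for illustration.

```python
import math
import zlib


def domain_to_vector(domain, dim=16, n=3):
    """Map a domain string to a fixed-length, L2-normalised vector.

    Stand-in for a learned embedding: hashed character n-gram counts.
    """
    text = f"^{domain.lower()}$"  # mark the start and end of the string
    vec = [0.0] * dim
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # crc32 is deterministic across runs, unlike the built-in hash().
        bucket = zlib.crc32(gram.encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


history = ["example.com", "examp1e.com", "unrelated.org"]
# One feature vector per historical domain name (one-to-one mapping).
vectors = {name: domain_to_vector(name) for name in history}
```

A learned embedding would replace `domain_to_vector` while keeping the same one-to-one mapping between historical domain names and feature vectors.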

S2013, encoding the domain name feature vectors, classifying the domain name feature vectors into domain name feature vector groups according to different encoding results, and constructing an adaptive tree index for the domain name feature vector groups, wherein each domain name feature vector group comprises at least one domain name feature vector.

The plurality of domain name feature vectors obtained in step S2012 are encoded. Specifically, the vectors are summarized and encoded using the iSAX representation encoding method to obtain the encoding result of each domain name feature vector. Based on the different encoding results, the distribution characteristics of all domain name feature vectors in the current full domain name database are analyzed, and the domain name feature vectors are classified according to these characteristics into a plurality of domain name feature vector groups, each of which includes at least one domain name feature vector.

An adaptive tree index is also constructed for the plurality of domain name feature vector groups. It is an adaptive tree indexing mechanism established in advance for these groups so that the domain name to be checked can subsequently be searched by index. The single-thread vector indexing method is taken as the example here. The adaptive tree indexing mechanism divides the whole search space of the full domain name database into a plurality of subspaces of different sizes; domain name feature vectors with the same encoding result are located at the same child node of the index tree, and domain name feature vectors with different encoding results are located at different child nodes. Because the child nodes are linked through the tree, when an index search request arrives, the domain name feature vector group matching the domain name feature vector to be checked can be found quickly and efficiently by visiting only part of the subspaces (child nodes). All domain name feature vectors in that group are taken as neighboring vectors of the domain name feature vector to be checked, and the similarity calculation is performed on these neighboring vectors first, which speeds up the search and shortens the response time.
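
The encode-then-group idea of step S2013 can be sketched as follows. This is a simplified, assumption-laden illustration: it uses a fixed-cardinality SAX-style code (z-normalisation, piecewise averaging, then symbols taken from equiprobable regions of the standard normal distribution) and a flat dictionary keyed by code in place of the adaptive tree index, whose node-splitting behaviour is not reproduced. The parameters `segments` and `cardinality`, the function names and the toy vectors are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist, mean, pstdev


def sax_code(vec, segments=4, cardinality=4):
    """Return a tuple of `segments` symbols, each in range(cardinality)."""
    if len(vec) % segments:
        raise ValueError("vector length must be a multiple of `segments`")
    mu, sigma = mean(vec), pstdev(vec)
    z = [(v - mu) / sigma if sigma else 0.0 for v in vec]  # z-normalise
    size = len(z) // segments
    # Piecewise aggregate approximation: one mean per segment.
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(segments)]
    # Breakpoints split the standard normal into equiprobable regions.
    cuts = [NormalDist().inv_cdf(j / cardinality) for j in range(1, cardinality)]
    return tuple(bisect_right(cuts, p) for p in paa)


def build_groups(vectors):
    """Group domain names whose feature vectors share the same code."""
    groups = defaultdict(list)
    for name, vec in vectors.items():
        groups[sax_code(vec)].append(name)
    return groups


def lookup_group(groups, query_vec):
    """Return the group whose code equals the code of the query vector."""
    return groups.get(sax_code(query_vec), [])


# Toy eight-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "a.example": [1, 1, 2, 2, 8, 8, 9, 9],
    "b.example": [1.1, 0.9, 2.1, 2.0, 8.2, 7.9, 9.0, 9.1],
    "c.example": [9, 9, 8, 8, 2, 2, 1, 1],
}
toy_groups = build_groups(toy_vectors)
candidates = lookup_group(toy_groups, [1, 1, 2, 2, 8, 8, 9, 9])
```

Only the vectors in `candidates` would then be passed to the distance calculation, which is what keeps the search from scanning the whole database.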

On the basis of the embodiment shown in fig. 3, in step S2012, the extraction of the text features of all the historical domain names and the vectorized representation of those text features are performed by a BERT domain name embedding algorithm;

or by a Word2Vec domain name embedding algorithm;

or by a GloVe domain name embedding algorithm.

That is, in this embodiment, the domain name embedding algorithm may be the BERT, Word2Vec or GloVe domain name embedding algorithm, any of which can be used to extract features from all the historical domain names and represent them as vectors.

The details of the use of the BERT domain name embedding algorithm are illustrated herein:

Feature extraction and vectorized representation of all the historical domain names using the Word2Vec or GloVe domain name embedding algorithm follow the same principle as the BERT domain name embedding algorithm, and can be adjusted and improved according to the needs of the actual application scenario, for example by optimizing the algorithm model or by adding data processing operations such as data screening and data cleaning. The embodiment of the present invention places no limitation on this.

Fig. 4 is a schematic application diagram of the third similar domain name searching method according to the embodiment of the present invention. As shown in fig. 4, after a user inputs a query request, the following steps are performed:

S101, obtaining the domain name to be checked from the query request;

S102, extracting text features of the domain name to be checked, and vectorizing the text features to obtain the domain name feature vector to be checked; step S102 is not shown in fig. 4;

S2011, acquiring all historical domain names in the historical domain name resolution database, the historical domain names being acquired by traversing and scanning the entire historical domain name resolution database so as to construct the full domain name database;

S2012, extracting text features of all the historical domain names stored in the full domain name database, and vectorizing them to obtain a plurality of domain name feature vectors, wherein the domain name feature vectors correspond one-to-one with the historical domain names;

S2013, encoding the plurality of domain name feature vectors, classifying them into a plurality of domain name feature vector groups according to the different encoding results, and constructing an adaptive tree index for the groups, wherein each group comprises at least one domain name feature vector; the classification in step S2013 is not shown in fig. 4, but the adaptive tree index constructed for the groups is shown in fig. 4 as the triangle and the node index tree inside it, in which the dark circles represent child nodes with the same encoding result and the light circles represent child nodes with different encoding results;

S103, encoding the domain name feature vector to be checked, and matching a target domain name feature vector group from the preset full domain name database according to its encoding result, wherein the full domain name database comprises a plurality of domain name feature vector groups classified according to the encoding results of the domain name feature vectors; in fig. 4, the domain name feature vectors at the child nodes represented by the dark circles, whose encoding result is the same as that of the domain name feature vector to be checked, are quickly found through the adaptive tree index and taken as the neighboring vectors of the domain name feature vector to be checked;

S104, calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group, and obtaining the similar domain names of the domain name to be checked according to the distance; that is, in fig. 4, the Euclidean distances between the domain name feature vector to be checked and the domain name feature vectors at the child nodes in the dark circles are calculated, two domain name feature vectors are found to have the minimum distance, and the historical domain names corresponding to those two vectors are therefore returned in the query result as the two similar domain names of the domain name to be checked.

The whole similar domain name search process is thus fast and efficient.

On the basis of the above embodiments, the adaptive tree index in step S2013 adopts a single-thread vector indexing method;

or a multithreaded parallel vector indexing method;

or an in-memory vector indexing method.

When the adaptive tree index in step S2013 adopts the single-thread vector indexing method, refer to the description of step S2013 in the embodiment shown in fig. 3, which is not repeated here;

Alternatively, the multithreaded parallel vector indexing method can be understood as the ParIS/ParIS+ multithreaded parallel vector indexing technique, which makes effective use of multi-core, multi-socket architectures so that the computation required for both index construction and query answering is distributed and executed in parallel. In addition, ParIS uses the Single Instruction Multiple Data (SIMD) capability of modern CPUs to further parallelize the execution of individual instructions within each core. Overall, ParIS/ParIS+ performs well: during index creation the CPU computation time is completely hidden by the I/O time, and the processing throughput approaches the sequential read speed of the hard disk. When responding to a domain name similarity search request, the ParIS/ParIS+ multithreaded parallel vector indexing system is one order of magnitude faster than the latest index scanning methods in the prior art and three orders of magnitude faster than the most advanced optimized serial scan in the prior art. Using the ParIS/ParIS+ multithreaded parallel vector indexing technique therefore greatly accelerates the similarity search response, so that the similar domain names of the domain name to be checked can be found quickly.
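
The partition-and-merge pattern behind a parallel search can be sketched as follows. This is not ParIS/ParIS+ and makes no claim about their internals: it only shows candidate vectors being split into chunks, each chunk being searched by a worker, and the partial results being merged. The function names, the `workers` parameter and the toy data are hypothetical, and in CPython the threads illustrate the structure rather than deliver true CPU parallelism.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def _best_in_chunk(query, chunk, k):
    """Return the k nearest (distance, name) pairs within one chunk."""
    scored = ((math.dist(query, vec), name) for name, vec in chunk)
    return heapq.nsmallest(k, scored)


def parallel_nearest(query, items, k=1, workers=4):
    """Split `items` into chunks, search each in a worker, merge the results."""
    items = list(items)
    step = max(1, math.ceil(len(items) / workers))
    chunks = [items[i:i + step] for i in range(0, len(items), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda chunk: _best_in_chunk(query, chunk, k), chunks)
        merged = [pair for part in partial for pair in part]
    # The global top k is always contained in the union of per-chunk top k.
    return heapq.nsmallest(k, merged)


# Toy two-dimensional vectors, invented purely for illustration.
items = [
    ("a.example", [0, 0]),
    ("b.example", [3, 4]),
    ("c.example", [1, 0]),
    ("d.example", [10, 10]),
    ("e.example", [0, 2]),
]
nearest_two = parallel_nearest([0, 0], items, k=2, workers=3)
```

The merge step is what makes the result identical to a serial scan regardless of how the candidates are partitioned.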

Alternatively, when the adaptive tree index adopts the in-memory vector indexing method, this can be understood as a new in-memory vector indexing method formed by adding the tree indexing mechanism to mature in-memory indexing technology. This method optimizes the processing of the vector index through the memory mechanism of the query system.

Fig. 5 is a flow chart of a fourth similar domain name searching method according to an embodiment of the present invention, where on the basis of the embodiment shown in fig. 1, the similar domain name searching method further includes:

S105, analyzing and judging whether the similar domain name is a malicious domain name or not through a malicious domain name vector reverse query method or an index-based KNN similarity search method.

Whether the similar domain name is a malicious domain name is analyzed and judged through the malicious domain name vector reverse query method or the index-based KNN similarity search method, using domain name data related to the domain name to be checked, such as the number of historical resolutions, the record of the IP address returned each time the historical domain name was resolved, and historical resolution failure records. Specifically, this domain name data is compared with the sample data in a malicious domain name sample library, and if it is highly similar to a malicious domain name sample, the similar domain name is identified as a malicious domain name. Adding this judgment allows malicious domain names to be screened out effectively and improves the overall security of the query system.
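
A KNN-style comparison against a malicious domain name sample library can be sketched as follows. This is a minimal sketch under stated assumptions: the embodiment does not specify the feature layout, the value of k or the decision rule, so the three normalised features, `k=3`, the mean-distance score and the `threshold` of 0.2 are all hypothetical, as are the sample values.

```python
import heapq
import math


def knn_malicious_score(candidate, malicious_samples, k=3):
    """Mean Euclidean distance from `candidate` to its k nearest malicious samples."""
    if not malicious_samples:
        raise ValueError("sample library is empty")
    nearest = heapq.nsmallest(
        k, (math.dist(candidate, sample) for sample in malicious_samples)
    )
    return sum(nearest) / len(nearest)


def is_malicious(candidate, malicious_samples, k=3, threshold=0.2):
    """Flag the candidate when it lies close to known malicious samples."""
    return knn_malicious_score(candidate, malicious_samples, k) <= threshold


# Invented features scaled to [0, 1], for example resolution frequency,
# IP address churn and resolution failure rate (hypothetical layout).
malicious_library = [
    [0.90, 0.80, 0.70],
    [0.85, 0.90, 0.75],
    [0.95, 0.70, 0.80],
]
suspicious = [0.90, 0.80, 0.75]
benign = [0.10, 0.10, 0.05]
suspicious_flag = is_malicious(suspicious, malicious_library)
benign_flag = is_malicious(benign, malicious_library)
```

A smaller score means the candidate sits closer to the malicious samples; in practice the threshold would have to be calibrated on labelled data rather than fixed in advance.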

It should be noted that the foregoing embodiments may be arbitrarily combined to form new embodiments, and the embodiments are specifically designed according to practical needs.

In a second aspect, an embodiment of the present invention provides a similar domain name searching device, where the device is a device for executing the similar domain name searching method described in the foregoing embodiments, and specific application principles may refer to detailed descriptions of the foregoing method embodiments, which are not repeated herein. Fig. 6 is a schematic structural diagram of a similar domain name searching device according to an embodiment of the present invention, as shown in fig. 6, where the device includes:

an obtaining module 601, configured to obtain a domain name to be checked;

The vectorization module 602 is configured to extract text features of the domain name to be checked, and vectorize the text features to obtain a feature vector of the domain name to be checked;

The encoding module 603 is configured to encode the domain name feature vector to be checked, and match a target domain name feature vector group from a preset full-scale domain name database according to an encoding result, where the full-scale domain name database includes a plurality of domain name feature vector groups classified according to a domain name feature vector encoding result;

And the similarity searching module 604 is configured to calculate a distance between the feature vector of the domain name to be searched and each domain name feature vector in the target feature vector group, and obtain a similar domain name of the domain name to be searched according to the distance.

In the similar domain name searching device provided by the embodiment of the invention, the vectorization module 602 represents the domain name to be checked obtained by the obtaining module 601 as a domain name feature vector to be checked; the encoding module 603 encodes that feature vector and matches the target domain name feature vector group from the preset full domain name database according to the encoding result; and the similarity searching module 604 calculates the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group and determines the similar domain names of the domain name to be checked from the calculated distances. With the modules working together, the device avoids computing the similarity between domain names directly. Instead, the similarity calculation between domain names is converted into a similarity comparison between feature vectors, specifically the calculation of the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group. This greatly reduces the computational difficulty and significantly increases the computation speed, so the device can also be applied to similarity searches over massive domain name data with improved execution efficiency.

In a third aspect, an embodiment of the present invention provides an electronic device, and fig. 7 is a schematic structural diagram of the electronic device provided in the embodiment of the present invention, as shown in fig. 7, where the electronic device includes: a processor (processor) 701, a communication interface (Communications Interface) 702, a memory (memory) 703 and a communication bus 704, wherein the processor 701, the communication interface 702 and the memory 703 communicate with each other through the communication bus 704. The processor 701 may call logic instructions of a computer program in the memory 703 to perform a similar domain name lookup method comprising:

Acquiring a domain name to be checked;

Further, the logic instructions in the memory 703 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the similar domain name searching method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements a similar domain name lookup method as described above, the method comprising:

Acquiring a domain name to be checked;

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for searching for similar domain names, comprising:

Acquiring a domain name to be checked;

calculating the distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group, and obtaining a similar domain name of the domain name to be checked according to the distance;

Before encoding the domain name feature vector to be checked and matching the target domain name feature vector group from a preset full domain name database according to the encoding result, the method further comprises the following steps: the step of constructing the full-scale domain name database specifically comprises the following steps:

The method comprises the steps of encoding the domain name feature vectors, classifying the domain name feature vectors into domain name feature vector groups according to different encoding results, and constructing an adaptive tree index for the domain name feature vector groups, wherein each domain name feature vector group comprises at least one domain name feature vector, the tree index adopts a single-thread vector index method, or the tree index adopts a multithreading parallel vector index method, or the tree index adopts a memory type vector index method.

2. The method for searching for similar domain names according to claim 1, wherein the calculating a distance between the feature vector of the domain name to be searched and each of the feature vectors of the domain name of the target feature vector set, and obtaining the similar domain name of the domain name to be searched according to the distance, specifically comprises:

Calculating Euclidean distance between the domain name feature vector to be checked and each domain name feature vector in the target domain name feature vector group, wherein the Euclidean distance refers to the square sum of multidimensional coordinate differences of corresponding points between the domain name feature vector to be checked and one domain name feature vector;

3. The method for searching for similar domain names according to claim 1, wherein the extracting text features of all the historical domain names and vectorizing the text features of all the historical domain names are performed by a BERT domain name embedding algorithm;

or by a Word2Vec domain name embedding algorithm;

or by a GloVe domain name embedding algorithm.

4. The method of claim 1, wherein encoding the plurality of domain name feature vectors comprises:

5. The method for searching for similar domain names according to claim 1, further comprising:

6. A similar domain name lookup apparatus, comprising:

The acquisition module is used for acquiring the domain name to be checked;

the similarity searching module is used for calculating the distance between the domain name feature vector to be searched and each domain name feature vector in the target domain name feature vector group, and obtaining a similar domain name of the domain name to be searched according to the distance;

The similar domain name lookup apparatus further includes a module for performing the steps of:

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the similar domain name lookup method as claimed in any one of claims 1 to 5 when executing the computer program.

8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a similar domain name lookup method as claimed in any of claims 1 to 5.