CN106803035A - Password guess set generation method and password cracking method based on username information

Info

Publication number: CN106803035A
Application number: CN201611079933.XA
Authority: CN
Inventors: 陈小军; 徐睿; 时金桥; 谭建龙; 文新; 胡兰兰; 王颖冰; 于晓杰
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2017-06-06

Abstract

The invention discloses a password guess set generation method and a password cracking method based on username information. The password cracking method is as follows: 1) perform word segmentation and semantic structure labeling on the usernames and passwords in a leaked-data training set, and compute the semantic similarity between usernames and passwords, where the semantic similarity comprises semantic structure similarity and semantic fragment similarity; 2) apply the semantic similarity to a PCFGs grammar, that is, construct the PCFGs grammar based on the semantic similarity; 3) generate a password guess set in descending order of probability from the PCFGs grammar constructed in step 2); 4) crack passwords using the password guess set. The invention uses the fragment similarity and structure similarity between usernames and passwords to understand the compositional semantics of passwords, and thereby generates a password guess set and improves password cracking efficiency.

Description

Password guess set generation method and password cracking method based on username information

Technical field

The invention relates to a password guess set generation method and a password cracking method based on username information.

Background

For a long time, passwords have been cracked with traditional brute-force methods. These methods do not analyze passwords in any depth, so both their effectiveness and their efficiency are unsatisfactory.

In some newer approaches, ideas and tools from natural language processing are applied to password analysis and cracking. These approaches treat a password as a kind of natural-language sentence, composed of a series of fragments arranged in a hierarchical structure. The fragments that appear in passwords are usually dictionary words, dates, or other meaningful strings, and the way the fragments are combined often follows fixed patterns. NLTK (Natural Language Toolkit) and WordNet can be used to segment passwords and to tag the segments with parts of speech and semantic categories. Probabilistic Context-Free Grammars (PCFGs) from natural language processing are then used to learn the grammar rules that generate passwords, and a password guess set is generated in descending order of probability. However, this approach has been shown to crack inefficiently when the attacked website contains many weak passwords. It is also inefficient against passwords from Chinese websites, because its segmentation system cannot segment Chinese pinyin effectively.

The main problem with this approach is that it neither analyzes the semantic content of passwords and the grammar between semantic categories in sufficient depth, nor assigns suitable probabilities to the words in the dictionaries it uses.

Summary of the invention

The purpose of the invention is to apply ideas and tools from natural language processing to password analysis and cracking: usernames are decomposed and analyzed to extract fragment and structure features, and the fragment similarity and structure similarity between usernames and passwords are used to understand the compositional semantics of passwords and to speed up password cracking. The result is a password guess set generation method and a password cracking method based on username information.

To exploit the information contained in usernames and improve password cracking efficiency, the invention provides a PCFGs-based password guess generator that extracts the semantic similarity between usernames and passwords, referred to below as the semantic-similarity-based password guess generator.

The technical solution of the invention is as follows.

A password guess set generation method based on username information, comprising the steps of:

1) performing word segmentation and semantic structure labeling on the usernames and passwords in a leaked-data training set, and computing the semantic similarity between usernames and passwords, wherein the semantic similarity comprises semantic structure similarity and semantic fragment similarity;

2) applying the semantic similarity to a PCFGs grammar, that is, constructing the PCFGs grammar based on the semantic similarity;

3) generating a password guess set in descending order of probability from the PCFGs grammar constructed in step 2).

A password cracking method based on username information, comprising the steps of:

1) performing word segmentation and semantic structure labeling on the usernames and passwords in a leaked-data training set, and computing the semantic similarity between usernames and passwords, wherein the semantic similarity comprises semantic structure similarity and semantic fragment similarity;

2) applying the semantic similarity to a PCFGs grammar, that is, constructing the PCFGs grammar based on the semantic similarity;

3) generating a password guess set in descending order of probability from the PCFGs grammar constructed in step 2);

4) cracking passwords using the password guess set.

Further, the PCFGs grammar is constructed based on semantic similarity as follows. According to the semantic structure similarity between usernames and passwords, the differently distributed password structures chosen by users with different username semantic structures are obtained, and these password structures are used as the non-terminal structures of the PCFGs grammar. According to the semantic fragment similarity between usernames and passwords, semantic fragments from the usernames are added to the terminal word set of the PCFGs grammar used to generate passwords, which yields the terminal word set of the PCFGs grammar.

Further, for a fragment in a password, if the fragment also appears in the usernames of the leaked-data training set, the frequency of the fragment in the leaked-data training set is multiplied by a probability coefficient α, and the α-scaled frequency is added to the fragment's existing frequency in the terminal word set to give its new frequency. If the terminal word set does not contain the fragment, the fragment is added to the terminal word set together with its frequency. The probability distribution of the terminal words in the terminal word set is then updated.

Further, step 3) is implemented as follows. A priority queue is created for each non-terminal structure; the queue stores the password guesses generated from that non-terminal structure in descending order of probability. The first elements of all priority queues are then traversed to find the password with the highest probability; that password is dequeued and output to the password guess set, and the search is repeated until the number of passwords in the password guess set reaches a specified value.

Further, usernames and passwords are segmented and labeled with semantic structure according to semantic categories. The semantic categories are: pinyin full name, pinyin name abbreviation, pinyin given name, pinyin surname, pinyin phrase, other pinyin, English phrase, English name, English word, other letters, numeric date, other digits, single-character repetition, string repetition, equidistant keyboard jumps, adjacent characters in the same keyboard row, adjacent characters in different keyboard rows, and other special symbols.

Further, semantic fragment similarity is measured with seven indicators, #1 to #7. Indicator #1 means the username contains the semantic category; #2 means the user's password also contains that semantic category; #3 means the characters of that semantic category are identical in the username and the password; #4 means the characters of that semantic category in the username and the password are the same except for letter case; #5 means the username characters are a substring of the password characters; #6 means the password characters are a substring of the username characters; #7 means indicator #2 is satisfied but none of #3 to #6 is.

Further, the leaked-data training set is selected from datasets that have been publicly leaked on the Internet.

The invention has two main aspects: (1) analyzing and extracting username-password similarity from datasets publicly leaked on the Internet; (2) improving the PCFGs algorithm so that the semantic structure similarity and fragment similarity between usernames and passwords are used in password guess generation, with the password guess set generated in strictly descending order of probability.

A leaked password set is usually accompanied by other leaked user information, such as usernames. A username can be regarded as an abstract symbol of the user's identity in the online world; together with the corresponding password it forms a barrier protecting the user's privacy. The way a username is created reflects some of the user's habits and tendencies when creating online credentials. Our analysis of leaked password sets shows that users of Chinese online communities tend to use pinyin and digits in both usernames and passwords, and that pinyin names are very common in usernames. For example, one user has the username "xiaoming19860805" and the password "xm0805" (xm is the pinyin abbreviation of xiaoming). This user put a pinyin name in the username and its abbreviation in the password. The date "19860805" appears in the username, and part of that date, "0805", appears in the password. Structurally, both the username and the password follow the pattern of a pinyin name followed by a date. Because a username and its password are created by the same user, the habits and tendencies visible in the username are likely to carry over to the password. Analyzing the composition of usernames in depth therefore helps us understand how passwords are created and crack them more effectively.

Based on the PCFG grammar, the invention proposes a password generation algorithm that extracts and uses the semantic structure similarity and fragment similarity between usernames and passwords. The flow of the password cracking method is shown in Figure 1 and includes the following:

1) Usernames and passwords are each segmented and labeled with semantic structure, and indicators for measuring the semantic similarity between usernames and passwords are given. Two kinds of semantic similarity are defined: semantic structure similarity and semantic fragment similarity.

2) A PCFGs grammar is constructed based on the semantic similarity, and the password guess set is generated in strictly descending order of probability.

Experimental results on large-scale public datasets demonstrate the effectiveness of the proposed semantic-similarity-based password guess generator.

Eighteen semantic categories are defined for Chinese pinyin, English words, digits, and special symbols (see Table 1). Usernames and passwords are segmented using the Contemporary Corpus of American English corpus, the pinyin dictionary of the Ziguang Pinyin input method, Webster's English Dictionary, and 31 regular expressions for extracting numeric dates; the resulting segments are then labeled with semantic categories.
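As a rough illustration of segmentation followed by labeling, the sketch below splits a string into runs of letters, digits, and other symbols, and marks an eight-digit run as a date when it matches a single YYYYMMDD pattern. It is a toy under stated assumptions: the labels `LETTERS`, `DIGITS`, `DATE`, and `SYMBOLS` are hypothetical stand-ins for the 18 categories, no dictionaries are consulted, and one regular expression takes the place of the 31 date expressions mentioned above.

```python
import re

# One illustrative date pattern (YYYYMMDD); an assumption, not one of the
# patent's 31 regular expressions.
DATE_RE = re.compile(r"(19|20)\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])")

# Each alternative is a named group, so the group name doubles as the label.
RUN_RE = re.compile(
    r"(?P<LETTERS>[A-Za-z]+)|(?P<DIGITS>[0-9]+)|(?P<SYMBOLS>[^A-Za-z0-9]+)"
)


def segment(text):
    """Split text into character-class runs and return (label, fragment) pairs."""
    result = []
    for match in RUN_RE.finditer(text):
        label = match.lastgroup
        fragment = match.group()
        # Refine digit runs: a full YYYYMMDD match is labeled as a date.
        if label == "DIGITS" and DATE_RE.fullmatch(fragment):
            label = "DATE"
        result.append((label, fragment))
    return result


def structure(text):
    """Return the semantic structure of text: the sequence of labels only."""
    return tuple(label for label, _ in segment(text))
```

With these toy labels, `structure("xiaoming19860805")` gives a letters-then-date structure, while `"xm0805"` gives letters followed by plain digits, because the four-digit partial date does not match the single pattern used here.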

Table 1: The 18 semantic categories and their abbreviated notation

Semantic similarity analysis. For semantic structure similarity, users are grouped by the semantic structure of their usernames, and the password semantic structures chosen by each group are counted, which gives the probability distribution of password structures for each username structure. For semantic fragment similarity, seven indicators, #1 to #7, are defined, and the proportion for each indicator is computed as shown in Table 2. Indicator #1 means the username contains the semantic category in the left column; its proportion is the number of users meeting the indicator as a percentage of all users. Indicator #2 means the user's password also contains that semantic category; its proportion is the number of users meeting the indicator as a percentage of the users whose usernames contain the category. For users who meet indicator #2, that is, whose username and password both contain the semantic category, indicators #3 to #7 analyze the similarity of the username and password content further; each of their proportions is the number of users meeting the indicator as a percentage of the users who meet indicator #2. Indicator #3 means the characters of the category are identical in the username and the password. Indicator #4 means they differ only in letter case. Indicator #5 means the username characters are a substring of the password characters. Indicator #6 is the reverse: the password characters are a substring of the username characters. Indicator #7 means indicator #2 is met but none of #3 to #6 is.
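Indicators #3 to #7 can be sketched as a classification of one username fragment and one password fragment of the same semantic category. The sketch assumes the indicators are tested in order and treated as mutually exclusive, which the text does not state explicitly; the function names are hypothetical.

```python
from collections import Counter


def fragment_indicator(user_frag, pass_frag):
    """Return which of indicators #3..#7 applies to a same-category fragment pair."""
    if user_frag == pass_frag:
        return 3  # identical content
    if user_frag.lower() == pass_frag.lower():
        return 4  # same content, differing only in letter case
    if user_frag in pass_frag:
        return 5  # username fragment is a substring of the password fragment
    if pass_frag in user_frag:
        return 6  # password fragment is a substring of the username fragment
    return 7      # both contain the category, but none of #3..#6 holds


def indicator_proportions(pairs):
    """Share of each indicator #3..#7 among users who meet indicator #2.

    pairs: list of (username_fragment, password_fragment), one per such user.
    """
    if not pairs:
        return {}
    counts = Counter(fragment_indicator(u, p) for u, p in pairs)
    return {k: counts[k] / len(pairs) for k in range(3, 8)}
```

For the numeric date fragments of the earlier example user, `fragment_indicator("19860805", "0805")` falls under indicator #6, because the password fragment is contained in the username fragment.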

Table 2: Analysis of the seven semantic fragment similarity indicators on the leaked datasets

The semantic-similarity-based password guess generator of the invention mainly comprises the following steps:

1) Segment both usernames and passwords by semantic category to obtain their structures and fragments. Count the frequencies of username structures, username fragments, password structures, and password fragments separately, and divide each frequency by its respective total to obtain probabilities.

2) Apply semantic similarity to the PCFGs grammar. For semantic structure similarity, training yields the differently distributed password structures chosen by users with different username semantic structures; when password guesses are generated, the password structures are adjusted to match the username structures to be cracked, which gives the non-terminal structures of the PCFGs. For semantic fragment similarity, the username semantic fragments obtained in step 1) are merged with the password semantic fragments, with an adjustable weight parameter applied to the terminal words that come from usernames, which gives the terminal word set of the PCFGs. The weight parameter is determined experimentally, for example by trying values from 1 to 500 and keeping the one with the best results.

3) Once the PCFGs grammar has been built, generate the password guess set in descending order of probability. The invention designs a new password guess generating function, Generating_Guesses. The main idea is to create a priority queue for each non-terminal structure; the queue generates and stores the password guesses of its non-terminal structure in strictly descending order of probability. To produce the final guesses, the first element of every priority queue (the highest-probability password in that queue) is examined to find the password with the highest probability overall; that password is dequeued and output to the password guess set, and the search is repeated. When the number of passwords in the password guess set reaches the specified value, the guess set is complete.
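The per-structure priority queues of step 3) can be sketched as follows. This is a minimal illustration and not the patent's Generating_Guesses function: the grammar layout (a dict mapping each structure to its probability and to per-slot terminal lists sorted by descending probability) and the names `guesses_for_structure` and `generating_guesses` are assumptions made for the example. Each structure gets a lazy queue that yields its guesses best-first, and the merge loop scans the head of every queue, as described above.

```python
import heapq


def guesses_for_structure(struct_prob, slots):
    """Yield (probability, guess) for one structure, highest probability first.

    slots: one non-empty list per fragment position, each holding
    (word, probability) pairs sorted by descending probability.
    """
    def prob_of(idx):
        p = struct_prob
        for slot, i in zip(slots, idx):
            p *= slot[i][1]
        return p

    start = (0,) * len(slots)
    heap = [(-prob_of(start), start)]  # max-heap via negated probability
    seen = {start}
    while heap:
        neg_p, idx = heapq.heappop(heap)
        yield -neg_p, "".join(slot[i][0] for slot, i in zip(slots, idx))
        # Successors move one slot to its next (less probable) word, so a
        # successor is never more probable than the entry just popped.
        for k in range(len(slots)):
            if idx[k] + 1 < len(slots[k]):
                nxt = idx[:k] + (idx[k] + 1,) + idx[k + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-prob_of(nxt), nxt))


def generating_guesses(grammar, limit):
    """Merge all per-structure queues into one list of at most `limit` guesses."""
    queues = [guesses_for_structure(p, slots) for p, slots in grammar.values()]
    heads = [next(q, None) for q in queues]
    out = []
    while len(out) < limit:
        best = None
        for i, head in enumerate(heads):
            if head is not None and (best is None or head[0] > heads[best][0]):
                best = i
        if best is None:  # every queue is exhausted
            break
        out.append(heads[best])
        heads[best] = next(queues[best], None)
    return out
```

Scanning every queue head costs time linear in the number of structures per guess; a heap over the heads would reduce that, but the linear scan mirrors the traversal described in the text.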

Step 2) above, "applying semantic similarity to the PCFGs grammar", is the core of the invention. The application of semantic structure similarity and semantic content similarity is described in detail below.

Semantic structure similarity means the following: users with different username structures differ in their choice of password structure, and users with the same username structure tend to choose similar password structures; users tend to use the same semantic categories in their username and password, and sometimes the username and password have exactly the same semantic structure. In a cross-site attack, the attacker holds the usernames and passwords of a leaked website A and the usernames of the attacked website B. To apply semantic structure similarity, the PCFGs trained on dataset A should therefore use the username information of website B to correct the username derivation rules and probabilities: the probability distribution of username non-terminal structures should be trained on the username data of website B, while the probability distribution of password non-terminal structures derived from username non-terminal structures, and the probability distribution of password terminal words, are trained on dataset A. Because the passwords of website B are unavailable, this training relies on the semantic similarity finding that users with the same username structure tend to choose similar password structures, so the password usage of a given class of users on website A reflects, to some extent, the password usage of the same class of users on website B. For example, let the PCFGs trained on the 12306 dataset be denoted PCFGs|dist(US_12306). When the leaked 12306 dataset is used to attack CSDN, the probability distribution of username non-terminal structures and the probability distribution of username terminal words should be recomputed from the CSDN username data. The resulting new PCFGs are denoted PCFGs|dist(US_CSDN); the process is shown in Figure 2.

Define PW_i = <ps_i1, ps_i2, ...>, where PW_i is the set of password structures used by users whose username structure is us_i. By semantic structure similarity, users with the same username structure tend to choose similar password structures, so the password structures of users whose username structure is us_i can be taken to follow the password structure distribution ps_i of us_i. Consequently, if the username structure us_i is used more frequently among the usernames of the attacked website than in the training set, the related password structures ps_ij are given higher probability. The password structure probability is computed as follows (taking an attack on CSDN using 12306 as an example):

F(ps_j)|dist(US_CSDN) = Σ_i F(us_i) * F(ps_j | (US_12306, us_i))    (1)

In formula (1), F(.) denotes a probability function. F(us_i) is the probability of the username structure us_i which, as described above, is computed from the usernames of the attacked website (here CSDN). F(ps_j | (US_12306, us_i)) is the probability that users in the 12306 dataset whose username structure is us_i choose the password structure ps_j (that is, the probability of the non-terminal structure in the priority queue), and ps_j is a password structure in the global password structure distribution.
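Formula (1) amounts to a weighted sum of conditional distributions, which can be sketched as below. The dictionary layout and the function name are assumptions for illustration. Username structures that occur on the target site but never in the training set contribute nothing in this sketch; the text does not say how they are handled.

```python
def password_structure_distribution(target_username_dist, cond_password_dist):
    """Compute F(ps_j) on the target site following formula (1).

    target_username_dist: {us_i: F(us_i)} from the target site's usernames.
    cond_password_dist:   {us_i: {ps_j: F(ps_j | us_i)}} from the training set.
    """
    result = {}
    for us, p_us in target_username_dist.items():
        # Unseen username structures have no conditional distribution here.
        for ps, p_ps in cond_password_dist.get(us, {}).items():
            result[ps] = result.get(ps, 0.0) + p_us * p_ps
    return result
```

For instance, if 75% of target usernames have a name-plus-date structure and training users with that structure split evenly between two password structures, each of those password structures receives 0.375 from that group before the contributions of other groups are added.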

Semantic fragment similarity means that users tend to use the same or similar semantic content in their usernames and passwords. First, the frequency of each semantic fragment word in the training-set usernames is multiplied by a probability coefficient α, and the α-scaled frequency is added to the word's existing frequency in the password terminal word set Σ to give its new frequency. If Σ does not contain the word, the word is added to Σ as a new terminal word together with its frequency. The larger α is, the more the word's frequency in Σ is increased, and the more often the word is used when passwords are generated. These operations change both the number of words in Σ and their frequencies; normalizing the probabilities afterward gives the final probability distribution of the terminal words in Σ.
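The α-weighted update of the terminal word set can be sketched as follows. The function name and the dictionary representation of Σ are assumptions. The sketch also scales the frequency of words that are new to Σ by α, following step 3) of the embodiment, where every username terminal frequency is multiplied by the coefficient.

```python
def boost_terminals(password_freq, username_freq, alpha):
    """Merge username fragment counts into the password terminal set and normalize.

    password_freq: {word: count} for the password terminal word set.
    username_freq: {word: count} for fragments found in usernames.
    alpha:         probability coefficient applied to username counts.
    """
    merged = dict(password_freq)
    for word, freq in username_freq.items():
        # Existing words gain alpha * freq; new words start at alpha * freq.
        merged[word] = merged.get(word, 0) + alpha * freq
    total = sum(merged.values())
    if total == 0:
        return {}
    # Normalize counts into a probability distribution over terminal words.
    return {word: count / total for word, count in merged.items()}
```

With `password_freq = {"123456": 6, "xm": 2}`, `username_freq = {"xm": 1, "xiaoming": 2}`, and `alpha = 4`, the merged counts are 6, 6, and 8, so the probabilities become 0.3, 0.3, and 0.4.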

Compared with the prior art, the effects of the invention are as follows.

To evaluate the cracking method of the invention, which relies on the semantic similarity between usernames and passwords, we ran comparative experiments on different datasets against the traditional PCFG grammar constructed by Weir and against John the Ripper, a mainstream password cracking program. The experimental data come from the leaked account and password sets of CSDN, Sina Weibo, Duduniu, 178, 7k7k, and 17173, totaling 51 million accounts. Because of limits on computer performance, the number of password guesses in each experiment is capped at 100 million.

Figure 3 shows the effect of semantic structure similarity on password cracking efficiency. All methods use the leaked CSDN password set as the training set and attack the Sina Weibo password set. Within the first 10 million guesses, although the proportion of passwords eventually guessed is similar, the guessing success rate of the method of the invention, Sim_PCFG_on_LD, grows markedly faster than that of Weir's PCFG (Weir_PCFG_on_LD) and John the Ripper (JtR_with_Mangling). To guess 31.3% of the passwords, JtR_with_Mangling, Weir_PCFG_on_LD, and Sim_PCFG_on_LD need 1.73×10⁶, 2.32×10⁶, and 1.30×10⁶ guesses, respectively. This experiment shows that using semantic structure similarity speeds up password cracking.

Figure 5 compares different values of the probability coefficient α, again with the leaked CSDN password set as the training set and the Sina Weibo password set as the target. Results for α = 1, 4, 15, and 500 are shown. The figure shows that α = 4 gives the best performance in both efficiency and effectiveness.

Figure 4 shows the experimental effect of semantic fragment similarity on password cracking efficiency. The leaked password sets of 17173 and Sina Weibo are each used as attack targets; every method is trained on each of the other four datasets and then used in a cross-site attack on 17173 and Sina Weibo. The attacks on 17173 show that, depending on the training set, Sim_PCFG_on_LDU4, which uses both kinds of semantic similarity, guesses about 14.0% to 25.5% more passwords than Weir's PCFG method and about 55.3% to 81.6% more than John the Ripper. In three of the four attacks on Sina Weibo, Sim_PCFG_on_LDU4 cracks more efficiently than both Weir's method and John the Ripper. The exception is the attack trained on the passwords leaked from the Duduniu website. A comparison of the leaked Duduniu and Sina Weibo password sets shows that as many as 88.8% of the usernames and passwords in the two datasets are identical. With such a high overlap, Weir's method is more efficient in this attack: compared with Sim_PCFG_on_LDU4, Weir's PCFG method generates fewer new passwords that are absent from the training set, so the distribution of its password guess set is closer to that of the original training set.

Description of drawings

Figure 1 is a flowchart of the password cracking method of the invention;

Figure 2 shows the training process of the semantic-similarity-based PCFG (Sim_PCFG) on the 12306 dataset;

Figure 3 shows the effect of semantic fragment similarity on password cracking efficiency;

Figure 4 shows attacks on 17173 and Sina Weibo with different training sets:

(a) training set 178, test set 17173; (b) training set 7k7k, test set 17173; (c) training set csdn, test set 17173; (d) training set dodonew, test set 17173; (e) training set 178, test set sinaweibo; (f) training set 7k7k, test set sinaweibo; (g) training set csdn, test set sinaweibo; (h) training set dodonew, test set sinaweibo;

Figure 5 compares the performance of different values of the probability coefficient α.

Detailed description

Take as an example the LDU4 method (probability coefficient α = 4), with the leaked CSDN password database as the training set and the leaked 12306 database as the target set:

1) Extract the password structures S₁ and fragments T₁ from the CSDN database, and the username structures S₂ and fragments T₂ from the 12306 database.

2) Add S₂ to S₁: the frequency of each structure in S₂ is added directly to S₁, and the probability of each structure is then recomputed globally.

3) Add T₂ to T₁: the frequency of each terminal word in T₂, multiplied by 4, is added to T₁, and the probability of each terminal word is then recomputed globally.

4) Add dictionaries such as the Contemporary Corpus of American English corpus, the pinyin dictionary of the Ziguang Pinyin input method, and Webster's English Dictionary to T₁. At this point the S and T of the PCFGs have been generated; sort them by probability in descending order.

S的结果举例如下：An example of the result of S is as follows:

T的结果举例如下：An example of the results of T is as follows:

5)生成全局优先级队列，计算每种结构选取最大概率终端词组合之后的概率，队列中最大的为概率最高的密码猜测。全局优先级队列中最大概率的结构为“K_CONTINUOUS”，最大概率的终端词为“123456789”，输出的第一个密码猜测的概率为0.0567154276067*0.217089432985＝0.012312320020632619。接着按照概率降序依次输出密码猜测。5) Generate a global priority queue, and calculate the probability of selecting the highest probability terminal word combination for each structure, and the largest in the queue is the password guess with the highest probability. The structure with the highest probability in the global priority queue is "K_CONTINUOUS", the terminal word with the highest probability is "123456789", and the probability of guessing the first output password is 0.0567154276067*0.217089432985=0.012312320020632619. Then output password guesses in descending order of probability.

6)生成1亿个密码猜测集，并同时对目标集12306进行比对，查看是否能命中，每隔10000个密码猜测输出次数和命中率作为实验结果。如：6) Generate 100 million password guessing sets, and compare the target set 12306 at the same time to see if it can be hit, and output the number of times and hit rate every 10,000 password guesses as the experimental results. Such as:

10000,0.153374333063

20000,0.177067680864

...

99990000,0.538215278411

100000000,0.538219720953
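Steps 2) and 3) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `merge_counts` and `to_probabilities`, the structure label `PINYIN_NAME+DIGITS`, and all toy frequencies are hypothetical. Only the label `K_CONTINUOUS`, the terminal `123456789`, and the coefficient α = 4 come from the text.

```python
from collections import Counter


def merge_counts(password_counts, username_counts, alpha=1):
    """Add username frequencies, scaled by alpha, to the password frequencies."""
    merged = Counter(password_counts)
    for item, freq in username_counts.items():
        merged[item] += alpha * freq  # items missing from the password set start at 0
    return merged


def to_probabilities(counts):
    """Normalise raw frequencies into one global probability distribution."""
    total = sum(counts.values())
    return {item: freq / total for item, freq in counts.items()}


# Toy frequencies, invented purely for illustration.
s1 = Counter({"K_CONTINUOUS": 6, "PINYIN_NAME+DIGITS": 3})  # password structures
s2 = Counter({"PINYIN_NAME+DIGITS": 1})                     # username structures
t1 = Counter({"123456789": 5, "zhang": 2})                  # password segments
t2 = Counter({"zhang": 1, "wei": 1})                        # username segments

# Step 2: structure frequencies are added directly (no scaling).
structure_probs = to_probabilities(merge_counts(s1, s2, alpha=1))
# Step 3: username segment frequencies are multiplied by alpha = 4 (LDU4).
terminal_probs = to_probabilities(merge_counts(t1, t2, alpha=4))

# Step 5: the probability of a guess is the structure probability
# multiplied by the probability of the chosen terminal word.
first_guess_prob = structure_probs["K_CONTINUOUS"] * terminal_probs["123456789"]
```

With larger α, segments that also occur in usernames gain weight relative to segments seen only in passwords, which is the effect that Figure 5 compares across α values.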

Claims

1. A password guessing set generation method based on username information, comprising the steps of:

1) performing word segmentation and semantic structure labeling on the usernames and on the passwords in a leaked-data training set, and calculating the semantic similarity between usernames and passwords; wherein the semantic similarity includes semantic structure similarity and semantic segment similarity;

2) applying the semantic similarity to a PCFGs grammar, that is, constructing the PCFGs grammar based on the semantic similarity;

3) generating a password guessing set in descending order of probability according to the PCFGs grammar constructed in step 2).

2. A password cracking method based on username information, comprising the steps of:

3) generating a password guessing set in descending order of probability according to the PCFGs grammar constructed in step 2);

4) performing password cracking according to the password guessing set.

3. The method of claim 1 or 2, characterized in that the PCFGs grammar is constructed based on the semantic similarity as follows: according to the semantic structure similarity of usernames and passwords, the distribution of the different password structures chosen for usernames of different semantic structures is obtained, and the password structures are used as the nonterminal structures of the PCFGs grammar; according to the semantic segment similarity of usernames and passwords, the semantic segments selected from the usernames are added to the terminal word set of the PCFGs grammar used to generate passwords, yielding the terminal word set of the PCFGs grammar.

4. The method of claim 3, characterized in that, for a segment in a password, if the segment appears in the usernames of the leaked-data training set, the frequency of the segment in the leaked-data training set is multiplied by a probability coefficient α, and the frequency scaled by α is added to the original frequency of the segment in the terminal word set to give the new frequency of the segment; if the terminal word set does not contain the segment, the segment and its frequency information are added to the terminal word set together; the probability distribution of the terminal words in the terminal word set is then updated.

5. The method of claim 3, characterized in that step 3) is implemented as follows: a priority queue is created for each nonterminal structure, the priority queue storing the password guesses generated by the corresponding nonterminal structure in descending order of probability; the first elements of all priority queues are then traversed to find the password with the highest probability, that password is dequeued and output to the password guessing set, and the next password search is carried out, until the number of passwords in the password guessing set reaches a set value.

6. The method of claim 1 or 2, characterized in that usernames and passwords are segmented and labeled with semantic structures according to semantic categories; wherein the semantic categories include pinyin full name, pinyin initials, pinyin given name, pinyin surname, pinyin phrase, other pinyin, English phrase, English name, English word, other letters, alphanumeric data, other digits, single-character repetition, string repetition, equidistant keyboard jumps, adjacent characters in the same keyboard row, adjacent characters in different keyboard rows, and other special symbols.

7. The method of claim 6, characterized in that the measurement indices of the semantic segment similarity comprise seven indices, #1 to #7; wherein index #1 indicates that the username contains the semantic category; index #2 indicates that the user password contains the semantic category; index #3 indicates that the character content of the semantic category is identical in the username and the password; index #4 indicates that the character content of the semantic category is identical in the username and the password except for letter case; index #5 indicates that the username characters are a substring of the password characters; index #6 indicates that the password is a substring of the username; index #7 indicates that index #2 is satisfied but indices #3 to #6 are not.

8. The method of claim 1 or 2, characterized in that the leaked-data training set is selected from leaked datasets publicly disclosed on the internet.
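The queue-based generation described in claim 5 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions: the function names, the category labels `L` and `D`, and all probabilities are hypothetical; terminal probabilities are normalised per category here for simplicity; and each structure's guesses are enumerated eagerly, which is workable only at toy scale (a practical generator would produce each queue lazily).

```python
from collections import deque
from itertools import product


def guesses_for_structure(structure, struct_prob, terminals):
    """Build one structure's queue of (probability, guess), highest probability first."""
    scored = []
    # Every combination of one terminal word per category slot in the structure.
    for combo in product(*(terminals[cat].items() for cat in structure)):
        prob = struct_prob
        for _, word_prob in combo:
            prob *= word_prob
        scored.append((prob, "".join(word for word, _ in combo)))
    scored.sort(key=lambda pair: -pair[0])
    return deque(scored)


def generate(structures, terminals, limit):
    """Repeatedly compare the heads of all queues and dequeue the most probable guess."""
    queues = [guesses_for_structure(s, p, terminals) for s, p in structures.items()]
    guesses = []
    while len(guesses) < limit:
        live = [q for q in queues if q]
        if not live:
            break  # every queue is exhausted
        best = max(live, key=lambda q: q[0][0])
        guesses.append(best.popleft())
    return guesses


# Hypothetical toy grammar: "L" is a letter segment, "D" a digit segment.
structures = {("L",): 0.6, ("L", "D"): 0.4}
terminals = {
    "L": {"zhang": 0.7, "wei": 0.3},
    "D": {"123": 0.8, "1": 0.2},
}

top_guesses = generate(structures, terminals, limit=4)
```

Because each queue is already sorted, only the head of each queue has to be inspected at every step, so the merged output is in descending order of probability without sorting all guesses together.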