CN112579713B - Address identification method, device, computing equipment and computer storage medium

Info

Publication number: CN112579713B
Application number: CN201910935761.9A
Authority: CN
Inventors: 姜荣鑫
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Liaoning Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Liaoning Co Ltd
Priority date: 2019-09-29
Filing date: 2019-09-29
Publication date: 2023-11-21
Anticipated expiration: 2039-09-29
Also published as: CN112579713A

Abstract

Embodiments of the present invention relate to the field of data processing technology and disclose an address identification method, device, computing device and computer storage medium. The method includes: segmenting the acquired address information to obtain a segmentation result; determining, from the segmentation result, the first transition state of the address information in a probabilistic finite state machine; calculating the first probability of the address information in the probabilistic finite state machine according to the first transition state; if the first probability is less than a threshold, determining whether the address information contains a typo; if the address information contains a typo, correcting the typo and obtaining the second transition state of the corrected address information in the probabilistic finite state machine; calculating the second probability of the address information in the probabilistic finite state machine according to the second transition state; and, if the second probability is greater than or equal to the threshold, determining that the address information is a valid address. In this way, embodiments of the present invention recognize user-entered address information by means of a probabilistic finite state machine.

Description

Address identification method, device, computing equipment and computer storage medium

Technical field

Embodiments of the present invention relate to the field of data processing technology, and in particular to an address identification method, device, computing device and computer storage medium.

Background

Address search is widely used in mobile map applications, map websites, navigation software and similar fields. Existing address identification methods mostly use a deterministic finite state machine, which consists of multiple state nodes and directed arcs connecting them; each directed arc is labeled with the transition condition between its two state nodes. Figure 1 shows the structure of a deterministic finite state machine, where S₀ to S₃ denote state nodes, I₀ to I₆ denote external inputs, and O₀ to O₆ denote the outputs produced when a state node acts on an external input; each output indicates a transition between states. For example, if the current state node is S₁ and an external input I₀ is received, an output O₀ is produced, indicating a transition to state S₀.

When a deterministic finite state machine is used for address identification, an input address is considered valid only if it matches the machine exactly. That is, an address is valid if it can travel from the start state node of the machine, through some number of intermediate state nodes, to the end state node. If the input address contains a typo, it cannot reach the end state node from the start state node and is therefore considered invalid.

Because place names are not uniformly standardized across regions, user-entered address information often contains typos. In that case a deterministic finite state machine treats the address as invalid and cannot recognize it.
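The exact-match behaviour described above can be illustrated with a minimal sketch. The transition table, state names and the `dfa_accepts` function are hypothetical and are not taken from the patent; they only show why a single typo causes a deterministic machine to reject the whole address.

```python
# Hypothetical transition table: (current state, input token) -> next state.
TRANSITIONS = {
    ("START", "辽宁省"): "PROVINCE",
    ("PROVINCE", "沈阳市"): "CITY",
    ("CITY", "浑南区"): "DISTRICT",
}
ACCEPTING = {"DISTRICT"}  # hypothetical end state


def dfa_accepts(tokens):
    """Return True only if every token has an exact transition
    and the walk ends in an accepting state."""
    state = "START"
    for token in tokens:
        key = (state, token)
        if key not in TRANSITIONS:
            # One unmatched token (for example a typo) rejects the address.
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

With this table, `["辽宁省", "沈阳市", "浑南区"]` is accepted, while the same address with the misspelled "沈洋市" is rejected outright.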

Summary of the invention

In view of the above problems, embodiments of the present invention provide an address identification method, device, computing device and computer storage medium that overcome, or at least partially solve, the above problems.

According to one aspect of the embodiments of the present invention, an address identification method is provided, the method including:

obtaining address information input by a user;

performing word segmentation on the address information to obtain a word segmentation result;

determining, according to the word segmentation result, the first transition state of the address information in a probabilistic finite state machine;

calculating, according to the first transition state, the first probability of the address information in the probabilistic finite state machine;

if the first probability is less than a preset threshold, determining, according to the first transition state, whether the address information contains a typo;

if the address information contains a typo, correcting the typo to obtain the second transition state of the corrected address information in the probabilistic finite state machine;

calculating, according to the second transition state, the second probability of the address information in the probabilistic finite state machine;

if the second probability is greater than or equal to the preset threshold, determining that the address information input by the user is a valid address.

In an optional manner, if the address information contains a typo, correcting the typo to obtain the second transition state of the corrected address information in the probabilistic finite state machine includes:

determining the text corresponding to a non-address node contained in the first transition state as the word to be corrected;

converting the word to be corrected into pinyin;

searching a preset typo lexicon, according to the pinyin, for candidate correction words that match the pinyin;

taking the candidate correction word with the highest search count as the correction comparison object;

calculating the ratio of the search count of the word to be corrected to the search count of the correction comparison object;

if the ratio is less than a preset ratio, replacing the word to be corrected with the correction comparison object;

determining the transition state of the replaced address information in the probabilistic finite state machine as the second transition state.
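The correction steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the pinyin table, the typo lexicon, the search counts, the `max_ratio` value and the `correct_word` function are all hypothetical toy data and names, and a real system would need an actual pinyin converter and real search statistics.

```python
# Hypothetical toy data (not from the patent).
PINYIN = {"沈洋市": "shen yang shi", "沈阳市": "shen yang shi"}
LEXICON_BY_PINYIN = {"shen yang shi": ["沈阳市", "沈洋市"]}
SEARCH_COUNTS = {"沈阳市": 98000, "沈洋市": 120}


def correct_word(word, max_ratio=0.1):
    """Replace `word` with the most-searched same-pinyin candidate when the
    word's own search count is small relative to that candidate's count."""
    candidates = LEXICON_BY_PINYIN.get(PINYIN.get(word, ""), [])
    if not candidates:
        return word  # nothing with matching pinyin: leave the word unchanged
    # Correction comparison object: candidate with the highest search count.
    best = max(candidates, key=lambda w: SEARCH_COUNTS.get(w, 0))
    best_count = SEARCH_COUNTS.get(best, 0)
    if best_count == 0:
        return word
    ratio = SEARCH_COUNTS.get(word, 0) / best_count
    return best if ratio < max_ratio else word
```

Here "沈洋市" is searched far less often than "沈阳市", so the ratio falls below the assumed `max_ratio` and the word is replaced; a correctly spelled word has ratio 1 against itself and is kept.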

In an optional manner, if the address information does not contain a typo, the method further includes:

determining the transition order of the state nodes;

when the transition order is inconsistent with a preset order, adjusting the transition order to be consistent with the preset order;

calculating, according to the adjusted transition state, the third probability of the address information in the probabilistic finite state machine;

if the third probability is greater than or equal to the preset threshold, determining that the address information input by the user is a valid address.
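A minimal sketch of the order adjustment, assuming the preset order is simply "higher-level address elements first". The level names and ranks in `LEVEL_RANK` and both function names are hypothetical, since the text does not spell out the preset order at this point.

```python
# Hypothetical level ranks; a smaller rank means a higher-level address element.
LEVEL_RANK = {
    "province": 0, "city": 1, "district": 2,
    "township": 3, "road": 4, "street_number": 5,
}


def is_in_preset_order(path):
    """path is a list of (level, text) pairs in the order the user typed them."""
    ranks = [LEVEL_RANK[level] for level, _ in path]
    return ranks == sorted(ranks)


def reorder(path):
    """Return the path sorted into the assumed preset order (stable sort)."""
    return sorted(path, key=lambda item: LEVEL_RANK[item[0]])
```

After reordering, the path can be scored again in the probabilistic finite state machine to obtain the third probability.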

In an optional manner, before obtaining the address information input by the user, the method further includes:

obtaining address information historically input by users to obtain training samples;

extracting the state nodes of a finite state machine from the training samples;

determining the transition paths of the training samples between the state nodes;

calculating, according to the transition paths, the transition probabilities between adjacent state nodes by means of a hidden Markov model;

taking the finite state machine that contains the transition probabilities between the state nodes as the probabilistic finite state machine.
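As a rough illustration of how transition probabilities could be derived from historical paths, the sketch below uses plain relative-frequency counting over observed transitions. This is a deliberately simpler stand-in, not the hidden Markov model computation named above, and the function name and path format are assumptions.

```python
from collections import Counter, defaultdict


def estimate_transitions(paths):
    """Estimate P(dst | src) as count(src -> dst) / count(src -> anything).

    `paths` is a list of state-node sequences extracted from training samples.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }
```

For two sample paths `START, city, district` and `START, city, road`, this gives probability 1.0 for `START` to `city` and 0.5 for each of the two transitions out of `city`.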

In an optional manner, performing Chinese word segmentation on the address information to obtain a word segmentation result includes:

performing atomic segmentation on the address information to obtain multiple single characters;

merging adjacent single characters in different ways to obtain first word segments;

matching the first word segments against a preset word association table to obtain the word segmentation result, where all the segments in the word segmentation result, when concatenated, give the address information.

According to another aspect of the embodiments of the present invention, an address identification device is provided, the device including:

an acquisition module, configured to obtain address information input by a user;

a word segmentation module, configured to perform word segmentation on the address information to obtain a word segmentation result;

a first determination module, configured to determine, according to the word segmentation result, the first transition state of the address information in a probabilistic finite state machine;

a first calculation module, configured to calculate, according to the first transition state, the first probability of the address information in the probabilistic finite state machine;

a second determination module, configured to determine, according to the first transition state, whether the address information contains a typo when the first probability is less than a preset threshold;

an error correction module, configured to correct the typo when the address information contains a typo, and to obtain the second transition state of the corrected address information in the probabilistic finite state machine;

a second calculation module, configured to calculate, according to the second transition state, the second probability of the address information in the probabilistic finite state machine;

a third determination module, configured to determine that the address information input by the user is a valid address when the second probability is greater than or equal to the preset threshold.

According to yet another aspect of the embodiments of the present invention, a computing device is provided, including a processor, a memory, a communication interface and a communication bus, where the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the address identification method.

According to yet another aspect of the embodiments of the present invention, a computer storage medium is provided. At least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform the operations corresponding to the address identification method.

Embodiments of the present invention segment the address information input by the user to obtain a word segmentation result, and determine from that result the transition state of the address information in a probabilistic finite state machine. The probabilistic finite state machine includes state nodes and the transition probabilities between them, and the transition state contains the state nodes corresponding to the address information and the transition probabilities between those nodes. The probability of the user-entered address information in the probabilistic finite state machine is calculated from the transition probabilities of this transition state. When the probability is less than the preset threshold, the address information may contain a typo; the typo is then corrected, and the corrected address information is used to determine whether the address information input by the user is a valid address. Embodiments of the present invention thus identify address information with a probabilistic finite state machine and can still recognize the address when the user's input contains a typo, which improves the user experience.

The above is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features and advantages of the embodiments are easier to understand, specific embodiments of the present invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Figure 1 shows a schematic structural diagram of a deterministic finite state machine;

Figure 2 shows a flow chart of an address identification method provided by the first embodiment of the present invention;

Figure 3 shows a schematic structural diagram of the probabilistic finite state machine in the address identification method provided by the first embodiment of the present invention;

Figure 4 shows a flow chart of an address identification method provided by the second embodiment of the present invention;

Figure 5 shows a flow chart of an address identification method according to the third embodiment of the present invention;

Figure 6 shows a functional block diagram of an address identification device according to the fourth embodiment of the present invention;

Figure 7 shows a schematic structural diagram of a computing device in the fifth embodiment of the present invention.

Detailed description

Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

Application scenarios of the embodiments of the present invention include, but are not limited to, map applications. When an embodiment is applied to a map application, the user enters the address information to be queried in the application's search box. That address may be unfamiliar to the user, so the input may contain typos that the user does not notice. If the address identification method loaded by the map application is based on a traditional deterministic finite state machine, the application reports that the entered address is invalid and the user has to enter it again, which harms the user experience. This application therefore proposes an address identification method that can recognize an address even when the user's input contains typos. Taking the map application scenario as an example, the overall concept of this application is further explained below through specific embodiments.

Figure 2 shows a flow chart of an address identification method according to the first embodiment of the present invention. As shown in Figure 2, the method includes the following steps:

Step 110: Obtain the address information input by the user.

In this step, the address information input by the user is the user's point of interest (POI); a POI is usually a specific piece of address information, such as a house-number address. The user enters the address to be queried in the address search box of the map application and clicks the query button. The map application first obtains the province and city being queried. If the address information input by the user includes the province and city, they are determined from that input. If it does not, parameters related to the input address information are obtained through the positioning module of the map application to determine the province and city in which the queried address lies. These parameters include the longitude and latitude of the input address information and its query frequency; the longitude and latitude can be obtained through a location search engine. The province and city corresponding to the input address information are merged with the input address information to obtain a piece of standard address information. Standard address information is a preset form of address information that contains, in order, the address elements in Table 1; the level of the address elements decreases from top to bottom in Table 1.

Table 1

Step 120: Perform word segmentation on the address information to obtain a word segmentation result.

In this step, if the address information input by the user contains the province and city, the input is segmented directly. If it does not, the input is first converted into standard address information, and the standard address information is then segmented. In one implementation, this step includes the following sub-steps: performing atomic segmentation on the address information to obtain multiple single characters; merging adjacent single characters in different ways to obtain first word segments; and matching the first word segments against a preset word association table to obtain the word segmentation result, where all the segments in the word segmentation result, when concatenated, give the address information.

Atomic segmentation is performed on the address information to obtain multiple single characters. Atomic segmentation splits the address information character by character to reach the smallest segmentation granularity. For example, atomically segmenting "辽宁省沈阳市浑南区" (Hunnan District, Shenyang City, Liaoning Province) yields the Chinese characters "辽", "宁", "省", "沈", "阳", "市", "浑", "南", "区". Atomic segmentation covers Chinese characters, English characters and digits. For English text, adjacent English words are separated by a gap, and the text is split at that gap; for digits, each individual digit is split off as one character. For example, "新隆街6号A Square" is split into "新", "隆", "街", "6", "号", "A", "Square".
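A minimal sketch of atomic segmentation under the rules described above: each Chinese character and each digit becomes its own unit, while a run of English letters stays together as one word. The regular expression and the name `atomic_split` are illustrative assumptions, not the patent's implementation.

```python
import re

# One English word, or one digit, or any other single non-space character
# (which covers individual Chinese characters).
_UNIT = re.compile(r"[A-Za-z]+|\d|\S")


def atomic_split(text):
    """Split address text into atomic units; whitespace is discarded."""
    return _UNIT.findall(text)
```

Applied to "新隆街6号A Square", this yields "新", "隆", "街", "6", "号", "A", "Square", matching the example in the text.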

When adjacent single characters are merged, the merge is driven by the status of each character in a core lexicon. The core lexicon stores a status for single characters and for words; the status indicates whether the current character or word can be merged further with adjacent characters. For example, in the core lexicon, 1 indicates that a single character can be merged with an adjacent character to form a word, 2 indicates that a word can still be merged with other characters to form a longer word, and 3 indicates that a word cannot be merged with any further characters. Take "辽宁省沈阳市浑南区": atomic segmentation yields "辽", "宁", "省", "沈", "阳", "市", "浑", "南", "区". During merging, the status of "辽" is 1, so it can be merged with "宁" to give "辽宁"; the status of "辽宁" in the core lexicon is 2, so it can be merged with the adjacent character to give "辽宁省"; the status of "辽宁省" in the core lexicon is 3, so merging stops. The word "辽宁省" is thus obtained and determined to be a first word segment.
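The status-driven merge can be sketched as a greedy loop. The `STATUS` table below is a hypothetical fragment of a core lexicon, using the status codes 1, 2 and 3 as described above. The greedy strategy is an assumption made for brevity; the text speaks of merging in different ways, which a single greedy pass does not explore.

```python
# Hypothetical core-lexicon fragment: 1 = single character that can merge,
# 2 = word that can still merge, 3 = word that cannot merge further.
STATUS = {
    "辽": 1, "辽宁": 2, "辽宁省": 3,
    "沈": 1, "沈阳": 2, "沈阳市": 3,
    "浑": 1, "浑南": 2, "浑南区": 3,
}


def merge_units(units, status=STATUS):
    """Greedily extend each unit while the lexicon says it can still merge
    and the extended string is itself a lexicon entry."""
    words, i = [], 0
    while i < len(units):
        current = units[i]
        i += 1
        while (
            status.get(current) in (1, 2)
            and i < len(units)
            and (current + units[i]) in status
        ):
            current += units[i]
            i += 1
        words.append(current)
    return words
```

For the nine characters of "辽宁省沈阳市浑南区", the loop produces "辽宁省", "沈阳市" and "浑南区"; characters missing from the lexicon are left as single units.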

In some embodiments, the units obtained by atomic segmentation include English words; adjacent English words are then merged and matched against a preset English lexicon to obtain a first word segment. For example, "Four Seasons Hotel" is split into "Four", "Seasons", "Hotel", and the resulting first word segment is "Four Seasons Hotel". In other embodiments, the units obtained by atomic segmentation include several digits. For example, "101大厦" (101 Building) is split into "1", "0", "1", "大厦", and the resulting first word segments are "101" and "大厦". In still other embodiments, the units obtained by atomic segmentation include traditional Chinese characters; before merging, each traditional character is matched against a preset simplified-traditional lexicon and converted into its simplified form, for example "張" into "张".

In some embodiments, a merged word may be an out-of-vocabulary word, that is, a word not present in the core lexicon; such a word is selectively added to the core lexicon according to its search frequency. For example, if the search frequency of a newly built building exceeds a preset value, the out-of-vocabulary word is being searched regularly and is a commonly used address element, so it is added to the core lexicon. Out-of-vocabulary words may also include proper nouns such as personal names and abbreviations. In some embodiments, if proper nouns are not included in the core lexicon, a dedicated out-of-vocabulary lexicon is preset so that these proper nouns can be separated out for subsequent address identification.

In some embodiments, two first word segments both contain the same character of the address information input by the user; the first word segments are then matched against the preset word association table to determine the optimal segmentation. For example, "商品和服务交易中心" (goods and services trading center) can give rise to two different sets of first word segments. The first consists of the segments "商品、和、服务、交易中心"; the second consists of the segments "商品、和服、务、". The candidate segmentations are matched against the preset word association table, which records start words, end words and word frequencies. For example, "服" is a start word in the word association table, "务" is an end word, and the word frequency of "服务" is relatively high, so the first set is determined to be the optimal segmentation and is used as the word segmentation result. Concatenating the words in the word segmentation result gives the address information input by the user.
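One simple way to pick between competing segmentations is to score each candidate by the word frequencies of its parts. The sketch below does only that: the frequency values, the `UNSEEN` floor and the function names are hypothetical, and the start-word and end-word roles recorded in the word association table are not modelled here.

```python
import math

# Hypothetical relative word frequencies.
WORD_FREQ = {"和": 0.05, "服务": 0.02, "和服": 0.001, "务": 0.0001}
UNSEEN = 1e-9  # assumed floor for words missing from the table


def score(words):
    """Sum of log frequencies; higher means a more plausible segmentation."""
    return sum(math.log(WORD_FREQ.get(w, UNSEEN)) for w in words)


def best_segmentation(candidates):
    """Return the candidate segmentation with the highest score."""
    return max(candidates, key=score)
```

For the ambiguous fragment "和服务", the candidate "和" + "服务" scores higher than "和服" + "务" because "服务" is far more frequent than "务" on its own.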

Step 130: Determine, according to the word segmentation result, the first transition state of the address information in the probabilistic finite state machine.

In this step, the probabilistic finite state machine includes state nodes and the transition probabilities between adjacent state nodes. Figure 3 shows a schematic structural diagram of the probabilistic finite state machine. As shown in Figure 3, the probabilistic finite state machine contains multiple state nodes, including an address-category node representing the start state, an address-category node representing the end state, intermediate address-category nodes and non-address nodes; the state nodes are arranged in descending order of address-element level. The transition state is matched in the probabilistic finite state machine according to the word segmentation result, and it includes the transition order and the transition probabilities between state nodes. For example, suppose the word segmentation result includes "辽宁省" (Liaoning Province) and "沈阳市" (Shenyang City). In the probabilistic finite state machine, the state nodes connected to "辽宁省" comprise fourteen state nodes such as "沈阳市" and "大连市" (Dalian City), and the probability of the connection between "辽宁省" and each of these fourteen state nodes is 1/14. "辽宁省", "沈阳市" and the probability 1/14 between them then form one part of the first transition state; all the state nodes and the transition probabilities between them together form the first transition state of the address information in the probabilistic finite state machine. For example, when the user inputs "沈阳市浑南区桃仙镇机场路1号桃仙国际机场" (Taoxian International Airport, No. 1 Airport Road, Taoxian Town, Hunnan District, Shenyang City), the probabilistic finite state machine matches the transition order "prefecture-level city → district/county → township → road → building unit". The first transition state is then written as a chain of state nodes joined by transition probabilities (the formula itself is not reproduced in this text), where s₀ denotes the start state node, q₀, q₁...q₄ denote intermediate state nodes, q₅ denotes the end state node, and x denotes the transition probability between adjacent nodes. It should be understood that the first transition state follows the order of the segments in the word segmentation result. In general, if the address information is entered in the form of standard address information, the state nodes in the first transition state are traversed in descending order of address-element level. When the transition state matched by the address information covers the entire path from the start node to the end node, the address information may be valid address information. The probabilistic finite state machine is trained on address information historically input by users; the training process is explained in a later embodiment and is not repeated here.
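A minimal sketch of matching a segmentation result to a path in a probabilistic finite state machine. The transition values follow the Figure 3 example quoted in Step 140 (1, 0.4, 0.2, 0.6, 1); the `NODE_OF` lookup table, the node names and the function name are hypothetical. A segment that is not a known address element is mapped to a non-address node, and a missing arc is reported as `None`.

```python
# Transition probabilities between address-element levels (values taken from
# the Figure 3 example in the text; the dictionary layout is an assumption).
TRANSITIONS = {
    ("START", "city"): 1.0,
    ("city", "district"): 0.4,
    ("district", "township"): 0.2,
    ("township", "road"): 0.6,
    ("road", "street_number"): 1.0,
}

# Hypothetical gazetteer mapping segments to state nodes.
NODE_OF = {
    "沈阳市": "city", "浑南区": "district", "桃仙镇": "township",
    "机场路": "road", "1号": "street_number",
}


def first_transition_state(segments):
    """Return a list of (source node, target node, probability) triples,
    following the order of the segments."""
    state, path = "START", []
    for segment in segments:
        node = NODE_OF.get(segment, "non_address")
        path.append((state, node, TRANSITIONS.get((state, node))))
        state = node
    return path
```

The resulting triples carry both the transition order and the transition probabilities, which is the information the first transition state is described as containing.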

Step 140: Calculate the first probability of the address information in the probabilistic finite state machine according to the migration probabilities corresponding to the first migration state.

In this step, all migration probabilities in the first migration state are multiplied together to obtain the first probability of the address information in the probabilistic finite state machine. For any address information input by the user, the word segmentation result contains k segments. If the migration probabilities along the first migration state of these k segments are x₁, x₂, …, xₖ, the first probability is P = x₁ × x₂ × … × xₖ. For example, when the user inputs "沈阳市浑南区桃仙镇机场路1号" (No. 1 Airport Road, Taoxian Town, Hunnan District, Shenyang City) and the probabilistic finite state machine of Figure 3 is used, the first migration state is "start state → prefecture-level city → district/county → township → road → street number", and the first probability is 1 × 0.4 × 0.2 × 0.6 × 1 = 0.048.
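
The multiplication in step 140 can be sketched as follows. This is a minimal illustration, not the patented implementation: the category names, the `MIGRATION` table layout, and the assignment of each factor of the worked example to a specific edge (taken in path order) are assumptions made here for illustration.

```python
from math import prod

# Hypothetical migration table: (from_category, to_category) -> probability.
# The numbers are the factors of the worked example, assigned to edges in
# path order; the table layout itself is an assumption.
MIGRATION = {
    ("start", "prefecture_city"): 1.0,
    ("prefecture_city", "district"): 0.4,
    ("district", "township"): 0.2,
    ("township", "road"): 0.6,
    ("road", "street_number"): 1.0,
}


def path_probability(path, migration=MIGRATION):
    """Multiply the migration probabilities of consecutive node pairs.

    A pair that is missing from the table contributes probability 0.
    """
    return prod(migration.get(pair, 0.0) for pair in zip(path, path[1:]))


path = ["start", "prefecture_city", "district", "township", "road", "street_number"]
first_probability = path_probability(path)  # approximately 0.048
```

Returning 0 for an unknown edge is one possible design choice; the source instead routes unmatched text through a non-address node with a small probability.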

Step 150: When the first probability is less than the preset threshold, determine whether the address information contains a typo according to the first migration state.

In this step, the preset threshold is a manually set value; when the first probability reaches the preset threshold, the address information can be determined to be a valid address. The state nodes of the probabilistic finite state machine include address nodes and non-address nodes. An address node represents a valid address element, such as "辽宁省" (Liaoning Province) or "沈阳市" (Shenyang City); a non-address node represents anything other than a valid address element, such as "沈洋市", a miswriting of "沈阳市". When the first probability is less than the preset threshold, the cause may be a typo in the address information: the mistyped segment cannot be matched to the correct address node, which lowers the first probability. If the first migration state contains a non-address node, and the migration probabilities of the state nodes before the non-address node are all greater than the preset migration probability, the address information is determined to contain a typo. For example, let the preset threshold be 0.03 and the preset migration probability be 0.1. When the user inputs "沈阳市浑南区桃仙镇桃先国际机场机场路1号" (in which "桃先" is a miswriting of "桃仙"), the state nodes in the first migration state are: start state → prefecture-level city → district/county → township → non-address node → road → street number. Every address node migrates to a non-address node with the same probability. When a non-address node is present, the migration is considered to pass through the non-address node and continue to the next address node, so the end node can still be reached. The first probability is 1 × 0.4 × 0.2 × 0.02 × 0.6 × 1 = 0.00096, far below the preset threshold. The path contains a non-address node, and the migration probabilities of the nodes before and after it are 1 and 0.4 respectively, both greater than the preset migration probability, so the address information is determined to contain a typo.
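
The check in step 150 can be sketched as follows. This is an illustrative reading rather than the patented implementation: the source does not pin down exactly which edges count as "before" and "after" the non-address node, so the sketch assumes they are the probability of the migration into the immediately preceding node and into the immediately following node. The function name and the `(is_address_node, incoming_probability)` representation are hypothetical.

```python
from math import prod


def contains_typo(steps, threshold=0.03, min_migration=0.1):
    """Decide whether a low-probability path is explained by a typo.

    steps: list of (is_address_node, incoming_probability) in path order.
    Assumption: "before" and "after" mean the incoming probabilities of the
    nodes immediately preceding and following the non-address node.
    """
    if prod(p for _, p in steps) >= threshold:
        return False  # the path probability is high enough; nothing to check
    for i, (is_address, _) in enumerate(steps):
        if is_address or i == 0 or i + 1 >= len(steps):
            continue
        before = steps[i - 1][1]
        after = steps[i + 1][1]
        if before > min_migration and after > min_migration:
            return True
    return False


# Worked example: city, district, township, non-address node, road, number.
steps = [(True, 1.0), (True, 0.4), (True, 0.2), (False, 0.02), (True, 0.6), (True, 1.0)]
has_typo = contains_typo(steps)  # True under the assumptions above
```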

When the address information does not contain a typo, the address information input by the user is determined to be invalid.

Step 160: When the address information contains a typo, correct the typo to obtain the second migration state of the corrected address information in the probabilistic finite state machine.

In this step, since most users type with a Pinyin input method, a typo is most likely a homophone of the intended character, so error correction targets homophones. The text corresponding to the non-address node in the first migration state is taken as the error correction word; the error correction word is converted into pinyin; the preset typo lexicon is searched for the error correction word objects that match this pinyin; the word with the highest search count among those objects is taken as the error correction comparison object; the ratio of the search count of the error correction word to the search count of the error correction comparison object is calculated; if the ratio is less than a preset ratio, the error correction word is replaced with the error correction comparison object; and the migration state of the replaced address information in the probabilistic finite state machine is determined as the second migration state. The typo lexicon stores each pinyin together with all words that correspond to it. The search count of the error correction comparison object is the number of times that word is recorded in the search log, and the search count of the error correction word is likewise the number of times it is recorded in the search log. In some embodiments, the search count of the error correction word is determined by the empirical formula Num*(1+Len/10+IsStr), where Num is the number of times the error correction word is recorded in the search log, Len is the length of the error correction word, and IsStr is 0 when the error correction word is Chinese and 1 when it is pinyin. For example, when the user inputs "沈阳市浑南区桃先镇桃先国际机场机场路1号", the corrected address information is "沈阳市浑南区桃仙镇桃仙国际机场机场路1号" (No. 1 Airport Road, Taoxian International Airport, Taoxian Town, Hunnan District, Shenyang City).
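
The homophone correction of step 160 can be sketched as follows. This is a toy illustration under stated assumptions, not the patented implementation: the pinyin table, the lexicon, the search-log counts, and the preset ratio `max_ratio=0.1` are all hypothetical, and a real system would obtain pinyin from a conversion library rather than a hand-written dictionary. The comparison object uses its raw log count and the error correction word uses the empirical formula, as described in the text.

```python
# All data below is hypothetical and exists only to make the sketch runnable.
PINYIN_OF = {"桃先": "taoxian", "桃仙": "taoxian"}   # word -> pinyin
LEXICON = {"taoxian": ["桃仙", "桃先"]}               # pinyin -> homophones
SEARCH_LOG_COUNTS = {"桃仙": 9000, "桃先": 30}        # word -> log records


def weighted_count(word, is_pinyin=False):
    """Empirical formula Num*(1+Len/10+IsStr) from the description."""
    num = SEARCH_LOG_COUNTS.get(word, 0)
    return num * (1 + len(word) / 10 + (1 if is_pinyin else 0))


def correct(word, max_ratio=0.1):
    """Replace `word` with its most-searched homophone when it is rare."""
    candidates = LEXICON.get(PINYIN_OF.get(word, ""), [])
    if not candidates:
        return word  # no homophones known; leave the word unchanged
    best = max(candidates, key=lambda w: SEARCH_LOG_COUNTS.get(w, 0))
    best_count = SEARCH_LOG_COUNTS.get(best, 0)
    if best == word or best_count == 0:
        return word
    ratio = weighted_count(word) / best_count
    return best if ratio < max_ratio else word


corrected = correct("桃先")  # "桃仙" with the toy counts above
```

With the toy counts, the weighted count of "桃先" is 30 × (1 + 2/10 + 0) = 36, so the ratio 36/9000 falls well below the assumed preset ratio and the word is replaced.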

Step 170: Calculate the second probability of the address information in the probabilistic finite state machine according to the second migration state.

When the second probability is calculated according to the second migration state, the second migration state contains no non-address node. For example, when the user inputs "沈阳市浑南区桃仙镇桃先国际机场机场路1号", the state nodes in the second migration state after correction are: prefecture-level city + district/county + township + road + street number, and the second probability is 1 × 0.4 × 0.2 × 0.6 × 1 = 0.048.

Step 180: When the second probability is greater than or equal to the preset threshold, determine that the address information input by the user is a valid address.

In this step, if the second probability is less than the preset threshold, the address information input by the user is determined to be an invalid address.
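
The decisions of steps 150 through 180 in the first embodiment can be summarized in a short sketch. The function name, its signature, and the string return values are hypothetical; the default threshold of 0.03 is the value used in the worked example.

```python
def classify(first_probability, has_typo, second_probability=None, threshold=0.03):
    """Return "valid" or "invalid" following the first embodiment's decisions."""
    if first_probability >= threshold:
        return "valid"      # the first probability already reaches the threshold
    if not has_typo:
        return "invalid"    # low probability that no typo explains
    if second_probability is not None and second_probability >= threshold:
        return "valid"      # step 180: the corrected address passes
    return "invalid"


decision = classify(0.00096, True, second_probability=0.048)
```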

Figure 4 shows a flowchart of an address identification method according to the second embodiment of the present invention. Compared with the first embodiment, this embodiment further includes the following steps after step 170, as shown in Figure 4:

Step 210: When the second probability is less than the preset threshold, determine the migration order of the state nodes.

In this embodiment, the address information may contain no typo and the first migration state may reach the end node, yet the first probability is still less than the preset threshold. This may be because the state nodes are in the wrong order. For example, for the input "沈阳市桃仙国际机场浑南区桃仙镇" (Shenyang City, Taoxian International Airport, Hunnan District, Taoxian Town), the migration order is prefecture-level city → street → district/county → township.

Step 220: When the migration order is inconsistent with the preset order, adjust the migration order to be consistent with the preset order.

In this step, the preset order arranges the address elements from the highest level to the lowest. Taking the input "沈阳市桃仙国际机场浑南区桃仙镇机场路" from step 210 as an example, the adjusted order is "沈阳市浑南区桃仙镇桃仙国际机场" (Shenyang City, Hunnan District, Taoxian Town, Taoxian International Airport).
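
The reordering of step 220 can be sketched as a sort by address-element level. The `LEVEL` ranking and the category assigned to each segment are assumptions made for illustration (in particular, the airport is treated here as a building unit); the source only states that the preset order runs from the highest level to the lowest.

```python
# Hypothetical ranking of address-element categories, highest level first.
LEVEL = {
    "province": 0,
    "prefecture_city": 1,
    "district": 2,
    "township": 3,
    "road": 4,
    "building_unit": 5,
    "street_number": 6,
}


def reorder(segments):
    """Sort (text, category) pairs from the highest level to the lowest.

    sorted() is stable, so segments of the same level keep their input order.
    """
    return sorted(segments, key=lambda seg: LEVEL[seg[1]])


segments = [
    ("沈阳市", "prefecture_city"),
    ("桃仙国际机场", "building_unit"),
    ("浑南区", "district"),
    ("桃仙镇", "township"),
]
adjusted = "".join(text for text, _ in reorder(segments))
```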

Step 230: Calculate the third probability of the address information in the probabilistic finite state machine according to the adjusted migration state.

In this step, the third probability is calculated in the same way as in step 140; refer to the description of step 140, which is not repeated here.

Step 240: If the third probability is greater than or equal to the preset threshold, determine that the address information input by the user is a valid address.

In this step, if the third probability is less than the preset threshold, the address information input by the user is an invalid address.

This embodiment adjusts the migration order when the order of the address information input by the user is inconsistent with the preset order, so that the address information can still be judged effectively in that case.

Figure 5 shows a flowchart of an address identification method according to the third embodiment of the present invention. Before the steps of the first and second embodiments are executed, this embodiment further includes the following steps:

Step 310: Obtain address information historically input by users to obtain training samples.

In this step, the historically input address information is the address information entered by all users while using map application software; each piece of address information serves as one training sample.

Step 320: Extract the state nodes of the finite state machine from the training samples.

In this step, each training sample is segmented, and the segmentation results are used as state nodes. The segmentation process is the same as in step 120 of the first embodiment; refer to the description of step 120, which is not repeated here.

Step 330: Determine the migration paths of the training samples between the state nodes.

A migration path comprises the migration order between the state nodes corresponding to each piece of address information.

Step 340: According to the migration paths, calculate the migration probabilities between adjacent state nodes through a hidden Markov model.

The migration probabilities are determined from the migration paths of all training samples. For example, suppose the training samples contain 60 million pieces of address information, in which migrations out of the state node "辽宁省" (Liaoning Province) occur 20 million times, and 10 million of those migrations go to the state node "沈阳市" (Shenyang City). Then the migration probability of "辽宁省 → 沈阳市" is 0.5.
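
The counting in this example can be sketched as a relative-frequency estimate over the training paths. This reproduces only the ratio shown in the worked example; it is a simplification and not a full hidden Markov model training procedure, and the function name and toy paths are hypothetical.

```python
from collections import Counter


def estimate_migration_probabilities(paths):
    """Estimate P(b | a) as count(a -> b) / count(a -> anything)."""
    pair_counts = Counter()
    source_counts = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            pair_counts[(a, b)] += 1
            source_counts[a] += 1
    return {pair: n / source_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy training paths: half of the migrations out of "辽宁省" go to "沈阳市".
paths = [
    ["辽宁省", "沈阳市"],
    ["辽宁省", "沈阳市"],
    ["辽宁省", "大连市"],
    ["辽宁省", "鞍山市"],
]
probabilities = estimate_migration_probabilities(paths)
```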

Step 350: Take the finite state machine that contains the migration probabilities between the state nodes as the probabilistic finite state machine.

The start state nodes and end state nodes are determined according to the migration paths. The start state nodes include all state nodes, such as provinces, cities, and districts/counties; that is, every state node can serve as a start state node. The end state nodes include roads, building units, and the like. A start state node and an end state node may be the same state node.

This embodiment constructs the probabilistic finite state machine from address information historically input by users, so that address information input by a user can be identified with the constructed machine.

Figure 6 shows a functional block diagram of an address identification device according to the fourth embodiment of the present invention. As shown in Figure 6, the device includes: an acquisition module 410, configured to acquire address information input by a user; a word segmentation module 420, configured to segment the address information to obtain a word segmentation result; a first determination module 430, configured to determine, according to the word segmentation result, the first migration state of the address information in the probabilistic finite state machine; a first calculation module 440, configured to calculate the first probability of the address information in the probabilistic finite state machine according to the first migration state; a second determination module 450, configured to determine, when the first probability is less than the preset threshold, whether the address information contains a typo according to the first migration state; an error correction module 460, configured to correct the typo when the address information contains one, obtaining the second migration state of the corrected address information in the probabilistic finite state machine; a second calculation module 470, configured to calculate the second probability of the address information in the probabilistic finite state machine according to the second migration state; and a third determination module 480, configured to determine that the address information input by the user is a valid address when the second probability is greater than or equal to the preset threshold.

In an optional manner, the error correction module 460 is further configured to:

In an optional manner, the device further includes: a fourth determination module 490, configured to determine the migration order of the state nodes when the address information contains a typo; an adjustment module 400, configured to adjust the migration order to be consistent with the preset order when the two are inconsistent; a third calculation module 401, configured to calculate the third probability of the address information in the probabilistic finite state machine according to the adjusted migration state; and a fifth determination module 402, configured to determine that the address information input by the user is a valid address when the third probability is greater than or equal to the preset threshold.

In an optional manner, the device further includes:

a first acquisition module 403, configured to acquire address information historically input by users to obtain training samples;

an extraction module 404, configured to extract the state nodes of the finite state machine from the training samples;

a sixth determination module 405, configured to determine the migration paths of the training samples between the state nodes;

a fourth calculation module 406, configured to calculate, according to the migration paths, the migration probabilities between adjacent state nodes through a hidden Markov model;

a seventh determination module 407, configured to take the finite state machine that contains the migration probabilities between the state nodes as the probabilistic finite state machine.

In an optional manner, the word segmentation module 420 is further configured to:

In this embodiment, the word segmentation module 420 segments the address information input by the user to obtain a word segmentation result, and the first determination module 430 determines the migration state of the address information in the probabilistic finite state machine according to that result. The probabilistic finite state machine comprises state nodes and the migration probabilities between them, and the migration state contains the state nodes that correspond to the address information and the migration probabilities between those nodes. The probability of the address information in the probabilistic finite state machine is calculated from the migration probabilities of the migration state. When this probability is less than the preset threshold, the address information may contain a typo; the error correction module 460 then corrects the typo, and whether the address information is a valid address is determined from the corrected address information. This embodiment thus identifies address information with a probabilistic finite state machine and can still identify it effectively when it contains typos, which improves the user experience.

An embodiment of the present invention provides a non-volatile computer storage medium that stores at least one executable instruction; the computer-executable instruction can execute the address identification method of any of the above method embodiments.

Figure 7 shows a schematic structural diagram of a computing device according to the fifth embodiment of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.

As shown in Figure 7, the computing device may include a processor 502, a communications interface 504, a memory 506, and a communications bus 508.

The processor 502, the communications interface 504, and the memory 506 communicate with one another through the communications bus 508. The communications interface 504 is used to communicate with network elements of other devices, such as clients or other servers. The processor 502 is configured to execute a program 510 and may specifically perform the relevant steps of the address identification method embodiments above.

Specifically, the program 510 may include program code, and the program code includes computer operating instructions.

The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors of the computing device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs together with one or more ASICs.

The memory 506 is used to store the program 510. The memory 506 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.

The program 510 may specifically cause the processor 502 to execute steps 110 to 180 in Figure 2, steps 210 to 240 in Figure 4, and steps 310 to 350 in Figure 5, and to implement the functions of modules 410 through 407 in Figure 6.

The algorithms or displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such systems is apparent. Furthermore, embodiments of the present invention are not directed to any particular programming language. It should be understood that the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of carrying out the invention.

Numerous specific details are set forth in the description provided here. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules of the device in an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and they may also be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that although some embodiments herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination. It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names. Unless otherwise specified, the steps in the above embodiments should not be understood as limiting the order of execution.

Claims

1. An address identification method, the method comprising:

acquiring address information input by a user;

performing word segmentation on the address information to obtain a word segmentation result;

determining a first migration state of the address information in a probabilistic finite state machine according to the word segmentation result, wherein the probabilistic finite state machine comprises state nodes and migration probabilities between the state nodes, and the state nodes comprise address nodes and non-address nodes;

calculating a first probability of the address information in the probabilistic finite state machine according to the migration probabilities corresponding to the first migration state;

if the first probability is less than a preset threshold, determining, according to the first migration state, whether the address information contains a typo, which comprises: if the first probability is less than the preset threshold, determining the state nodes contained in the first migration state and the migration probabilities between those state nodes; if the state nodes contained in the first migration state comprise a non-address node, determining a first migration probability of the state node before the non-address node and a second migration probability of the state node after the non-address node; and if the first migration probability and the second migration probability are both greater than a preset migration probability, determining that the address information contains a typo;

if the address information contains a typo, correcting the typo to obtain a second migration state of the corrected address information in the probabilistic finite state machine;

calculating a second probability of the address information in the probabilistic finite state machine according to the second migration state;

and if the second probability is greater than or equal to the preset threshold, determining that the address information input by the user is a valid address.

2. The method of claim 1, wherein, if the address information contains a typo, correcting the typo to obtain a second migration state of the corrected address information in the probabilistic finite state machine comprises:

determining text information corresponding to the non-address node contained in the first migration state as an error correction word;

converting the error correction words into pinyin;

searching a preset typo lexicon, according to the pinyin, for error correction word objects that match the pinyin;

taking the word with the highest searching frequency in the error correction word object as an error correction comparison object;

calculating the proportion between the searching times of the error correction words and the searching times of the error correction comparison objects;

If the specific gravity is smaller than the preset specific gravity, replacing the error correction word with the error correction comparison object;

and determining the migration state of the replaced address information in the probability finite state machine as the second migration state.
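The pinyin-based replacement steps above might look like the following sketch. The `PINYIN` table, the `TYPO_LIBRARY` contents and the search counts are hypothetical stand-ins for a full pinyin dictionary and the preset library, and the library is assumed to record a search count for the suspect word as well as for the candidates.

```python
# Hypothetical character-to-pinyin table; a real system would need a full dictionary.
PINYIN = {"沈": "shen", "审": "shen", "阳": "yang", "洋": "yang"}

# Hypothetical wrongly-written-word library: pinyin -> {word: search count}.
TYPO_LIBRARY = {
    "shenyang": {"沈阳": 9500, "审阳": 12, "沈洋": 30},
}


def to_pinyin(word):
    """Concatenate the pinyin of each character; unknown characters pass through."""
    return "".join(PINYIN.get(ch, ch) for ch in word)


def correct_word(word, preset_ratio):
    """Return the replacement for `word`, or `word` itself if none applies."""
    candidates = TYPO_LIBRARY.get(to_pinyin(word), {})
    if not candidates:
        return word
    best = max(candidates, key=candidates.get)  # highest search frequency
    if candidates[best] == 0:
        return word
    ratio = candidates.get(word, 0) / candidates[best]
    return best if ratio < preset_ratio else word
```

For example, `correct_word("审阳", 0.1)` returns `"沈阳"`, because the suspect word is searched far less often than the most frequent word sharing its pinyin.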

3. The method of claim 1, wherein if the address information does not contain wrongly written words, the method further comprises:

determining the migration sequence of the state nodes;

when the migration sequence is inconsistent with a preset sequence, adjusting the migration sequence to be consistent with the preset sequence;

calculating a third probability of the address information in the probability finite state machine according to the adjusted migration state;

and if the third probability is greater than or equal to the preset threshold, determining that the address information input by the user is a valid address.
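The reordering step of claim 3 can be illustrated with a short sketch. The preset order of address levels and the node names are assumptions made for the example; the claim does not fix them.

```python
# Hypothetical preset order of address levels, from coarse to fine.
PRESET_ORDER = ["province", "city", "district", "road", "number"]
RANK = {node: i for i, node in enumerate(PRESET_ORDER)}


def is_consistent(segments):
    """True if the state nodes already follow the preset order."""
    ranks = [RANK[node] for node, _ in segments]
    return ranks == sorted(ranks)


def adjust_order(segments):
    """Reorder (state_node, text) pairs so that they follow the preset order."""
    if is_consistent(segments):
        return list(segments)
    return sorted(segments, key=lambda pair: RANK[pair[0]])
```

An input such as `[("city", "沈阳市"), ("province", "辽宁省"), ("road", "青年大街")]` is rearranged so that the province comes first, after which the third probability would be computed on the adjusted path.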

4. The method of claim 1, wherein prior to obtaining the address information entered by the user, the method further comprises:

acquiring address information historically input by users to obtain training samples;

extracting state nodes of a finite state machine from the training samples;

determining migration paths of the training samples between the state nodes;

calculating migration probabilities between adjacent state nodes through a hidden Markov model according to the migration paths;

and taking a finite state machine containing migration probability among the state nodes as a probability finite state machine.
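As a rough illustration of how migration probabilities could be derived from training paths, the sketch below uses plain relative-frequency counting over fully observed state paths. This is a simplification swapped in for the hidden Markov model named in the claim, and the node names and sample paths are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_migration_probabilities(paths):
    """Estimate P(dst | src) as the share of transitions leaving src that reach dst."""
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        (src, dst): n / sum(dsts.values())
        for src, dsts in counts.items()
        for dst, n in dsts.items()
    }


# Hypothetical migration paths extracted from historical address inputs.
training_paths = [
    ["province", "city", "road"],
    ["province", "city", "district"],
    ["province", "district"],
]
probabilities = estimate_migration_probabilities(training_paths)
```

Here two of the three transitions leaving `province` reach `city`, so that migration probability is estimated as 2/3.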

5. The method of claim 1, wherein performing Chinese word segmentation on the address information to obtain a word segmentation result comprises:

performing atomic segmentation on the address information to obtain a plurality of single words;

combining adjacent single words according to different combining modes to obtain a first word segmentation;

and matching the first word segmentation with a preset word association table to obtain a word segmentation result.
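The three segmentation steps can be sketched as below. The preset word association table is modeled here as a simple set of known words, and the greedy longest-match selection is an assumption; the claim only states that the combined candidates are matched against the table. `WORD_TABLE` and `max_len` are hypothetical.

```python
# Hypothetical stand-in for the preset word association table.
WORD_TABLE = {"辽宁省", "沈阳市", "和平区", "青年大街"}


def atomic_split(text):
    """Atomic segmentation: one single-character word per character."""
    return list(text)


def candidate_words(chars, max_len=4):
    """Combine adjacent characters into every run of up to max_len, keyed by start index."""
    return {
        i: ["".join(chars[i:j])
            for j in range(i + 1, min(i + max_len, len(chars)) + 1)]
        for i in range(len(chars))
    }


def segment(text, max_len=4):
    """Keep the longest candidate found in the table, else fall back to one character."""
    chars = atomic_split(text)
    candidates = candidate_words(chars, max_len)
    result, i = [], 0
    while i < len(chars):
        known = [w for w in candidates[i] if w in WORD_TABLE]
        word = max(known, key=len) if known else chars[i]
        result.append(word)
        i += len(word)
    return result
```

For instance, `segment("辽宁省沈阳市青年大街")` yields `["辽宁省", "沈阳市", "青年大街"]`.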

6. The method of claim 1, wherein after obtaining the first word segmentation, the method further comprises:

if the first word segmentation contains an unregistered word, taking the unregistered word as a word segmentation result when the search frequency of the unregistered word is greater than a preset value.
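The unregistered-word rule can be expressed in a few lines. The sketch assumes that search frequencies are available in a lookup table; `SEARCH_FREQUENCY`, the sample words and the counts are hypothetical.

```python
# Hypothetical search-frequency records for words absent from the word table.
SEARCH_FREQUENCY = {"浑南新区": 420, "青年大接": 3}


def accept_unregistered(word, registered_words, preset_value):
    """Keep a word if it is registered, or if it is searched often enough."""
    if word in registered_words:
        return True
    return SEARCH_FREQUENCY.get(word, 0) > preset_value
```

With a preset value of 100, `"浑南新区"` would be kept as a word segmentation result, while `"青年大接"` would not.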

7. An address identifying apparatus, the apparatus comprising:

the acquisition module is used for acquiring address information input by a user;

the word segmentation module is used for segmenting the address information to obtain a word segmentation result;

the first determining module is used for determining a first migration state of the address information in a probability finite state machine according to the word segmentation result; the probability finite state machine comprises state nodes and migration probabilities between the state nodes; the state nodes comprise address nodes and non-address nodes;

the first calculation module is used for calculating a first probability of the address information in the probability finite state machine according to the first migration state; the state node comprises an address node and a non-address node;

the second determining module is used for determining, according to the first migration state, whether the address information contains wrongly written words when the first probability is smaller than a preset threshold, wherein the determining includes: if the first probability is smaller than the preset threshold, determining the migration probabilities between the state nodes contained in the first migration state; if the state nodes contained in the first migration state comprise a non-address node, determining a first migration probability of the state nodes before the non-address node and a second migration probability of the state nodes after the non-address node; if the first migration probability and the second migration probability are both greater than a preset migration probability, determining that the address information contains wrongly written words;

the error correction module is used for correcting the wrongly written words when the address information contains wrongly written words, and obtaining a second migration state of the address information after error correction in the probability finite state machine;

the second calculation module is used for calculating a second probability of the address information in the probability finite state machine according to the second migration state;

and the third determining module is used for determining that the address information input by the user is a valid address when the second probability is greater than or equal to the preset threshold.

8. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other through the communication bus;

the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform operations corresponding to the address identification method according to any one of claims 1 to 6.

9. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the address identification method according to any one of claims 1 to 6.