HK1251672B

HK1251672B - System and method for automated address verification

Info

Publication number: HK1251672B
Application number: HK18110864.2A
Authority: HK
Inventors: 桐‧陈; S‧巴勒斯基; A‧雷斯克卡拉
Original assignee: 环联公司
Priority date: 2015-11-24
Filing date: 2016-11-22
Publication date: 2023-12-29

Description

System and method for automatic address verification

Cross reference to related applications

The present application claims the benefit of U.S. provisional patent application No. 62/259,507, filed on 11 months 24 2015, the entire contents of which are incorporated herein by reference.

Technical Field

The invention relates to an address matching system and method. More particularly, various embodiments of the present invention provide a system and method for address matching by: extracting certain strings or tokens from the address; storing the extracted attributes and associated modifiers in an address attribute container; and comparing attributes in a hierarchical manner based on information built in the address attribute container.

Background

The consumer loan industry makes decisions to grant credit or to issue loans or to give priority to consumers of credit or loan conditions based on the general principle of risk (i.e., the risk of losing collateral redemption). Credit and loan institutions typically avoid granting credit or loans to high risk consumers, or may grant credit or loans to such consumers at higher interest rates or under other conditions than those typically granted to low risk consumers. Consumer data including consumer credit information is collected and used by credit bureaus, financial institutions and other entities to evaluate the credit value and aspects of the consumer's financial and credit records.

In many emerging and developing markets, available consumer data may be of lower quality than consumer data available in developed markets. For example, records of consumer data may not contain unique identification numbers, address formats may vary, birth dates may be unreliable or non-existent, naming conventions may vary, and specific names and surnames may be very popular and reused in a large population. Traditional consumer data search algorithms commonly used in developed markets do not always perform well for consumer data in emerging markets. Such conventional algorithms rely on consistent formatting of consumer data, more complete information, and information in discrete fields such as house numbers, street names, telephones, postal codes, and identification numbers. In developed markets, searches for consumer data may be performed relatively quickly by using well-indexed relational database keys that utilize a single field (e.g., identification number or phone) or a combined key (e.g., date of birth and name, and house number).

In particular, matching addresses in consumer data may be useful in many situations, such as determining whether database records should be consolidated, deduplication of addresses for particular consumers, verifying address matches during the dispute process, or other situations. Matching addresses contained in a single field using conventional algorithms may result in over-matching, i.e., false positives, of virtually non-matching addresses having similar alphabetic and/or numeric values; and/or an under-match of addresses that actually match but are not detected as matching, i.e., a false negative. Thus, if false positives are included and/or false negatives are not included, the usefulness of search results that are further filtered based on address-based matches may be reduced. Furthermore, merging records based on false positives and/or false negatives of matching addresses may also result in incorrect database records.

Accordingly, there is a need for an improved system and method that can accurately match addresses and address formatting and quality issues of consumer data that may exist in emerging markets, in order to reduce over-matching and under-matching of addresses, among other things.

Disclosure of Invention

The present invention seeks to address the above problems by providing a system and method for matching addresses. In systems that manipulate address records, several database columns (e.g., country, province, city, street address lines, etc.) are typically used to store addresses. Although the contents of countries, provinces, and cities are generally explicit, street address lines are typically text in free format. When querying or comparing address records, the matching of street address lines is the biggest challenge of the matching algorithm.

Various embodiments of the present invention include a system and method for determining whether two or more addresses match by causing a processor to extract certain strings or tokens from the addresses into an address attribute container, and comparing the extracted attributes and associated modifiers in a hierarchical manner based on information built into the address attribute container. More specifically, in various embodiments, after verifying that the contents of the country, province, and city are matched, the address matching system and method of the present invention compares the street address portions of the addresses to determine whether two or more addresses match. Street addresses are considered alphanumeric free-form text in which some strings or tokens are defined as primary attributes and the strings or tokens surrounding each primary attribute are defined as associated modifiers. In one embodiment, the primary attribute of the address is a number in free-form text from the street address. The text strings on either side of each primary attribute are referred to as modifiers associated with the primary attribute. The address matching system and method of the present invention includes extracting primary attributes and their associated modifiers from a street address, storing the primary attributes and associated modifiers in an address attribute container, and comparing addresses in a hierarchical manner based on information built into the address attribute container.

In this embodiment, the first stage or layer of matching includes comparing the primary attributes of the addresses for mismatch determination. If at least one primary attribute of the target address conflicts with a primary attribute of the base address, a mismatch determination result may result and vice versa. The mismatch determination may be made immediately without regard to any other part of the address. If the primary attribute comparison results in a mismatch determination, the processor rejects the address as a mismatch. If the primary attribute of the first address does not conflict with the primary attribute of the second address, and vice versa, a second phase or layer of matching is initiated.

In this embodiment, the matching second layer includes comparing modifiers associated with each primary attribute. A modifier associated with a first primary attribute of a first address is compared to a modifier associated with a selected primary attribute of a second address. In some embodiments, the modifiers themselves are analyzed and fed back into the main algorithm to extract the primary attributes from within themselves based on the nature and complexity of the address data. If it is determined that the primary attributes no longer need to be extracted from the modifier, then the second layer also becomes the last layer in the overall address matching process.

It should be appreciated that the comparison threshold for modifiers varies based on address characteristics including primary attribute matching results and external factors such as market and address sources. For example, when comparing modifiers of addresses having stronger primary attribute matches (i.e., a higher number of matching primary attributes), the threshold is different than when comparing modifiers of addresses having weaker primary attribute matches (i.e., a lower number of matching primary attributes). The likelihood that the two addresses for which the higher number of principal attributes match exactly is the same is higher than for two addresses with weaker principal attribute matches.

These and other embodiments, as well as various permutations and aspects, will become apparent from and fully understood from the following detailed description and drawings, which set forth illustrative embodiments, indicative of the various ways in which the principles of the invention may be employed.

Drawings

FIG. 1 is a block diagram illustrating a system including an address matching engine.

FIG. 2 is a block diagram illustrating an address matching engine for matching addresses based on their primary attributes and associated modifiers and hierarchical matching analysis of the addresses.

FIG. 3 is a block diagram of one form of the computer or server of FIGS. 1 and 2 having memory elements with computer-readable media for implementing a system including an address matching engine.

FIG. 4 is a flowchart illustrating the operation of one embodiment of performing hierarchical address matching analysis of the present invention.

Fig. 5A-5C are diagrams illustrating the operation of one embodiment of performing hierarchical address matching analysis of the present invention.

Fig. 6 is a diagram illustrating the operation of an example embodiment of performing the hierarchical address matching analysis of the present invention as described in fig. 5B-5C. .

Detailed Description

The following description describes, illustrates, and exemplifies one or more specific embodiments of the invention in accordance with the principles of the invention. This description is not provided to limit the invention to the embodiments described herein, but to explain and teach the principles of the invention in such a way that one skilled in the art can understand these principles and, based on the understanding, can apply the principles to practice not only the embodiments described herein, but also other embodiments that are possible in accordance with these principles. The scope of the invention is intended to cover all such embodiments as fall within the scope of the appended claims, either literally or under the doctrine of equivalents.

It should be noted that in the description and drawings, similar or substantially similar elements may be labeled with the same reference numerals. However, at times these elements may be labeled with different numbers, such as where such labeling facilitates a clearer description, for example. In addition, the drawings set forth herein are not necessarily drawn to scale and in some cases the proportions may have been exaggerated to more clearly depict certain features. Such labeling and drawing practices are not necessarily involved in the underlying substantial purposes. As described above, the description is intended to be regarded as a whole and interpreted according to the principles of the present invention as taught herein and understood by those of ordinary skill in the pertinent art.

Various embodiments of the present invention include a system and method for determining whether two or more addresses match by: the processor is caused to extract certain strings or tokens from the address into an address attribute container and compare the extracted attributes with associated modifiers in a hierarchical manner based on information built into the address attribute container. More specifically, in various embodiments, after verifying that the contents of the country, province, and city are matched, the address matching system and method of the present invention compares the street address portions of the addresses to determine whether two or more addresses match. Street addresses are considered alphanumeric free-form text in which some strings or tokens are defined as primary attributes, while strings or tokens surrounding each primary attribute are defined as associated modifiers. In one embodiment, the primary attribute of the address is a number in free-form text from the street address. The text strings on either side of each primary attribute are referred to as modifiers associated with the primary attribute. The address matching system and method of the present invention includes extracting primary attributes and their associated modifiers from a street address, storing the primary attributes and associated modifiers in an address attribute container, and comparing addresses in a hierarchical manner based on information built into the address attribute container.

In this embodiment, the first stage or layer of matching includes comparing the primary attributes of the addresses for mismatch determination. If no primary attribute in the target address matches the primary attribute of the base address, a mismatch determination may result. The mismatch determination may be made immediately without regard to any other part of the address. If the primary attribute comparison results in a mismatch determination, the processor rejects the address as a mismatch. If the primary attribute of the first address does not conflict with the primary attribute of the second address, and vice versa, a second phase or layer of matching is initiated.

In this embodiment, the matching second layer includes comparing modifiers associated with each primary attribute. A modifier associated with a first primary attribute of a first address is compared to a modifier associated with a selected primary attribute of a second address. In some embodiments, the modifiers themselves are analyzed and fed back into the main algorithm to extract the main attributes from within based on the nature and complexity of the address data. If it is determined that the primary attributes no longer need to be extracted from the modifier, then the second layer also becomes the last layer in the overall address matching process.

Turning to fig. 1, fig. 1 is a block diagram of a search system including a matching engine according to one embodiment of the present invention. FIG. 1 illustrates a search system 100 for retrieving and matching database records, including embodiments of search queries and/or address matching in database records, in accordance with one or more principles of the present invention. The system 100 may utilize information derived from the free-form data source 104 loaded into the system 100 and/or information from a search query transmitted to the system 100 to return a set of records as a set of search results. System 100 may be part of a larger system, such as an international credit reporting system (iccrs) from the alliance (TransUnion).

The various components of the system 100 may be implemented using software executable by one or more servers or computers, such as the computing device 300 having the processor 202 and memory 204 as shown in fig. 3, which are described in more detail below. In one embodiment, the system 100 may perform a fine match on a set of initial search database records. The set of initial search records may be found from the database 108 by the search engine 106, and the matching engine 110, including the address matching engine 112, may further process the initial search records to find a set of more accurate results based on the initial search query. In another embodiment, the address matching engine 112 may perform a comparison on a set of database records based on the address in each of the records. For example, a comparison may be performed to determine whether records should be merged, or whether records match. A socket server (not shown) may be included in the system 100 to manage connections with client applications. Multiple requests may be sent through the socket server while the socket connection is maintained, or each request may require a new socket connection.

Application 102 may generate and initiate a search query to retrieve one or more results derived from data in free-form data source 104 from database 108. The search query may wish to retrieve records for a particular subject consumer. The application 102 may be, for example, a software application executing at a credit bureau and/or a member of a credit bureau (including financial institutions, insurance companies, utility companies, etc.), which desires to retrieve data related to the consumer (e.g., credit information) for licensing purposes. For example, when a consumer applies for a loan, a search query may be initiated by the bank so that the bank may examine the consumer's credit report to assess the consumer's credit value. The bank may enter personal identification information of the consumer in the search query in order to retrieve the credit report. The application 102 may transmit a message containing the search query to the system 100, and in particular, to the search engine 106. The message may be in a defined JSON (JavaScript object notation) format, extensible markup language (XML) or other format. Search results from the search engine 106 may be further refined by the matching engine 110 and the address matching engine 112. The refined results of the search initiated by the search query may be returned to the application 102 by the matching engine 110.

The free-form data source 104 may include raw consumer data that is not consistently formatted and/or structured. The consumer data may include identifying information about the consumer, and financial related data such as liability reimbursement status, on-time payment records, and the like. Consumer data in the free-form data source 104 may originate from various sources, such as from public records, e.g., contracts, bankruptcy records, etc.; and members of credit bureaus including financial institutions, insurance companies, utility companies, and the like. The free-form data source 104 may include minimal and/or incomplete identifying information in each record corresponding to a customer. The names and addresses in the free-form data source 104 may be arbitrary, ambiguous, and/or non-specific. For example, the address in the free-form data source 104 may include "near a kuntz (Guntur) train station," red house in the south of the jogger (Joggers) park, "or" go through the water tank 30 steps from village square. Such addresses may be valid and may receive mail, but are not specific, compared to the address formats used in developed markets. Each of the addresses may be included in an inconsistent number of fields and/or may be arbitrarily divided into a single field or multiple fields. Other data in the free-form data source 104 may be repetitive and thus not unique enough to clearly identify a particular consumer alone. For example, the same account number may be used for loan accounts for different consumers corresponding to different branches of the same bank. In this case, further identification information must be used to uniquely identify a particular consumer.

Raw data from the free-form data source 104 may be processed by the search engine 106 and placed in the database 108. In some embodiments, the raw data may be normalized by the search engine 106 and placed in the database 108. The search query to the search engine 106 may be used to retrieve a set of initial records from the database 108. In some embodiments, the search query may be normalized and/or transformed by the search engine 106 before being executed. Normalizing the raw data and search queries to a abbreviated normalization format may allow for fuzzy matching of the data. Some or all of the original data and search queries (e.g., name, address, date of birth, etc.) may be normalized. Normalization may include normalizing data using regular expressions using exact and template substitutions (pattern substitution) so that fields in a search query may match corresponding data in database 108 because both fields and data have been normalized.

The transformation of the search query may include applying alterations to the search query to allow the query to be broader and inclusive than specified in the original search query. The transformed search query may be sent with or without the original normalized search query. Transformation rules may be customized for a particular market associated with a free-form data source. Embodiments of the search engine 106 are disclosed in a commonly assigned non-transitory application entitled "System and method for subject identification from free-Format data sources (System and Method for Subject Identification From Free Format Data Sources)" (U.S. patent application Ser. No. 13/539,053), the entire contents of which are hereby incorporated by reference. Search engines utilizing any type of search algorithm may also be implemented in search engine 106.

The matching engine 110 and address matching engine 112 may process the search query and/or the set of initial records retrieved from the database 108 by the search engine 106. A refined set of search results that more accurately match the search query may be returned to the application 102 by the matching engine 110. Embodiments of the matching engine 110 are disclosed in a commonly assigned non-transitory application entitled "System and method for matching database records based on similarity to search queries (System and Method for Matching of Database Records Based on Similarities to Search Queries)" (U.S. patent application Ser. No. 13/538,926), the entire contents of which are hereby incorporated by reference herein.

FIG. 2 illustrates an address matching engine 112 that may match addresses to each other to evaluate similarity between addresses. In one embodiment, the address matching engine 112 may be used as part of the matching engine 110 that receives addresses in database records from the search engine 106, as described above. In other embodiments, the address matching engine 112 may be used alone or in combination with other systems to match addresses from any source, such as data files or other media. For example, the address matching engine 112 may be used to delete multiple duplicate addresses for a particular consumer or verify an address match during the dispute process.

Address matching engine 112 includes an attribute extraction engine 150, a primary attribute matching engine 152, and an attribute modifier matching engine 154. To perform analysis of free-form addresses, the address matching engine 112 receives two or more addresses in free-form text. The matching engine 112 may receive addresses from sources such as databases, search queries, files, or other media. The number of received addresses may vary depending on the desired use of the address matching engine 112. For example, the input address and one or more candidate addresses may be transmitted to the address matching engine 112 so that the input address may be compared to the candidate addresses. As another example, multiple input addresses and multiple candidate addresses may be transmitted to the address matching engine 112 such that each of the input addresses may be compared to the multiple candidate addresses.

After the addresses are received into the address matching engine 112, certain strings and tokens for each of the addresses are extracted and stored into the address attribute container 156 at the attribute extraction engine 150. More specifically, the attribute extraction engine 150 may deterministically evaluate the strings in the address and deconstruct the address to extract the primary attributes of the address and its associated modifiers. In one embodiment, the primary attribute of an address is any number within the address. Modifiers are strings or marks (i.e., characters, letters, symbols, spaces) on either side of each primary attribute. The extracted primary attributes of the input address and associated modifiers are stored in an address attribute container 156. Each primary attribute is linked to its corresponding modifier in address attribute container 156.

The first layer of the address matching system and method includes comparing the primary attributes of the addresses at the primary characteristic matching engine 152. More specifically, in a first stage or layer of matching, the dominant attributes are compared for mismatch determination. If at least one primary attribute of the target address conflicts with a primary attribute of the base address, a mismatch determination result may result and vice versa. The mismatch determination may be made immediately without regard to any other part of the address. If the primary attribute comparison results in a mismatch determination, the processor rejects the address as a mismatch.

On the other hand, if the primary attribute of the first address does not conflict with the primary attribute of the second address, and vice versa, then a second phase or layer of matching is initiated at the attribute modifier matching engine 154. More specifically, the associated modifier for the first primary attribute in the first address is compared to the associated modifier for the selected primary attribute in the second address.

In some embodiments, the first layer of the comparison has little weight in the case that the compared address does not contain any major attributes. More specifically, if the first address and the second address do not include any primary attributes (i.e., numbers), the primary attribute comparison does not result in a match or mismatch. In this embodiment, the number of strings and tokens are analyzed in modifier matching engine 154 to determine if there is a match. The comparison may include direct matching of data, matching characters from strings, matching and/or expanding acronyms or acronyms, configurable speech matching, perception and/or ignoring of noisy codewords (e.g., "and", "to", "loci"), configurable known replacement strings, fuzzy string algorithms, word stitching algorithms, diversification algorithms, numeric and non-numeric token analysis, and/or other techniques. In this embodiment, the second comparison layer has more weight than the first comparison layer. In some embodiments, modifiers of a primary attribute contain different layers of attributes that can be extracted in a modifier comparison, depending on the complexity of the free-form text in the address. In such embodiments, the modifiers are fed back to the attribute extraction engine 150 to recursively drive address matching, as illustrated in fig. 2.

FIG. 3 is a block diagram of a computing device 200 hosting executable software for facilitating the search system 100 and/or address matching engine 112. One or more examples of computing device 200 may be used to implement any, some, or all of the components in system 100, including search engine 106, matching engine 110, and address matching engine 112. The computing device 200 includes a memory element 204. The memory element 204 may include a computer-readable medium for implementing the system 100 and for implementing particular system transactions. The memory element 204 may also be used to implement the database 108. Computing device 200 also contains executable software, portions of which may or may not be unique to system 100.

In some embodiments, the system 100 is implemented in software as an executable program and is executed by one or more special purpose or general purpose digital computers (e.g., mainframe computers, commodity server, personal computer (desktop, laptop, or other), personal digital assistant, or other handheld computing device). Accordingly, computing device 200 may represent any computer in which system 100 resides or partially resides.

In general, with the hardware architecture shown in fig. 3, computing device 200 includes a processor 202, memory 204, and one or more input and/or output (I/O) devices 206 (or peripheral devices) communicatively coupled via a local interface 208. The local interface 208 may be one or more buses or other wired or wireless connections, as is known in the art. The local interface 208 may have additional elements omitted for simplicity, such as controllers, buffers (caches), drivers, transmitters, and receivers, to facilitate external communication with other similar or dissimilar computing devices. Further, the local interface 208 may include address, control, and/or data connections to enable internal communication between other computer components.

The processor 202 is a hardware device for executing software, particularly software stored in the memory 204. Processor 202 may be any custom made or commercially available processor such as, for example, core series (Core services) or vPro processors manufactured by intel corporation (Intel Corporation), or a nylon (Phenom), a flash (Athlon) or a flash (semtron) processor manufactured by ultra-micro devices corporation (Advanced Micro Devices, inc.). In the case where computing device 200 is a server, the processor may be, for example, a to-strong (Xeon) or Itanium (Itanium) processor from intel or a bright dragon (optron) series processor from the company of the ultra-fine device. Processor 202 may also represent multiple parallel or distributed processors working in concert.

The memory 204 may include volatile memory elements (e.g., random access memory (e.g., RAM of DRAM, SRAM, SDRAM, etc.) and nonvolatile memory elements (e.g., ROM, hard disk drive, flash drive, CDROM, etc.). Which may incorporate electronic, magnetic, optical, and/or other types of storage media. The memory 204 may have a distributed architecture in which various components are located remotely from each other, but still accessed by the processor 202. These other components may reside on devices located in other locations on the network or in a cloud arrangement.

The software in memory 204 may include one or more separate programs. The separate program includes an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 3, the software in memory 204 may include system 100 and a suitable operating system (O/S) 212 in accordance with the present invention. Examples of suitable commercially available operating systems 212 are the Windows operating system available from Microsoft corporation (Microsoft Corporation), mac OS X available from Apple Computer, inc., unix operating systems from AT & T, or Unix derived products such as BSD or Linux. The operating system O/S212 will depend on the type of computing device 200. For example, if Computing device 200 is a PDA or handheld computer, operating system 212 may be iOS for operating certain devices from apple computer, palm os for devices from Palm Computing, inc, windows Phone 8 from microsoft corporation, android (Android) from Google, inc, or shift (Symbian) from nokia corporation (Nokia Corporation). The operating system 212 essentially controls the execution of other computer programs, such as the system 100, and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.

If the computing device 200 is an IBM PC compatible computer or the like, the software in the memory 204 may further include a Basic Input Output System (BIOS). The BIOS is a set of basic software routines that initialize and test hardware at startup, initiate the operating system 212, and support data transfer between hardware devices. The BIOS is stored in ROM such that the BIOS may be executed when the computing device 200 is activated.

The steps and/or elements of the present invention and/or portions thereof may be implemented using a source program, an executable program (object code), a script, or any other entity comprising a set of instructions to be performed. Furthermore, software embodying the present invention may be written as (a) an object-oriented programming language having categories of data and methods, or (b) a programming language having routines, subroutines, and/or functions, such as, but not limited to C, C ++, C#, pascal, basic, fortran, cobol, perl, java, ada, and Lua. The components of system 100 may also be written in proprietary languages developed for interaction with these known languages.

The I/O device 206 may include an input device such as a keyboard, mouse, scanner, microphone, touch screen, bar code reader, or infrared reader. It may also include an output device such as a printer, video display, audio speaker or headphone port, or projector. The I/O devices 206 may also include devices that communicate with an input or output, such as a short-range transceiver (RFID, bluetooth, etc.), a telephone interface, a cellular communication port, a router, or other type of network communication equipment. The I/O device 206 may be internal to the computing device 200, or may be external and connected wirelessly or via a connection cable, such as through a universal serial bus port.

When the computing device 200 is in operation, the processor 202 is configured to execute software stored within the memory 204, to communicate data to and from the memory 204, and to generally control the operation of the computing device 200 in accordance with the software. The system 100 and operating system 212 may be read, in whole or in part, by the processor 202, buffered within the processor 202, and then executed.

In the context of this document, a "computer-readable medium" can be any means that can store, communicate, propagate, or transport the data object for use by or in connection with the system 100. The computer-readable medium can be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, propagation medium, or any other device having similar functionality. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic) having one or more wires, a Random Access Memory (RAM) (electronic), a read-only memory (ROM) (electronic), an erasable programmable read-only memory (EPROM, EEPROM, or flash memory) (electronic), an optical fiber (optical), and a portable compact disc read-only memory (CDROM) (optical). Note that the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory. The system 100 may be embodied in any type of computer-readable medium for use by or in connection with an instruction execution system or apparatus (e.g., a computer).

To connect to other computing devices, computing device 200 is equipped with network communication equipment and circuitry. In a preferred embodiment, the network communication equipment comprises a network card, such as an ethernet card or a wireless connection card. In a preferred network environment, each of a plurality of computing devices 200 on a network are configured to communicate with each other using the internet protocol suite (TCP/IP). However, it will be appreciated that various network protocols may also be employed, such as IEEE 802.11Wi-Fi, address resolution protocol ARP, spanning Tree protocol STP, or fiber distributed data interface FDDI. It will also be appreciated that while the preferred embodiment of the present invention has a broadband or wireless connection (e.g., DSL, cable, wireless, T-1, T-3, OC3, satellite, etc.) to the Internet for each computing device 200, the principles of the present invention can also be practiced with standard modems using dial-up connections or other connection means. Wireless network connections are also contemplated, such as wireless ethernet, satellite, infrared, radio frequency, bluetooth, near field communication, and cellular networks.

Turning to fig. 4, fig. 4 illustrates a flow chart of one embodiment of a process 400 for operating the address matching system and method of the present invention. In one embodiment, this process 400 is embodied in one or more software programs stored in one or more memories and executed by one or more processors or servers. Although this process 400 is described with reference to the flowchart illustrated in FIG. 4, it should be appreciated that many other methods of performing the actions associated with these processes may be used. For example, the order of certain steps described may be changed, or certain steps described may be optional.

In various embodiments, the process 400 may result in a determination that the input address and the candidate address are matching. In some embodiments, address match determination may help refine a set of search results from search engine 106 by merging matching database records. The address matching engine 112 may perform all or part of the process 400.

At step 402, an address may be received at the address matching engine 112. Addresses may be received from any source, and the number of addresses received at engine 112 may vary. In one embodiment, the address may be from a set of retrieved records of a search query received from the search engine 106 and/or the application 102. Records may have been retrieved by search engine 106 from database 108 based on search queries received from application 102. In another embodiment, the addresses may be from a source or application that wishes to compare one or more input addresses to one or more candidate addresses to determine their similarity.

At step 404, the primary attributes and associated modifiers are extracted from the address by the attribute extraction engine 150. More particularly, the attribute extraction engine 150 may deterministically evaluate strings in an address and deconstruct the address to extract the primary attributes of the address and its associated modifiers. In one embodiment, the primary attributes of an address are numbers within the address, and the associated modifiers are strings on either side of each primary attribute. For example, in the case of address lines, the numbers contained in the street address are considered primary attributes, and the characters (or lack of characters) on either side of each number are modifiers associated with each primary attribute.

At step 406, the extracted primary attributes and associated modifiers of the input address are stored into an address attribute container. Each primary attribute is associated with its corresponding modifier in an address attribute container.

At step 408, the primary attributes of two or more addresses are compared at the primary attribute matching engine 152. The primary attribute of the first address is compared with the primary attribute of the second address to determine if the two addresses do not have primary attributes (i.e., numbers) that conflict with each other. At step 410, the process 400 includes determining whether the primary attribute comparison exhibits a mismatch. If there is at least one conflicting primary attribute between two or more addresses, then the results are mismatched and process 400 includes rejecting the addresses as not matching as indicated by block 412.

On the other hand, if there are no conflicting primary attributes between addresses, then process 400 continues at step 414 to the second layer of address matching analysis, which includes comparing the associated modifiers at attribute modifier matching engine 154. More specifically, the associated modifiers of the primary attribute in one address are compared to the associated modifiers of the selected primary attribute in the second address.

At step 416, the process 400 includes determining whether the modifiers exhibit a match. In one embodiment, the process 400 determines that an address is a match if at least one modifier associated with each primary attribute of a first address matches at least one modifier associated with a corresponding primary attribute of a second address.

If the modifiers do not match, then the process 400 includes rejecting the address as not matching as indicated by block 418. On the other hand, if the modifiers match sufficiently, then process 400 includes returning the matching address, as indicated by block 420.

It should be appreciated that in various embodiments, the threshold for comparison of modifiers varies based on the characteristics of the address that includes the primary attribute matching result and external factors such as market and address source. For example, when comparing modifiers for addresses having stronger primary attribute matches (i.e., a higher number of matching primary attributes), the threshold is different than when comparing modifiers for addresses having weaker primary attribute matches (i.e., a lower number of matching primary attributes). The likelihood that the two addresses for which the higher number of principal attributes match exactly is the same is higher than for two addresses with weaker principal attribute matches. Thus, in some alternative embodiments, the threshold of comparison may require less matching between modifiers. For example, in some embodiments, two addresses may be considered a match as long as at least one modifier associated with at least one common primary attribute of a first address matches at least one modifier associated with a corresponding common primary attribute of a second address. In other embodiments, the comparison threshold may require more matches between modifiers.

Turning to fig. 5A, fig. 5A illustrates a diagram of a process 500 of extracting and storing primary attributes and associated modifiers at the attribute extraction engine 150. As depicted in fig. 5A, an address 502 of free-format text is received at a processor. The primary attributes and associated modifiers are extracted from the address, as indicated by arrow 504. The primary attributes of the address are extracted and stored in the primary attributes container 506. The modifiers associated with each primary attribute are extracted and stored into an attribute modifier container 508 associated with the respective primary attribute, as indicated by arrow 510. Fig. 5B and 5C illustrate diagrams of one example embodiment of the process 500 depicted in fig. 5A. Fig. 5B illustrates a process 500 with respect to a first address 602, and fig. 5C illustrates a process 500 with respect to a second address 702.

Turning to fig. 5B, fig. 5B illustrates the processor receiving the first address 602 in free-form text. In this example, the first address 602 is "MZ#15-54JOROBA C84LT 4". Arrow 604 represents extracting the primary attribute and associated modifier from the address and storing the primary attribute and associated modifier in the address attribute container. The primary attributes (i.e., numbers in the address) are extracted and stored into primary attribute container 606, which contains attribute elements 606A, 606B, 606C, and 606D. Thus, the extracted main attributes are: (a) 84 in attribute element 606A, (B) 54 in attribute element 606B, (C) 15 in attribute element 606C, and (D) 4 in attribute element 606D.

The modifiers associated with each primary attribute are strings or tokens on either side of each primary attribute. The modifiers associated with each primary attribute are extracted and stored into an attribute modifier container linked to the corresponding primary attribute. Thus, for the primary attribute "84", the associated modifiers (i.e., characters on either side of the number) are [ "JOROBA", "C" ] and "LT" stored in the attribute modifier container 608A. The primary attribute element 606A is linked to the attribute modification Fu Rongqi 608A, as indicated by the linking arrow 610. For the primary attribute "54", the associated modifiers "-" (for hyphens) and [ "JOROBA", "C" ] are stored into the attribute modifier container 608B, and the attribute modifier container 608B is linked to the primary attribute element 606B by the link arrow 610B. For the primary attribute "15", the associated modifiers [ "MZ", "# ] and" - "(for hyphens) are stored into the attribute modifier container 608C, the attribute modifier container 608C being linked to the primary attribute element 606C by a link arrow 610C. For the primary attribute "4", the associated modifiers "LT" and "Null" are stored into the attribute modifier container 608D, and the attribute modifier container 608D is linked to the primary attribute element 606D by the link arrow 610D.

Fig. 5C illustrates a process 500 of operating on a second address 702. More particularly, illustrated in fig. 5C, the second address 702 is received by the processor in free-format text. In this example, the second address 702 is "J84 C#54-15-4". Arrow 704 represents extracting the primary attributes and associated modifiers from the address and storing the primary attributes and associated modifiers in an address attribute container. The primary attributes (i.e., numbers in the address) are extracted and stored into the primary attribute container 706 containing attribute elements 706A, 706B, 706C, and 706D. Thus, the extracted main attributes are: (a) 84 in attribute element 706A, (B) 54 in attribute element 706B, (C) 15 in attribute element 706C, and (D) 4 in attribute element 706D.

The modifiers associated with each primary attribute are extracted and stored into the following attribute modifier container: (a) For the primary attribute "84", the associated modifier "J," [ "C", "# ] is stored in the attribute modifier container 708A, the attribute modification Fu Rongqi a is linked to the primary attribute element 706A; (b) For the primary attribute "54", the associated modifiers [ "C", "# ] and" - "(for hyphens) are stored into the attribute modifier container 708B, the attribute modifier container 708B being linked to the primary attribute element 706B by the link arrow 710B; (c) For primary attribute "15", the associated modifiers "-" and "-" (for hyphens) are stored into attribute modifier container 708C, attribute modifier container 708C is linked to primary attribute element 706C by link arrow 710C; and (D) for the primary attribute "4", the associated modifiers "-" (for hyphens) and "Null" are stored into attribute modifier container 708D, with attribute modifier container 708D linked to primary attribute element 706D by link arrow 710D.

After extracting the primary attributes and modifiers from the first address and the second address and storing the attributes in the address attribute container, the address matching system and method includes comparing the primary attributes at the primary attribute matching engine 152. The primary attributes of two or more addresses are compared to determine if there is a mismatch. In this example embodiment, the primary attributes of the first address 602 all match the primary attributes of the second address 702, so there is no mismatch.

It should be appreciated that it is not necessary that all major attributes be matched for the presence of a "no mismatch" determination. If the primary attribute comparison does not result in a mismatch determination, the primary attribute comparison will result in determining one of the different levels of "subset" relationships. In various embodiments, the comparison of the primary attributes presents a good resolution of the different levels of similarity between the primary attributes of the address. For example, a comparison of the primary attributes of two or more addresses may exhibit exact matches (i.e., (1, 2,3 vs. 1,2, 3)) or different levels of "subset" relationships (i.e., (1, 2,3 vs. 1, 2) and (1, 2,3 vs. 3)).

Turning to fig. 6, fig. 6 illustrates a diagram of address comparisons from fig. 5B and 5C. As illustrated, the primary attribute elements 606A, 606B, 606C, 606D of the first address 602 are compared with the primary attribute elements 706A, 706B, 706C, 706D of the second address 702. In this example embodiment, since the primary attribute comparisons do not result in mismatches in the first layer of matching, modifiers associated with each primary attribute are compared at the attribute modifier matching engine 154 in the second layer of matching.

In the matching second layer, the modifier associated with the first primary attribute element 606A of the first address 602 is compared to the modifier associated with the corresponding primary attribute element 706A (if present) of the second address 702. More specifically, for the first address 602, for the primary attribute element "84"606A, the associated modifiers are [ "JOROBA", "C" ] and "LT" indicated by the modifier address attribute container 608A and by the link arrow 610A in FIG. 5B. For the corresponding primary attribute element "84"706a of the second address 702, the associated modifiers are "J" and [ "C", "#" ] indicated by the modifier address attribute container 708A and the link arrow 710A in fig. 5C. The modifier associated with "84" of the first address 602 is compared to the modifier associated with "84" of the second address 702.

For the common primary attribute "84", both addresses 602, 702 contain a common modifier "C" and partial string matches between "JOROBA" and "J". Thus, this first comparison results in a match. Similarly, for the second common primary attribute "54", both addresses 602, 702 include common modifiers "-" and "C". For the third common primary attribute "15", both addresses 602, 702 contain a common modifier "-". For the fourth common primary attribute "4", both addresses contain a common modifier "Null". In this embodiment, two addresses are considered to be matched because the associated modifier for each of the primary attributes results in at least one match.

It should be appreciated that in various embodiments, the threshold for comparison of modifiers varies based on the address characteristics including the primary attribute match results and external factors such as market and address source. In this example embodiment, for each primary attribute, at least one modifier associated with the primary attribute from the first address matches at least one modifier associated with the corresponding primary attribute from the second address. In some alternative embodiments, the comparison threshold may require fewer matches between modifiers. For example, in some embodiments, two addresses may be considered a match as long as at least one modifier associated with at least one common primary attribute of a first address matches at least one modifier associated with a corresponding common primary attribute of a second address. In other embodiments, the comparison threshold may require more matches between modifiers.

It should be appreciated that in various embodiments of the present invention, weighting factors are applied to different layers of the comparison method. Other address matching methods include assigning different weights based on a predetermined order of importance of different components of the address. One problem with this approach is that different components of the address are emphasized in different locations, different countries and regions. In contrast, no order of importance is predetermined in the present invention. The first layer performs the primary attribute comparison and naturally obtains the highest weight. Likewise, the subsequent layers correspondingly have their weighting factors. The modifier comparison adjusts the upper layer, i.e., the primary attribute comparison. For example, if the compared address does not contain a number, the first layer (comparison of the primary attributes) is given less weight. In this embodiment, the primary attribute is not extracted and thus there is no mismatch determination in the compared first layer. In this embodiment, the second layer of comparison (i.e., the comparison of modifiers) is given more weight than embodiments in which there are many primary attributes of the match. It should be appreciated that the weight given to the primary attribute may also vary depending on the comparison. For example, when comparing two addresses with many numbers in an address, the primary attribute extraction and comparison will be more decisive than when comparing two addresses with fewer numbers.

As described above, any process descriptions or blocks in the drawings should be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of embodiments of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art.

It should be emphasized that the above-described embodiments of the present invention, particularly, any "preferred" embodiments, are possible examples of implementations, merely set forth for a clear understanding of the principles of the invention. Many variations and modifications may be made to the above-described embodiments of the invention without departing substantially from the spirit and principles of the invention. All such modifications are intended to be included herein within the scope of this disclosure and the present invention and protected by the following claims.

Claims

1. A method of matching a first address and a second address, the method comprising:

an address matching engine arranged for communication with a processor and a database accessible to a computer network, the database comprising a plurality of unstructured, incomplete and/or non-uniformly formatted data from a free-form data source, the data comprising the first address and the second address;

In response to receiving, via the processor, the first address and the second address communicated over a computer network to the address matching engine to determine whether the first address and the second address match, the following is performed:

extracting, by the address matching engine, any primary attributes of the first address, any primary attributes of the second address, any modifiers associated with the extracted primary attributes of the first address, and any modifiers associated with the extracted primary attributes of the second address via the processor based on deconstructing strings in the first address and the second address;

comparing, by the address matching engine, the extracted primary attribute of the first address with the extracted primary attribute of the second address via the processor to determine (1) whether the first address and the second address include one or more common extracted primary attributes, and (2) whether the first address and the second address include any conflicting extracted primary attributes; and

in response to determining that (1) the first address and the second address include commonly extracted primary attributes and (2) the first address and the second address do not include any of the conflicting extracted primary attributes, performing, for each of the commonly extracted primary attributes determined:

Comparing, via the processor, the extracted modifiers associated with each of the common extracted primary attributes of the first address with the extracted modifiers associated with selected extracted primary attributes of the second address by the address matching engine to determine whether a number of matched modifiers exceeds a modifier comparison threshold to indicate that the first address matches the second address; and

in response to determining that the number of matched modifiers exceeds the modifier comparison threshold, a message is transmitted by the address matching engine via the processor indicating that the first address matches the second address.

2. The method of claim 1, wherein the primary attribute is a number.

3. The method of claim 2, wherein the modifier is a string on either side of the respective primary attribute.

4. The method of claim 1, wherein extracting includes deterministically evaluating, by the address matching engine, a first string in the first address to identify a primary attribute of the first address and deterministically evaluating, by the address matching engine, a second string in the second address to identify a primary attribute of the second address via the processor.

5. The method of claim 1, wherein the extracted primary attributes are stored by the address matching engine into a primary attribute container via the processor.

6. The method of claim 5, wherein the modifiers extracted are stored into attribute modifier containers such that each attribute modifier container includes the modifiers extracted associated with the same primary attribute element.

7. The method of claim 6, wherein a first one of the primary attribute elements in the primary attribute container is linked by the address matching engine via the processor to the attribute modifier container containing the modifier associated with the first primary attribute element.

8. The method of claim 2, wherein the primary attribute includes an apartment number, a house number, a post office box, a floor, a building, a complex, a street, a geographic direction, a region, a zone, a counter number, a suburb, a village, a suburb, a town, a city, or a country.

9. The method of claim 1, further comprising transmitting, by the address matching engine, a mismatch notification via the processor when there is a conflicting extracted primary attribute between the first address and the second address.

10. The method of claim 9, wherein a mismatch determination is made without comparing any of the extracted modifiers of the first address with any of the extracted modifiers of the second address.

11. The method of claim 1, wherein the modifier comparison threshold is based on a number of co-extracted primary attributes.

12. The method of claim 1, wherein the modifier comparison threshold is based on a source of the free-form data source.

13. The method as in claim 1, further comprising:

assigning, by the address matching engine, a first weighting factor associated with the extracted primary attribute via the processor based on a number of primary attributes extracted from the first address and the second address; and

assigning, by the address matching engine, a second weighting factor associated with the extracted modifier based on the first weighting factor,

wherein the modifier comparison threshold is based on the first weighting factor and the second weighting factor.