UPRN address matching algorithm: Difference between revisions

Revision as of 11:30, 4 June 2020

Overview

Address matching algorithm uses human semantic based pattern recognition, manipulation and matching judgement as it's matching technique, supported by a few machine based algorithms such as the Levenshtein distance algorithm

Firstly, some definitions of terms used:

Term	Description
Candidate address	The address string submitted by a user or a subscriber system for matching
Standard address	An address that an organisation, considered to be an authority, has stated as referring to a real location.

Matching a candidate address to a standard address consists of a process whose objective is to reach a high level of confidence that the candidate address refers to the same location as the standard address.

When attempting to match a candidate address to one standard address from a set of standard addresses there are two objectives. If both objectives are achieved, the address is said to be matched. The objectives are:

To reach a high level of confidence that the candidate address refers to the same location as the standard address.
To judge that the standard address that has been matched to, is the most likely from all of the addresses in the set to refer to the same location.

It can be seen that the objectives include two relative measures, confidence and judgement. A question arises as to whether it is possible for these measures to be mathematically or statistically based, or not.

It does not take long to show the fundamental problem with address matching.

Consider the following candidate and standard addresses

Candidate   :Flat 1,15 high street, YO15 5TG
Standard    :Flat 1,15 high street, YO15 5TG

By any computable measure, e.g. length of string, character position match, it can be seen that these two addresses are identical. One can easily deduce from this that there will be a high level of confidence that they refer to the same location. Also, as it is not possible for any other address to be more similar, a sound judgement can be made that this is the most likely from all of the addresses, except for the other identical addresses in the set.

Consider the following

Candidate  :Flat 1 ,15 high street YO15 5TG
Standard   :Flat 1, 15 high street YO15 5TG

One can see that they are different. Whilst both have the same characters, the position of character 7 and 8 have been transposed. Yet a human being would almost certainly say they refer to the same place. Also, unless there is another address that is exactly the same as the candidate, the match is the most likely.

Consider the following:

Candidate  :Flat 1,15 high street YO15 5TG
Standard   :Flat 11, 5 high street YO15 5TG

Like the above example there has been a transposition at position 7 and 8. Yet they are obviously different addresses. They are very unlikely to refer to the same location unless there were no other flats or numbers on the street. Even If there were no other numbers or flats on the street then the confidence level may still only be moderate. Not enough to match?

So it can be seen that in fact it is semantics that drive matching judgements and not just positional variations and character mismatches. Likewise:

Candidate :15 Flat a High Street YO15 5TG
Standard  :Flat a 15 high Street YO15 5TG

These two addresses are very likely to be the same unless a closer fit can be found. A closer fit is NOT a closer fit simply by character matching.

Candidate :15 Flat a High Street YO15 5TG
Standard  :1 Flat 5a high street YO15 5TG

These are also quite close but clearly semantically different.

So it can be seen that in nearly all cases, when comparing one address with another, or an address against a set of addresses, it is the semantic interpretation of the address that determines the match. Human based semantic interpretation is still more reliable than AI for language-based judgements.

It follows that address matching rules are nothing more than trying different addresses on for size and see which one a human being thinks means the same or a similar thing.

Pattern recognition and manipulation

What does the computer do? Basically it computerises the process of human pattern recognition and human suggested string manipulation.

Firstly, we know from our own experience that words that are similar or the same words in different orders often mean the same thing. We also know that letters and numbers can be transposed without too much loss of meaning. We also know that misspelled words when corrected usually mean the same thing.

For example, we make a judgement that the word 'flr', in the context of an address, is likely to mean ‘floor’. In a cookbook though it might mean “flour”. Also we know that the same meaning can have different words or spelling such as 1^st, and first. Often different punctuation means more or less the same '5/6' and '5-6' etc.

From this knowledge we can set pattern recognition rules. We can say that full stops can be removed or '/' replaced with '-' because we know that they are unlikely to affect the semantics.

However, there are always twists. It is reasonable to infer that 'st' is likely to be short for 'street'. However, the phrase 'St Katherine's way' implies a saint, not a street. Therefore the manipulation rules must also take account of potentially wrong manipulations. In this case, a 'street' can be recognised by checking the resulting expansion e.g. ('high st -> high street') against a standard street index, or the position of the 'st' in the string. These patterns and manipulations are coded and validated and adjusted if wrong manipulations are discovered.

Human judgement based manipulations can result in false positives. As the algorithms are developed the introduction of false positives must be checked regularly and the manipulation rules adjusted.

Rules,based on knowledge of the world can be quite clever. Knowing that a “top floor flat” is more likely to be the same as “flat c” from a list of a, b, c than “flat a”, gets a preference from a set of options.

Address components

Addresses have more than a dozen semantically different components. Name of a flat, number of a flat, number and letter of a flat, range of numbers for a flat, a building name, a building number, a building number and letter, a range of numbers with or without letters, a dependent thoroughfare, a street, a dependent locality, a locality, a town, a city and a post code. On top of this there are the descriptions in relation to the front or back of the building, the level within a building or whether facing east or west, north or south. There are many words to describe flat like units, maisonettes, studios, houses, apartments and so on.

With address labels with up to around 10-12 semantically different field meanings and with human tendency to place words in the wrong fields and in the wrong order in the wrong fields, or inadvertent field separators means that field allocations cannot be relied on. manipulation rules take account of wrong allocations. In some cases users submit address with everything in one field. Manipulation must take account of no user determined address fields at all.

With each manipulation comes a human judgement that the resulting string is still a true reflection of the candidate and that a match to a standard would be correct. It is only human judgements that can be used to match addresses.

There are a few useful mathematical techniques. Pluralisation and de-pluralisation. Use of Levenshtein distance algorithm with allowable distances roughly proportional to the length of the words can help. Partial matching of words with front part matching (as humans often get the end of phrases wrong rather than the start) helps also.

However, these are just techniques to speed up what would otherwise be a manual process relying in eyesight, concentration and a high boredom threshold.

Conclusion on approach

A computer only speeds up the process of pattern recognition and string manipulation that would otherwise by done by a human being, each manipulation being specifically undertaken with a view to enable a visual check that would result in a match.

If a particular manipulation falls short of something that a human knows would match, the manipulation is abandoned and another one tried. The usual approach is to start with part of the address that seems to match, make small manipulations initially, then if no matches are found, increase the degree of manipulation. There is no useful machine-based algorithm based on a purely mathematical functions. Whilst a mathematical function like transpositions may highlight likely character matches (and there is an association between character matching and semantic meaning) relying only this approach would reduce possible semantic matches achieved by larger distance comparisons using human judgement.

The most significant problem is what to do when manipulations build on manipulations that are already dubious. I this case the level of confidence drops at this point and the process can start again from a different starting point, repeated until the algorithm author has judged that one set of manipulations give higher confidence than another. This results in a series of rankings for each set of manipulations. If the rankings are in the wrong order, then a premature match might be made, resulting in a false positive match. Adjusting the ranking uses the same type of judgement as adjusting the manipulation.

Address matching processes

Firstly some of the terms used are defined. This table lists the main terms used in the description

Term	Description
Subscriber	An organisation (or system that is operating on behalf of an organisation) that is seeking to access data from the Discovery service
Standard address	An address provided by an authoratitive body that has got an assured link to a UPRN
Candidate address	The address string submitted by a user or a subscriber system for matching
UPRN	Unique property reference number as supplied by the ordnance survey organisation
Address based premium	The set of databases as provided by the Ordnance survey holding the UPRNs and several addresses for each
Equivalent match	The UPRN is deemed to be equivalent to the property as described by the submitted address
Sibling match	The UPRN is assigned to a property nearby e.g. next door
Sub-property match	The UPRN is assigned to a â€œhigher level property than the submitted address. For example if the submitted address wasflat 1 and the UPRN was assigned to flat then this would be a supra-property match
Sub property match	The UPRN is assigned to a lower levelproperty than the submitted address. For example if the submitted address was flat 1 and the UPRN was assigned to flat 1 a then this would be a sub property match
DPA file	The Post office Delivery point address file
LPI file	The Local Property identifier file

The Standard address database

Source and nature of standard addresses

The Ordnance Survey (OS) has a product; ‘Address Base Premium (ABP)’ which contains information about every UPRN. It contains three main files, with each entry in each file assigned directly to a UPRN

A Delivery Point Address file (DPA) which lists all the Royal Mail addresses sourced from Royal Mail’s PAF (Postcode Address File) which is a non-geocoded list of addresses that mail is delivered to.
A Land Property identifier file (LPI) which lists geographical addresses as maintained by contributing Local Authorities. They represent the legal form of addresses as created under street naming and numbering legislation. The structure of a geographic address is based on British Standard BS7666.
A Basic Land Property Unit file (BLPU) which lists all the unique properties, UPRN, post code and geographical coordinates.

There can be more than one version of the LPI geographic address for a UPRN, capturing approved, alternative, provisional and historical (inactive) versions of the DSI exclude status 8. All are included as potential matches

address. All are available for matching.

These files are considered as the standard address details for each UPRN. Whilst each file represents addresses in a slightly different way, they are aligned with each other. The LPI file contains entries that are often more granular than the DPA file. Some UPRNs only exist in the LPI file and not in the DPA file and vice versa. Thus the LPI file can usually be considered as the more definitive of the two but a match to either can be considered correct.

In addition to a difference in the level of detail in the two files there is sometimes a relationship between one UPRN and another in that one UPRN can be a “parent” UPRN of a more detailed property. For example a parent UPRN may be assigned to a building whereas a child UPRN is assigned to a flat within a building. Both the parent and the child UPRN may exist in the DPA file and the LPI file.

In effect this means that there are two sources of the truth for an address, and matching to either source provides the UPRN. It is important to map to a child UPRN where possible as the lowest level detail better represents actual households.

Loading addresses, formatting, indexing and custom indexes

Address classes

Addresses are loaded into the database from source (See green classes).

Each source standard address is reformatted into a standard address object, which is an instance of a class that extends the address class by dint of having a UPRN and status. The objective of the reformatting is to produce a single model of an address for matching to which candidate addresses will also adhere.

Format also includes standardisation of the use of suffix and ranges e.g. 1-2, 1a, 1a-1f.

Formatting of the standard address also includes 'fixing' some errors and removing some extraneous words that are unnecessary in the matching process. These include:

Spelling corrections, modified by context
Replacement or removal of punctuation and lower casing
'Flat' removals, involving the removal of terms from the flat field that signal a flat are can be considered equivalent e.g. flat, apartment, towers, rooms, flat no, unit workshop, maisonette. Some special flat terms are retained for potential removal later in a leaf manipulation e.g. 'studio'

Addresses are then indexed in a number of ways:

Secondary index on post code street and building name columns
Multiple Composite indexes on post code, street, number, building flat & post code building & building flat and post code
Functional indexes on the above including concatenation, de-pluralisations

Once Loaded, reformatted and super-indexed. The address database is ready to be used

Candidate address pre-formatting

It is assumed that an address is submitted as one or more delimited fields with the post code at the end of available.

Optionally, an area based qualifier is included, which is useful of there is no post code. For example a practice post code or simply a list of major postcodes. These narrow down the options when checking against 100 million+ addresses.

Candidate addresses are subject to the same reformatting rules as standard addresses (spell checking etc). However this needs to be done in two stages

Before the address fields are populated
Once more after the address fields are populated

Using human example driven pattern recognition a set of manipulation rules are followed that result in an address object being populated. There are hundreds of rules to follow. Many rules manipulate data when new patterns are recognised following from previous manipulations.

For example, Let us assume the following pattern and manipulation occurs

flat 1 St Paul's house 15 high street

Pattern recognition= street preceded by number manipulation = populate number and street

number_street = 15 high street

However, consider a different example that requires additional manipulating

flat 1 St Paul's house 14- 15 high street

Pattern recognition= street preceded by number manipulation = populate number and street resulting in

number_street = 15 high street

Which is wrong, and therefore a subsidiary pattern of two numbers, with or without a dash, making sure that there is a flat number already allocated, and the number street is therefore

number_street= 14-15 high street

Iterative Pattern recognition together with manipulation results in >100 variations which never complete the number of possible manipulations needed.

The curve of the number of manipulations needed to improve the match detection rate by a certain percentage is exponential.

Matching candidate to standard addresses

Match failure circumstances

It is not always possible to match a Discovery address exactly to a specific UPRN. There are 3 reasons why a Discovery address may not match

It is a false address i.e. the address does not actually exist in reality and cannot therefore be matched
There may not be sufficient information in the candidate address for it to match at the level of detail required.
The Discovery address may contain more detail than the standard address and is therefore too accurate

There are missing entries in the DPA and LPI file. The files are constantly being updated and released every 6 weeks. There is a massive amount of building going on. Even changes to post codes can occur and the DPA file (which contains the post code) may be out of date or wrong.

Qualified matching

As well as matching to a UPRN there is also requirement to qualify the relationship between a Discovery address and the UPRN. We refer to these qualifiers as “approximation qualifiers” as they mean that the UPRN is geographically close to the property. Each of the main 5 address fields is assigned a qualifier. The five qualifiers are:

Equivalent match i.e. we believe the property is the property that the UPRN refers to
Near match. We believe the property is the property that the UPRN refers to but the flat or property may be split. This only applies to matches where the flat or number suffix in either the Discovery or the ABP addresses may be ignored or dropped. This is in effect equivalent.
Child - Sub- property. The Discovery address represents a sub-property of the UPRN property. For example the Discovery address “flat 1a Eagle house” may only match to the higher level DPA or LPI entry of “Eagle house”
Parent. The Discovery address represents a supra-property (parent) of the UPRN property. For example the Discovery address ‘1 Angel Lane’ does not have an equivalent in DPA or LPI but ‘flat 11, 1 Angel Lane’ does exist so it matches to a more detailed identifier qualified as a supra-property
Sibling. The Discovery address may represent a sibling of the UPRN property. For example the Discovery address ‘flat 12 , 1 Angel Lane’ may not exist in either ABP file but ‘flat 11, 1 Angel Lane’ does

Qualifiers are assigned in relation to the final post manipulated address match.

It should be noted that approximation qualifiers are used only when the level of matching arrives at the level of a street number or a more detailed approximation to a building or flat. Simply matching to a street is not considered a match and the address.

Poor quality addresses

Candidate addresses are checked for quality. A poor quality address is more likely to remain unmatched, but a quality indicator is assigned whether matched or not. Poor quality indicators include:

null address lines. i.e. all address lines are null
The entire address line is too short (<9 characters)
The post code is missing
The post code is in an invalid format

Matching algorithms

To formalise the judgement a ranking can be applied to each set of rules that perform a match e.g. (Rank 1 – Rank 80)

A Rank 1 rule would represent an exact match of all fields (taking into account DSBefore

GHIs address cleaning carried out before running the algorithm, or on the fly as part of the algorithm?

spelling corrections and other standard text cleaning methods). A Rank 80 would represent an approximate match based perhaps on fuzzy matching of 2-3 fields and a parent approximation.

By applying the rules in Ranking in order, this avoids matching prematurely with a fuzzy match when a better match exists. It is the false GHYes, address matching is susceptible to high false positive rates

positives that create the problems as the matching gets harder.

This matching process can be considered as a series of recursive reductions in the list of probable matches, recursion to continue until the list can be narrowed to the maximum extent possible. At that point a judgement is made as to whether there is a match with a qualifier, or no match. In addition, during the matching process, performance enhancements are made to enable short cuts.

The starting point is that a submitted address has a potential list of matching candidates of the entire ABP address database. However, it is impractical to match a submitted address by re-ordering the entire ABP address list each time. Therefore one prior assumption is applied. That is that the submitted address must have a DSTalking cross purposes. This is not derived from he address post code but from the requestors system when they make the request. A practice in Cardiff will know their area codes. I suggest they make one request batch per area. It is a vital performance thing I think. So in East London there would be one for ‘e’, one for ‘ec’ one for ‘nw’ etc

GHDo you mean the postcode district must match?

This narrowing of the potential matches by postcode district is called ‘blocking’ and is good practice. However, I have found matches using only the address and not the postcode, but very small number

post code district e.g. ‘E’ or ‘EC’ or ‘YO’.

The algorithms described in this specification consider how to optimise the sort process without creating false approximate matches.

=== Address Fields for matching ===ABP address files contain quite granular field definitions. Discovery address files consist of 1-5 address lines and a post code with each address line not necessarily reflecting the correct address line or field values.

The Discovery address is DSSent you the algorithm in mumps th other week. There have been some enhancements. The parsing algorithms are just as complex as the matching algorithms but more useful. By pre-cleaning the address (on the fly) it massively reduces the number of matching algorithms needed

GHHow do you do this? Do you use parsing or just a lookup?

parsed into the logical address fields represented by the ABP files.

It is assumed that the reader has read and understood the ABP file documentation.

The following fields are taken into account when matching.

Fig 2 – DSThat’s the parsing algorithm. These fields are derived from the rubbish that is sent. At the moment it assumes at least address lines (e.g. 1, 2,3 or 4). Nevertheless , If they are going to be sent as one block then we will need to extend the address parsing to cope. I don’t think that will be a problem though as in any event the matching algorithms shuffles the fields. In other words, best guess improves performance but algorithms take account of them being rubbish

GHIf this algorithm is potentially going to be used for address matching other address data in the future, e.g. local authority addresses, this is rarely separated into address lines, but comes as one concatenated address. How would you parse this?

Address fields for matching

Discovery formatted field	ABP equivalents
Post code	DPA -> BLPU – GHThis should come from the BLPU tablePost code LPI –> BLPU Post code
Locality	DPA – locality LPI –> Street Descriptor-> Locality
Dependent locality	DPA- dependent locality
Street	DPA- thoroughfare LPI – > Street Descriptor / Street
Dependent thoroughfare	DPA- Dependent thoroughfare
Number	DPA – Building number LPI – Primary address object – number or range made up of Pao start+ pao suffix ..{+- “-“} +pao end + pao end suffix e.g. 45a-56b 45
Building name or house name	DPA – Building LPI – Primary address object – pao text
Flat number and flat number prefix	DPA – sub building LPI – Secondary address object number or range made up of either Sao start+ sao start suffix +- {“-“} + sao end + sao end suffix e.g. 1a- 1b 1a 1 Or sao text e.g. flat 1

=== Field and content manipulations ===The matching algorithms dictate how the fields and their contents are manipulated to match to the ABP.

==== Spelling and word replacement ====As part of the algorithms there are a set of generic methods that can be applied to content at different times in the GHCheck if I have any additional

algorithms

  1) Lower case if case dependency 
  1) Removal of ‘ with null 
  1) Replacement of . with ‘ ‘ ( replace full stop with space) 
  1) Replacement of / with - (replace slash with dash for flat ranges 58/60 58-60 
  1) Leading and trailing spaces removed 
  1) Replacement of ‘ ‘ with ‘ ‘ i.e. double space with space 
  1) Spelling DSBoth abbreviations and some spellings

GHAddress abbreviations corrections (e.g. flt-> flat bld ->building ) from spell correction list

  1) Word swapping for equivalent words (e.g. house-> building, apartment -> building, nursing-> care, upper->first) 
  1) Plural/ singular correction i.e. removal of trailing ‘s’

For a the flat field starting with ‘flat’ ‘DSYes unit workshop

GHAnd ‘apartment’

room’ etc then these can be removed as they are used inconsistently.

i.e. Flat field = ‘Flat 1a’ and Flat field = ‘1a’ and Flat field= ‘room 1a’ would be considered equivalent

==== Levenshtein distance ====For non-exact rankings, equivalent spellings are checked using the GHCan you give examples please?

Thorngrove road, thrngrove road. Levenshtein distance = 1

Levenshtein distance algorithm whereby

  1) A distance of 1 is considered acceptable with a minimum phase length of 10 
  1) A distance of 2 with a minimum phrase length of 10 is acceptable 
  1) A distance of 3 with a phrase length of > 9 would be acceptable 
  1) Parameterised maximum distance and length e.g. post code would be 2, with a minimum length of 5

==== Formatting addresses ====In order to optimise matching both the inbound Discovery addresses and the ABP candidate addresses should be formatted to the same address class using the same formatting algorithms.

Discovery addresses can be formatted when submitted. However, as a performance improvement DSI meant creating things like 19a-19b from the secondary address fields to improve performance. It isn’t logically required, just a performance boost (10 fold if used in indexes)

DS

GHDoes ABP need formatting? Can you give an example?

ABP addresses should be pre-arranged to the same common format (as described above) and stored before any address is submitted.

Methods such as character and space removal and spelling corrections and flat suppression should be applied during the formatting before storage. Other algorithms such as plural/ singular may be applied to indexing and other algorithms such as word swapping Levenshtein distance applied to Discovery addresses at run time.

N.B Towns and cities may be GHAgreed.

ignored as it is assumed that the address submission interface also contains an area location parameter list such as the district part of the post code (e.g. e+ ec + for east and central London)

==== Creating address fields from LPI and DPA files ====The flat, building, number, dependent thoroughfare, street, dependent locality, locality and post code can be created from the LPI and DPA files using the field specification.

For example where an LPI has secondary address fields, of start, suffix, end ,suffix:

e.g. saos=1, saosf=a, saoe=4, saoef=4b then the flat ‘1a-4b’ range may be created

==== Creating address objects from address lines ====This function applied both to Discovery addresses (variable number of lines) and ABP addresses (fixed number of lines)

Firstly deduce the start and end line count in order to avoid combining town or city lines into the core 4 fields.

Return Class is Address containing , Flat, Building, number, street, post code

$format (address lines, address) // address lines, at least 2 with the last being the post code

Lower case the address

Strip double spaces, and remove /.’, characters

Line count is the number of lines -1 (i.e lines before post code)

== Preparing the environment ===== Database preparation ===It is necessary to prepare the database and prepare the APB address entries for optimisation. This consists of 4 stages.

==== ABP files ====ABP files are documented and should have referential integrity. The following tables are significant

     1) BLPU table 
     1) Street descriptor table 
     1) DPA table 
     1) LPI table

Load files, lower case, removal of “., ’ characters

Note, as submissions will have an area prefix, that it may be preferable to shard by first post code character in order to:

  1) Enable rapid updates to one section from the ABP source 
  1) Lateral scalability 
  1) Development and Beta testing

==== Indexing strategies ====The following indexes are useful:

  1) Post code 
  1) UPRN 
  1) USRN+ Language for Street description (welsh and English street names for same USRN)

==== Pre-prepared ABP addresses ====It is useful to create consistent ABP addresses based on an the DPA layout but populated with the LPI details as follows:

  1) UPRN 
  1) Table source (DPA or LPI) 
  1) Table primary key (FK to LPI or DPA table no RI) 
  1) Flat (including range and suffixes e.g. 90a-90c) 
  1) Building 
  1) Number (including range and suffixes 1a- 3c) 
  1) Dependent thoroughfare (DPA field, null if LPI) 
  1) Street 
  1) Dependent locality (DPA field, null if LPI) 
  1) Locality 
  1) Post code 
  1) organisation

Town or city are not required

==== Prepared address indexes ====The following compound indexes (including null values) may be useful against the prepared address table

  1) Post code, street, number, building, flat 
  1) Post code, dependent thoroughfare, number, building, flat 
  1) Street, number, post code 
  1) Building, flat, post code 
  1) Organisation, flat, post code

==== Concatenated address index ====For exact matching an index of concatenated address fields with removal of double spaces and leading and trailing spaces is useful

A concatenated address

Post code + ‘ ‘ + flat + ‘ ‘ + building + ‘ ‘ + number + ‘ ‘ dependent th + ‘ ‘ + street + dependent locality + ‘ ‘ + locality

==== Correction lists ====The following correction tables are useful

  1) City list e.g. London 
  1) Known word corrections e.g. flt -> flat, cr -> crescent 1^st= first 
  1) Flat equivalent word list e.g. flat, room, apartment 
  1) DSNot sure about this now. I think it’s a hang over.

GH? Drop word list e.g. ‘the’

  1) “Is a DSYes, but got most of them I think

GHCan use ABP table 15_STreetDescriptor to help with this road” list e.g. road, avenue, street, grove, row, drive, walk, way, square

  1) Swap lists where one word may be swapped for another if in the candidate e.g. apartment -> building, road -> street, nursing -> care, upper-> first 
  1) Number equivalence list

Matching

The idea is to produce one candidate UPRN and address from the ABP addresses indexes.

Algorithms proceed in rank order, the ranking designed so that if an address is found, this would be the most likely. Algorithms may produce more than one match for one algorithm, in which case the preference algorithm is the run.

When a match is found, the process may stop. Not all matches are equivalent and there are grades of likelihood, divided into different levels of approximation or qualifiers.

There are 5 qualifiers of a match, each with a spectrum of probability (likelihood)

  1) Equivalent match i.e. we believe the property is the property that the UPRN refers to 
  1) Near match. We believe the property is the property that the UPRN refers to but the flat or property may be split. This only applies to matches where the flat or number suffix in either the Discovery or the ABP addresses may be ignored or dropped. This is in effect equivalent. 
  1) Sub- property (or child). The Discovery address represents a sub-property of the UPRN property. For example the Discovery address “flat 1a Eagle house” may only match to the higher level DPA or LPI entry of “Eagle house” 
  1) The Discovery address represents a supra-property (parent) of the UPRN property. For example the Discovery address ‘1 Angel Lane’ does not have an equivalent in DPA or LPI but ‘flat 11, 1 Angel Lane’ does exist so it matches to a more detailed identifier qualified as a supra-property 
  1) The Discovery address may represent a sibling of the UPRN property. For example the Discovery address ‘flat 12 , 1 Angel Lane’ may not exist in either ABP file but ‘flat 11, 1 Angel Lane’ does.

=== Learning street spelling corrections ===The systems should learn potential street spelling corrections (or street equivalents) from the submitted addresses. This can then be used to find addresses where the Discovery street or building spelling is incorrect and is a performance enhancement. This may be updated prior to processing a large batch of addresses where response times are not required. For each learned spelling correction, subsequent submissions that contain misspelled streets will be more DSYes, this is how Google works.

GHYes, probably lots of opportunity for machine learning with this

efficient. However, the system may also attempt to match street equivalents in real time.

==== Pre-Population of Street name corrections ====A bulk load of streets is used to generate a list of misspelled streets and their potentially correct equivalent, the correct equivalent being a street in the ABP database.

To avoid overmatching it is worthwhile accepting the first 2 or 3 characters from a submitted street and matching to any ABP street with a distance score of 2 or less.

=== Matching steps ======= Batch publisher matching ====It is assumed that one area is DSShould have said ‘area’

GHSubmitted?

submitted at a time via the requesting API.

==== Gather area post code list ====List the area post codes submitted by the publisher via the API (e.g. ‘E’ or ‘EC’) . This ensures that duplicates are reduced by limiting the scan to the set of addresses in a post code area

==== Formatting of Submitted Discovery address ====Discovery address should be manipulated and formatted before the matching algorithms take place. Most of the matches occur as a result of this manipulation and the more effective the manipulation the higher the chance of a match.

==== Find a match ====This applies the 100+ algorithms to the Discovery address object in order to find the best match. Algorithms are run in Rank order in order to avoid a premature approximate match when a better match is available using different manipulations.

Whilst errors occur in the ABP files it is assumed that Discovery addresses are in error and it is thus Discovery addresses that are manipulated

==== Preference algorithm for more than one match of same rank ====Occasionally an algorithm can find more than one match and a decision has to be made as to which one is preferred.

==== Approved and Inactive UPRNS ====An approved UPRN are preferred over an alternative but historical (inactive) UPRN’s are included.

== Algorithms ===== Rank examples ===The flat problem.

Discovery address = 10 a Mildenhall road e50ru

Potential e50ru, Mildenhall road candidates

  1) Number = 10, Street= Mildenhall road 
  1) Flat = basement flat, Number = 10, street=Mildenhall road 
  1) Flat= Ground floor flat, Number =10, street= Mildenhall road 
  1) Flat=First floor flat, Number = 10, street= Mildenhall road

Rank 1 --> exact field match : FAIL Not exact on number

Rank 11 -> Multi-floor inference rule: Matches to -> Basement flat, Number = 10, Road = Mildenhall road

In the Rank 11 algorithm all possible flats for that number were collected and ‘a’ matches to the lowest GH

GHMuch needed With Tower Hamlets addresses

floor

=== Ranked matching algorithms ===These are documented in the associated spreadsheet

=== Address preformatting ===Before submitting an address for matching it should be formatted to align it with the address class fields. The following table shows an address formatting algorithm in mumps

format(adrec,address) ; ;Populates the discovery address object

;initialise address field variables n d s d="~" set adflat="" set adbuild="" set adbno="" set adepth="" set adeploc="" set adstreet="" set adloc="" set post=""

Lower case the address, remove characterset /. double spaces set d="~" set address=$$lc^UPRNL(adrec) set address=$tr(address,"'") set address=$tr(address,"/","-") set address=$tr(address,"."," ") for q: (address'[" ") set address=$$tr^UPRNL(address," "," ")

get the post code iin the last field and remove spaces set length=$length(address,d) set post=$$lc^UPRNL($p(address,d,length)) set post=$tr(post," ") ;Remove spaces

Try to find how many address lines and which is which ;Find the city line as this can be discarded and use lines before the city if present ;addlines is number of address lines to format set addlines=length-1 for split=length-1: -1:2 do .if $p(address,d,split)'="" d ..if $$iscity($p(address,d,split)) D // If it’s a city then the line count is one less ...set addlines=split-1

Pack address lines – remove null fields f i=1: 1:addlines do .i $p(address,d,i)="" d ..s address=$p(address,d,0,i-1)_d_$p(address,d,i+1,i+4) ..s addlines=addlines-1

;may have too many address lines with number by itself in field 1 I addlines>2 d .i $p(address,d,1)?1n.n,$p(address,d,2)?1l.e d ..s $p(address,d,1)=$p(address,d,1)_" "_$p(address,d,2) ..s address=$p(address,d,1)_d_$p(address,d,3,10) ..s addlines=addlines-1

Still too many, number by itself in field 2 i addlines>2 d .i $p(address,d,2)?1n.n,$p(address,d,3)?1l.e d ..s $p(address,d,2)=$p(address,d,2)_" "_$p(address,d,3) ..s address=$p(address,d,1,2)_d_$p(address,d,4,10) ..s addlines=addlines-1

Peculiar post code split across two lines ;if $e(post,1,$l($p(address,d,4)))=$p(address,d,4) set addlines=3

Initialise address line variabs

set adbuild=$p(address,d) ;By default first is house name/ building

set adstreet=$p(address,d,2) ;By default second is street

set adbuild=$$lt^UPRNL(adbuild)

set adstreet=$$lt^UPRNL(adstreet)

if >2 lines then last is locality and append street ;if >3 lines then dependent street if addlines=5 d .set adloc=$p(address,d,addlines) .set adeploc=$p(address,d,addlines-1) .set adepth=$p(address,d,2) .set adstreet=$p(address,d,3) if addlines=4 d ;Must have street dependency .s adeploc=$p(address,d,addlines-1) .s adloc=$p(address,d,addlines) i addlines=3 d .s adloc=$p(address,d,addlines) .I $tr(adstreet," ")=post d ..s adstreet=adloc,adloc="",$p(address,"~",2)=""

Dependent locality is street i adeploc'="" d .i $$isroad(adeploc),'$$isroad(adstreet) d ..if adstreet?1l.e,adeploc?1n.n."-".n1" "1l.e d ...i adstreet["flat" d ....s adbuild=adstreet_" "_adbuild ....s adstreet=adeploc ....s adeploc="" ...e d ....s adbuild=adbuild_" "_adstreet ....s adstreet=adeploc ....s adeploc=""

Location is street, street is building if $$isroad(adloc),'$$isroad(adstreet) do .if adloc?1n.n1" "1l.e d q ..if adstreet?1n.n do ...i adbuild?1l.l.e d q ....s adbuild=adstreet_" "_adbuild ....s adstreet=adloc ....s adloc="" .i adstreet?1n.n!(adstreet?1n.n1"-"1n.n)!(adstreet?1n.n1l) do q ..s adstreet=adstreet_" "_adloc ..s adloc="" .i adflat="" d q ..s adflat=adbuild ..set adbuild=adstreet ..set adstreet=adloc ..set adloc="" .set adbuild=adbuild_" "_adstreet .set adstreet=adloc .set adloc=""

Only one line, likely to be street But may be flat and building if addlines=1 d .set adstreet=$p(address,d),adbuild=""

i addlines=2,adstreet?1n.n,adbuild'="" d

.s adstreet=adstreet_" "_adbuild

.s adbuild=""

Location is actually number and street if adloc?1n.n.l1" "1l.e d .if adstreet'?1n.n.l1" ".e d ..set adbuild=adbuild_" "_adstreet ..set adstreet=adloc ..set adloc=""

Street starts with flat number so swap ;May or may not contain building if $$isflat^UPRNU(adstreet) d ;Might be flat .if '$$isroad(adstreet) do q ;straight swap ..set xbuild=adbuild ..set adbuild=adstreet ..set adstreet=xbuild .else d ..if $$isno($p(adstreet," ",3)) do ...if adbuild'="" d ....set adbuild=$p(adstreet," ",1,2)_" "_adbuild ...else d ....s adbuild=$p(adstreet," ",1,2) ...set adstreet=$p(adstreet," ",3,20)

Ordinary flat building various formats, split it up if adflat="" do flatbld(.adflat,.adbuild)

Ordinary street format , split it up do numstr(.adbno,.adstreet)

Is number in the flat field? if $$isno(adflat) d .if adbuild="" d ..if adbno="" d ...set adbno=adflat ...set adflat=""

Building is street,street is null or not ;111 abbotts park road, , ;111 abotts park road, leyton,, ;111 abbotts park road , leyton, leyton if $$isroad(adbuild) do .if adbno="",adstreet="" d q ..set adbno=adflat ..set adstreet=adbuild ..set adflat="",adbuild="" .if adbno="",adloc="" d ..i '$$hasflat^UPRNU(adbuild) d Q ...set adloc=adstreet ...set adbno=adflat ...set adstreet=adbuild ...set adflat="" ...set adbuild="" ..d splitstr(adflat,adbuild,adbno,adstreet,.adflat,.adbuild,.adbno,.adstreet) .if '$$isroad(adstreet) do ..if adbno="" do q ...if adstreet=adloc d ....set adstreet=adbuild ....set adbuild="" ....set adbno=adflat ..if adbno'="" do ...set xbuild=adbuild ...set xflat=adflat ...set adbuild=adstreet ...set adflat=adbno ...set adbno=xflat ...set adstreet=xbuild

Building is number,make sure street doesn't have the number ! ;Number contains flat so assign number to flat if adbno?1n.n.l1" "1n.n.l d .s adflat=$p(adbno," ") .s adbno=$p(adbno," ",2)

;Strip space from number to assign suffix if adbno?1n.n1" "1l s adbno=$tr(adbno," ")

Street is a number, locality is the street if $$isno(adstreet) d .if adbno'="" d ..s adbno=adstreet ..s adstreet=adloc ..s adloc=""

Locality is street, street is building if $$isroad(adloc) d .if adflat="",adbuild="" d ..s adflat=adbno,adbno="" ..s adbuild=adstreet ..s adstreet=adloc,adloc=""

Confusing flat number now split out if $$isflat^UPRNU(adbuild) d .if adflat=adbno d ..s adflat=$p(adbuild," ",1,2) ..s adbuild=$p(adbuild," ",3,10) .else d ..if adflat'="" d ...if adbuild?1l.l1" "1l1" "1l.e d q ; room f unite stratford ....set adflat=adflat_" "_$p(adbuild," ",1,2) ....set adbuild=$p(adbuild," ",3,20) ...i adbuild?1l.l1" "1l d q ; room h ....set adflat=adflat_" "_$p(adbuild," ",1,2) ....set adbuild=$p(adbuild," ",3,20)

Street has flat name and flat has street if $$isflat^UPRNU(adstreet) d .if adflat?1n.n d ..if adbuild'="" d ...n flatbuild ...set flatbuild=$S(adbno'="": adbno_" ",1:"")_adstreet ...set adbno=adflat ...set adstreet=adbuild ...set adflat=$p(flatbuild," ",1,2) ...set adbuild=$p(flatbuild," ",3,20) ...if adbuild?1l do ...set adflat=adflat_" "_adbuild,adbuild=""

Duplicate flat building number and street,remove flat and building if adflat'="",adbuild'="",adbno'="",adstreet'="" d .if $e((adbno*1)_" "_adstreet,1,$l((adflat*1)_" "_adbuild))=((adflat*1)_" "_adbuild) d ..set adflat="",adbuild=""

first floor 96a second avenue ;street contains flat term before the number n i,word if adbno="" do .for i=2: 1:$l(adstreet," ") do ..set word=$p(adstreet," ",i) ..if word?1n.n.l do ...if adflat="",adbuild="" d ....set adflat=$p(adstreet," ",1,i-1) ...else do ....if adflat'="" do .....if adbuild="" d ......set adbuild=$p(adstreet," ",1,i-1) .....else do ......s adbuild=adbuild_" "_$p(adstreet," ",1,i-1) ...set adbno=word ...set adstreet=$p(adstreet," ",i+1,20)

;street contains flat number near the end if adstreet[" flat " d .set adflat="flat "_$p(adstreet,"flat ",2,10) .set adstreet=$p(adstreet," flat",1)

Bulding is number suffix ; a~12 high street if adbuild?1l,adflat="",adbno?1n.n do .set adbno=adbno_adbuild .set adbuild=""

Street number mixed with flat and building ;20 284-288 haggerston studios~ kingsland road if adbuild?1n.n1" "1n.n."-".n1" "1l.e do .if adflat="",adbno="" d ..set adflat=$p(adbuild," ",1) ..set adbno=$p(adbuild," ",2) ..set adbuild=$p(adbuild,3,20)

duplicate flat number in building number without street ;46, 46 ballance road if adbuild?1n.n1" "1n.n do .if adbno="",adflat="" do ..set adbno=$p(adbuild," ",2) ..set adflat=$p(adbuild," ",1) ..set adbuild=""

110 , 110 carlton road ;Duplicate flat and number if adflat=adbno,adbuild="" d .set adflat=""

street number is in location! ; bendish road , 11 if adloc?1n.n,adbno="" do .set adbno=adloc .set adloc=""

Error in flat number ;flat go1 if $p(adflat," ",2)?1l.l1"o"1n.n d .set $p(adflat," ",2)=$tr($p(adflat," ",2),"o","0")

Now has flat as number and number still in street ;,,flat 1, 22 plashet road if adbno'="",$$isno($P(adstreet," ")) do .if adflat="",adbuild="" d ..set adflat=adbno ..set adbno=$p(adstreet," ") ..set adstreet=$p(adstreet," ",2,20)

building is the number if $$isno(adbuild),adstreet'="",adbno="" do .set adbno=adbuild .set adbuild=""

suffixes split across fields if adflat'="",adbuild?1l1" "1l.e do .set adflat=adflat_$e(adbuild) .set adbuild=$p(adbuild,2,20) .set adbuild=$p(adbuild,2,20)

if adbno'="",adstreet?1l1" "1l.e d

.I $e(adstreet)'="y" d

..set adbno=adbno_$e(adstreet)

..set adstreet=$p(adstreet," ",2,20)

Two streets if $$isroad(adloc),$$isroad(adstreet) do .if adflat="",adbuild="" do ..set adflat=adbno ..set adbuild=adstreet ..set adbno="" ..set adstreet=adloc ..set adloc="" ;

009 ;strip leading zeros if adflat?1n.n set adflat=adflat*1 if adbno?1n.n set adbno=adbno*1

Building ends in number i adbno="",adflat="",adstreet?1l.l1" "1l.l.e d .i $p(adbuild," ",$l(adbuild," "))?1n.n d ..s adbno=$p(adbuild," ",$l(adbuild," ")) ..s adbuild=$p(adbuild," ",1,$l(adbuild," ")-1) ;Correct spelling s adbuild=$$correct^UPRNU(adbuild) s adstreet=$$correct^UPRNU(adstreet) set flat=$$flat^UPRNU($$correct^UPRNU(adflat))

set address object values s address("flat")=flat s address("building")=adbuild s address("number")=adbno s address("deploc")=adeploc s address("depth")=adepth s address("street")=adstreet s address("locality")=adloc s address("postcode")=post s short="",long=""

eform q

splitstr(oflat,obuild,obno,ostreet,adflat,adbuild,adbno,adstreet)

;Splits up building into street and vice versa

n i,xbuild,xstreet

f i=1:1:$l(obuild," ") d

.i $p(obuild," ",i)?1n.n d

..i $$hasflat^UPRNU($p(obuild," ",i+1,i+10)) d

...s adbno=adflat

...s xstreet=adstreet

...s adstreet=$p(obuild," ",0,i-1)

...s adflat=$p(obuild," ",i,i+10)

...s adbuild=xstreet

q

isno(word) ;is it a number

if word?1n.n q 1

if word?1n.n1l q 1

if word?1n.n1"-"1n.n q 1

if word?1n.n1l1"-"1n.n1l q 1

q 0

flatbld(adflat,adbuild) ;

;is it a flat or number and if so what piece is the rest?

if $$isflat^UPRNU(adbuild) do q

.set adflat=$p(adbuild," ",1,2)

.set adbuild=$p(adbuild," ",3,10)

.if adbuild?1l1" ".e d

..set adflat=adflat_$p(adbuild," ")

..set adbuild=$p(adbuild," ",2,20)

2nd floor flat etc i adbuild'="" s $p(adbuild," ")=$$correct^UPRNU($p(adbuild," "))

18pondo road if adbuild?1n.n2l.l1" "2l.e do q .n i .f i=1: 1 q:$e(adbuild,i)'?1n d ..set adflat=adflat_$e(adbuild,i) .set adbuild=$p(adbuild,adflat,2,10)

19a if adbuild?1n.n.l do .set adflat=adbuild .set adbuild="" if adbuild?1n.n1" "1l do .set adflat=$p(adbuild," ")_$p(adbuild," ",2) .set adbuild=""

19 a eagle house if adbuild?1n.n1" "1l1" ".e do .set adflat=$p(adbuild," ",1)_$p(adbuild," ",2) .set adbuild=$p(adbuild," ",3,20)

18dn forth avenue if adbuild?1n.n2l1" "1l.e d .set adflat=$p(adbuild," ",1) .set adbuild=$p(adbuild," ",2,10) .set adflat=adflat*1

19 eagle house or garden flat 1 if adbuild?1n.n.l1" "1l.e do .set adflat=$p(adbuild," ",1) .set adbuild=$p(adbuild," ",2,20)

19a-19c eagle house if adbuild?1n.n.l1"-"1n.n.1" ".l.e do .set adflat=$p(adbuild," ",1) .set adbuild=$p(adbuild," ",2,20)

19- eagle house if adbuild?1n.n1"-"1" "1l.e do .set adflat=$p(adbuild,"-",1) .set adbuild=$p(adbuild," ",2,20)

first floor flat if adbuild[" flat"!(adbuild[" room"),adflat="" do .set adflat=adbuild .set adbuild=""

116 - 118 if adbuild?1n.n.l1" "1"-"1" "1n.n.l.e do .set adflat=$p(adbuild," ",1)_"-"_$p(adbuild," ",3) .set adbuild=$p(adbuild," ",4,20)

12 -20 rosina street if adbuild?1n.n1" "1"-"1n.n1" "1l.e do .set adflat=$p(adbuild," ",1)_$p(adbuild," ",2) .set adbuild=$p(adbuild," ",3,20)

a cranberry lane if adbuild?1l1" "1l.l1" "1l.e do .set adflat=$p(adbuild," ") .set adbuild=$p(adbuild," ",2,10)

a203 carmine wharf ;dlg02 carminw wharf if adbuild?1l.l1n.n.1" "1l.e do .set adflat=$p(adbuild," ") .set adbuild=$p(adbuild," ",2,20)

b202h unit building if adbuild?1l1n.n.l1" "1l.e do .set adflat=$p(adbuild," ",1) .set adbuild=$p(adbuild," ",2,20)

flaflat 10 mileset lodge if $p(adbuild," ")["flat" do .I $p(adbuild," ",2)?1n.n.l d ..set adflat="flat"_" "_$p(adbuild," ",2) ..set adbuild=$p(adbuild," ",3,20) .e d ..if adflat'="" d ...set adflat="flat "_adflat ...set adbuild=$p(adbuild," ",2,20)

workshop 6 if adflat="",adbuild?1.l1" "1n.n.l do .s adflat=adbuild .s adbuild=""

q

numstr(adbno,adstreet) ;

;Reformat a variety of number and street patterns

11 high street if adstreet?1n.n1" "2l.e do q .set adbno=$p(adstreet," ",1) .set adstreet=$p(adstreet," ",2,10)

100 S0oth if adstreet?1n.n1" "1l.n.l.e d q .set adbno=$p(adstreet," ",1) .set adstreet=$p(adstreet," ",2,10)

123-15 dunlace road if adstreet?1n.n1"-"1n.n1" "1l.e do q .set adbno=$p(adstreet," ",1) .set adstreet=$p(adstreet," ",2,20)

11a high street if adstreet?1n.n1l1" "1l.e do q .set adbno=$p(adstreet," ",1) .set adstreet=$p(adstreet," ",2,20)

14 - 16 lower clapton road if adstreet?1n.n1" "1"-"1" "1n.n1" "1l.e do q .set adbno=$p(adstreet," ",1)_"-"_$p(adstreet," ",3) .set adstreet=$p(adstreet," ",4,10)

109- 111 Leytonstone road.... if adstreet?1n.n1"-"1" "1n.n1" ".l.e do q .set adbno=$p(adstreet," ",1)_$p(adstreet," ",2) .set adstreet=$p(adstreet," ",3,20)

109a-111 Leytonstone road.... if adstreet?1n.n1l1"-"1n.n1" "1l.e do q .set adbno=$p(adstreet," ",1) .set adstreet=$p(adstreet," ",2,20)

110haley road if adstreet?1n.n2l.l1" "2l.e do q .n i .f i=1: 1 q:$e(adstreet,i)'?1n d ..set adbno=adbno_$e(adstreet,i) .set adstreet=$p(adstreet,adbno,2,10)

1a if adstreet?1n.n1l do q .set adbno=adstreet .set adstreet=""

99 a high street if adstreet?1n.n1" "1l1" ".e do q .if $p(adstreet," ",2)="y" d ..set adbno=$p(adstreet," ",1) ..set adstreet=$p(adstreet," ",2,20) .e d ..set adbno=$p(adstreet," ",1)_$p(adstreet," ",2) ..set adstreet=$p(adstreet," ",3,20)

9a-11b high street if adstreet?1n.n1l1"-"1n.n1l1" ".l.e do q .set adbno=$p(adstreet," ",1) .set adstreet=$p(adstreet," ",2,20)

10-10a blurton road if adstreet?1n.n1"-"1n.n1l1" "1l.e d .set adbno=$P(adstreet," ",1) .set adstreet=$p(adstreet," ",2,20)

99- high street if adstreet?1n.n1"-"1" "1l.e d .set adbno=$p(adbuild,"-",1) .set adstreet=$p(adstreet," ",2,20)

westdown road 99 i $p(adstreet," ",$l(adstreet," "))?1n.n d .s adbno=$p(adstreet," ",$l(adstreet," ")) .s adstreet=$p(adstreet," ",0,$l(adstreet," ")-1) q

isroad(text) ;

if text["road"!(text["street")!(text["avenue")!(text["court")!(text["square")!(text["drive")!(text[" way")!(text[" lane") q 1

if text[" grove"!(text[" row")!(text[" close") q 1

if text[" walk"!(text[" mews")!(text["causeway") q 1

if text[" park" q 1

I $p(text," ",$l(text," "))="st" q 1

q 0

iscity(text) ;

n word,done

s done=0

s word=""

for s word=$O(^UPRNS("CITY",word)) q:word="" d

.i text[word s done=1

q done

@@ Line 180: / Line 180: @@
 Once Loaded, reformatted and super-indexed. The address database is ready to be used
-=== Candidate address handling ===
+=== Candidate address pre-formatting ===
-It is assumed that an address is submitted as one or more delimited fields with the post code at the end of available.  Optionally, an area based qualifier  is included, which is useful of there is no post code. For example a practice post code or simply a list of major postcodes. These narrow down the options when checking against 100 million+ addresses.
+It is assumed that an address is submitted as one or more delimited fields with the post code at the end of available.
+Optionally, an area based qualifier  is included, which is useful of there is no post code. For example a practice post code or simply a list of major postcodes. These narrow down the options when checking against 100 million+ addresses.
-==== Candidate address formatting ====
 Candidate addresses are subject to the same reformatting rules as standard addresses (spell checking etc). However this needs to be done in two stages