A Novel Method for Validating Addresses Using String Distance Metrics

Address validation is vital since it confirms the quality and geographical precision of addresses used by organizations that rely on location-dependent and delivery-based services. Suppose addresses need to be thoroughly checked in advance. In that case, there may be difficulties with them, such as missing components or geographical deficiencies, which may lead to severe problems with logistics. When doing address validation, discovering missing or incorrect address components is a beneficial aspect in minimizing the likelihood of service problems while saving time and money for organizations. When it comes to addressing validation, using statistical metrics like correlation coefficients and measures of central tendency has been discovered to have a significant amount of untapped potential. In order to obtain a normalized score that is based on statistical similarities, the approach that is suggested in this study makes use of a mixture of several string-matching metrics. This score may then be used to exclude authenticated addresses based on the needed minimum level of similarity, which can be calculated. Experiments have been carried out on a healthcare dataset taken from the actual world to show the efficacy of the suggested method in terms of accuracy and precision.


Introduction
Addresses are fundamental in pinpointing a geographical location on Earth.By improving the accuracy of addresses, considerable savings can be achieved in terms of time and money for organizations that rely on the precision of auto-generated addresses to maximize customer satisfaction [1].Automatic address generation is often done using reverse geocoding, which converts geographical coordinates into textual addresses [2].Even though validating these addresses, i.e., matching them with correct and verified addresses, seems to be a straightforward task, many complications exist while trying to perform the same.
Address validation, verifying the accuracy and precision of an auto-generated address by matching it with a true counterpart,

Open Access
Journal of Mechanical and Construction Engineering (JMCE) A2Z Journals must improve when faced with unstructured addresses with missing attributes or geographic inconsistencies [3].The problem caused by geographical inconsistencies can be seen in Fig. 1, which represents how missing elements in the generated address can be difficult to verify by direct comparison.The main contribution made by this paper includes proposing a system that resolves the issue of address validation by developing a novel algorithm that requires no pre-processing of the input addresses.
It uses statistical correlation measures as weights to combine different string-matching metrics and generate a normalized matching score.This matching score is then utilized to filter out the validated addresses and store them for further use.

Literature Review
Address validation has historically been a pure NLP (Natural Language Processing) task involving sequence labelling.Hidden Markov Models (HMMs) [4] and Conditional Random Field models [5] have been used as a deep learning approach to address validation but suffer in performance when given inputs of non-standardized addresses.In [6], the authors proposed a method using the BERT language model, which can help contextually model text data.However, only some of the existing systems can address validation with some preprocessing before moving to the task of validation, which may cause a compromise in precision, a problem that the proposed system aims to solve.
In order to build an efficient architecture for address validation, the system proposed in this paper uses the support of commonly used string-matching metrics to perform the task of address validation.String-matching metrics can be broadly classified into three major types, (i) Edit distance-based metrics, (ii) Token-based metrics and (iii) Hybrid metrics, a detailed explanation of which can be found in [7].Based on the comparisons carried out by [7] and [8], a set of six best-performing metrics was chosen to design the methodology followed in this paper.Lowenstein distance metric, which assigns unit cost to all edit operations [9], and Jaro Winkler distance, which also takes into account transpositions [10], are the edit-based metrics used, while Jaccard distance and Cosine distance are the token-based metrics utilized described in [11].The Monge Elkan hybrid metric [12] and the Burrows-Wheeler transform distance (BWT) [13] based on string compression have also been used as matching metrics to design the proposed system.

Terminology
a. Manual Address: Address fetched from a database and verified to be accurate and precise.b.Auto-generated Address: Address generated using reverse geo-coding that is to be validated.

Journal of Mechanical and Construction Engineering
(JMCE) A2Z Journals c.Auto Score (Manual Score): Normalized score in the range 0 to 1 indicating similarity, where 0 represents complete mismatch, and 1 represents a perfect match.d.Threshold: Matching score value to be used for filtering out valid auto-generated addresses (default is 0.5)

System Architecture
As represented in Fig. 2, the architecture of the proposed system consists of a database of manual addresses, the corresponding auto-generated addresses, the matching component used to generate a matching score and the filtering component, which uses a threshold value to filter out the validated addresses.The addresses to be compared pass through the matching component, which uses a combination of novel algorithms to generate a similarity score based on various string-matching metrics.The validation process of addresses is then completed by using the final component to filter out valid ad dresses and optionally store them in a database for further processing.

Proposed Methodology
This section provides a brief explanation of the system proposed in this paper, along with a description of the working of its key modules.As represented in the system architecture, the core constituent of the proposed system is the matching component.It generates a normalized matching score using a novel mathematical modelling approach to combine various stringmatching metrics.
The system uses a combination of statistical measures such as Karl Pearson's correlation coefficient and arithmetic mean, mathematical details of which can be found in 1.Take a subset of approximately 10 percent of addresses to be validated 2. For each pair of manual address and auto-generated address, assign a manual matching score by following the given process: 2.1 Start with a baseline score of 0.5 2.2 For each matching token between manual address and auto-generated address, add a value of 0.1 to the manual score, ensuring that score remains below 1 2.3 For each mismatched token in the two addresses: 2.3.1.If the token in manual address adds more specificity and is not present in auto-generated address, subtract a value of 0.1 to the manual score, ensuring that score remains below 1 2.3.2.If the token in auto-generated address adds more specificity and is not present in manual address, add a value of 0.1 to the manual score, ensuring that score remains below 1 3. Using each string-matching metric in set of metrics, calculate normalized distance between each manual address and auto-generated address 4. Calculate the correlation coefficient (corr) between scores generated by each metric and the manual score and store these coefficients as weights 5. Calculate the difference between mean of manual score and mean of scores generated by each metric and store these differences as scalers 1.For each address in auto addresses, repeat the following: 1.1 If autoscore for that address is greater than threshold, add the address to set of validated addresses 2. Store set of validated addresses in a database for further processing

Methodology
In order to demonstrate and prove the efficiency and accuracy of the system, extensive experimentation was carried out using Google Colab.The publicly available 'PMJAY Healthcare Database', details of which can be found on the official website [15], was used as a reference to get the manual addresses for experimentation.It contains a list of addresses of various healthcare centres in India, verified by the government of India.
Reverse geocoding was performed on the coordinates of the hospital locations using the reverse geocoder library in Python to generate a set of auto-addresses to be validated.A subset of 98 samples was then taken from this data, and manual scores were assigned to the corresponding auto-addresses.After calculating normalized distances using the six metrics as suggested in the algorithm, the correlation heatmap was obtained, as shown in Fig. 3.

Journal of Mechanical and Construction Engineering
(JMCE) A2Z Journals Similarly, the distribution trend of the manual score was compared with the distribution trends for each metric by using line plots, as shown in Fig. 4. In order to bring the distribution of the scores in the same range; the metric scores are adjusted with scalers, which represent the difference between mean values of the scores and the mean value of the manual score, which was equal to 0.567347.The weights and scalers to generate the auto-score were thus obtained in Table 1.The auto-scores were then calculated using the specified formula and the confusion matrix was plotted as shown in table 2, to estimate the accuracy of address validation done by the system.The confusion matrix shows that 76 addresses would be filtered out as validated addresses, with only one address wrongly validated as correct.The standard evaluation metrics were then calculated using the confusion matrix to obtain the results displayed in table 3, with a few sample outputs represented in fig. 5. Significant values of precision, which measures the proportion of positive identifications that are actually correct, and recall, that represents the proportion of actual positives correctly identified, were obtained, thus proving the efficiency of the proposed system for the task of address validation.

Conclusion
Address validation, as an NLP task, has often proven to be a long and tedious process that does not have a fixed solution due to the wide variety of techniques available to generate addresses automatically.However, with the help of the system proposed in this paper, address validation can be performed in a simple and efficient manner and can solve problems such as geographical inconsistencies and unformatted inputs to generate accurate matching scores.A combination of novel algorithms to generate a similarity score based on various string-matching metrics combined using weights as their correlation coefficients and adjusted by mean values has thus been explained and proven to be applicable to the address validation task successfully.
Experimental evaluations, carried out on a real-world dataset involving addresses of healthcare centers prove the effectiveness and practicability of these scores generated using the technique provided in this paper.In the future, this work can be extended by generating an API to make the proposed technique more usable and exploring artificial intelligence-based approaches to simplify the task of manual scoring for the system.

Figure 1 .
Figure 1.Problems in Address Validation

Figure 2 .
Figure 2. System Architecture [14], to achieve the validation task.Algorithm-1 calculates a set of parameters for generating a matching score in Algorithm-2, which passes the scores on to Algorithm-3, which filters out the ad-dresses Journal of Mechanical and Construction Engineering (JMCE) A2Z Journals based on the corresponding threshold.Algorithm-1: Generation of Weights for Matching Metrics Input: a. manual addresses: = set of addresses fetched from a database for the given locations, verified to be accurate b. auto addresses: = set of addresses to be validated which may have been generated using reverse geo-coding c. metrics: = set of string-matching metrics to be used, by default 6 metrics are used: [Levenshtein Distance, Jaro Winkler Distance, Jaccard Distance, Cosine Distance, Monge Elkan Distance and BWTRLENCD Distance] Output: a. weights: = set of weights to be used for generating matching score b. scalers: = set of numeric values to bring all metrics within the same range Algorithm: addresses and auto addresses b. metrics: = set of string metrics used in algorithm-1 c. weights: = set of weights generated by algorithm-1 d. scalers: = set of numeric values to bring all metrics in the same range generated by algorithm-1 Journal of Mechanical and Construction Engineering (: = set of matching scores for each pair of addresses in the range 0 to 1, indicating similarity where 0 represents complete mismatch and 1 represents perfect match Algorithm:1.Repeat the following steps to generate auto-score for each pair of manual and auto-addresses:1.1 Calculate values of normalized distance between each manual address and auto-generated address using given metrics 1.2 Calculate autoscore by using the formula and add to set of autoscores.where,n= number of metrics used scaler[metrici] = scaler value for metric 'i' from scalers array weight[metrici] = weight value for metric 'i' from weights array 2. Pass the set of addresses and auto_scores to the next component Algorithm-3: Filtering out validated addresses Input: a. auto addresses: = set generated by algorithm-2 b. autoscores: = set generated by algorithm-2 c. threshold: = value used for filtering out valid auto-generated addresses (default is 0.5) Output: a. set of validated addresses optionally stored in a database Algorithm:

Figure 3 .
Figure 3. Correlation heatmap between manual score and string-matching metrics

Figure 4 .
Figure 4. Comparison of the distribution of manual score and scores produced by string-matching metrics

Figure 5 .
Figure 5. Example scores generated by the proposed system (auto-generated addresses above compared with manual addresses below)

Table 1 .
Weights and Scalers obtained for each metric

Table 2 .
Confusion matrix for obtained results

Table - 3
: Results of experimentation