This is an attempt at a 101 of 101s (Vinod and Pinal know why we have to do this :( ). I have been interested in de-duplication for the longest time and wish DQS had been released earlier, so that with SQL Server 2012 it could have had a "fresh" start.
In any case, here we are looking at data from http://fcs.up.nic.in/upfood/helpline/ReportRegidWise.aspx. The idea is to find potential duplicates using a set of attributes like card holder name, father/husband name, and mother name. The data is represented in Excel here (it could just as well live in SQL).
We need to choose the data source, which is the Excel file in our case.
An important element of matching is defining the domain – which is essentially data about the data: metadata about the columns we are going to examine. It involves giving, at minimum, a name and a datatype (date/integer/decimal/string). For the name columns we use string data with the language set to "Other"; this ensures Hindi etc. can be compared.
Once the domain is defined we need to map the columns to the domains, which is a pretty straightforward process.
Now we need to define the rules for matching. In our case we compare rows based on 3 attributes (card holder name, father/husband name, mother name). One can do an exact match, but we will choose the similarity-based lookup. We also need to give each column a weight, which determines how much it dominates the comparison. We have chosen 70/20/10 across the cardholder, father, and mother names.
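DQS does not publish its matching internals, but the idea of a weighted, similarity-based comparison across the three columns can be sketched in Python. Here `difflib`'s ratio stands in for whatever similarity measure DQS actually uses, and the column names and sample rows are made up for illustration:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical column names; the 0.70/0.20/0.10 split mirrors the
# 70/20/10 weights chosen in the matching rule above.
WEIGHTS = {"cardholder": 0.70, "father": 0.20, "mother": 0.10}

def match_score(row_a: dict, row_b: dict) -> float:
    """Weighted similarity across the three name attributes."""
    return sum(w * similarity(row_a[c], row_b[c]) for c, w in WEIGHTS.items())

# Made-up sample rows: same cardholder, misspelled father, missing mother.
row_a = {"cardholder": "Kailash", "father": "Ram Prasad", "mother": "Sita Devi"}
row_b = {"cardholder": "Kailash", "father": "Ram Parsad", "mother": ""}
score = match_score(row_a, row_b)
```

A pair scoring above some chosen threshold (say 0.8) would then be flagged as a potential duplicate for review.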
After this, it is a simple matter of starting the matching/profiling run.
Once you look at the matched results – there is a Mr. Kailash who has a same-sounding father's name in two villages – potentially a duplicate.
Then we have the exact matches.
The profiler can show the need for data cleansing/massaging. Honestly, every de-duplication exercise should first undergo a normalization/cleansing process followed by physical verification, as data can indicate only so much. It is also better to have more columns to compare, to provide granular control.
Nearly 40% of the data for mother's name is missing, as is the card # :(. Aadhar could simplify enrollment by cleaning up the duplicates from here and comparing against other sources – unfortunately they wanted a "completely new" enrollment etc.
De-duplication and cleansing are required whether you are creating a DW or just cleaning the data. Imagine creating a customer 360 view where data for a customer comes from multiple systems. Unfortunately, as can happen, information about the customer in different systems may be represented in slightly different ways. One needs to cleanse and de-duplicate this data before using it to make decisions.
String comparison functions in SQL Server –
SQL Server has the = operator, which compares exact values. The LIKE operator goes a step further and matches approximately similar strings, but misspellings are not its strength. SOUNDEX/DIFFERENCE play a role in comparing two strings phonetically. Master Data Services adds the Similarity function. This function is unique in that it offers the Levenshtein edit distance, Jaccard similarity coefficient, Jaro-Winkler distance, and longest-common-subsequence algorithms.
Jaro-Winkler is usually the best algorithm, but this should be verified against your own data.
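To make the list above concrete, here are minimal plain-Python sketches of two of those measures – the Levenshtein edit distance and a Jaccard coefficient over character bigrams. These illustrate the ideas, not MDS's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard_bigrams(a: str, b: str) -> float:
    """Jaccard coefficient over sets of character bigrams:
    |intersection| / |union|, in [0, 1]."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(A & B) / len(A | B) if A | B else 1.0
```

The edit distance counts keystroke-level typos, while the bigram Jaccard is more forgiving of transpositions and word-order changes – which is exactly why a function like Similarity exposes several measures.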
Hopefully in the future the DQS/MDS/phonetic pieces get integrated for simple usage.
Use Case Scenarios
Banks, insurance agencies, and government programs (PDS/election card) all have data that needs cleansing and scalable de-duplication services. Sometimes data entry errors, like reversal of first and last names, result in names failing to match.
Generally de-duplication involves a bit more complexity – other attributes like age, location, and address also play a role. Checking whether two ages are within months of each other, rather than years, makes for an expensive query. Location/addresses have the added issue that they are not standardized the way US addresses are.
Name matching by itself can be applied to fraud detection, law enforcement, or counter-terrorism.
It is a hard problem because of
-misspellings, variations, cross-language bias, nicknames, name omission (English middle names are omitted; in Spain the maternal name can come before the first name), and data entry errors.
Comparison of methods
Many of the algorithms/techniques come from the biomedical world – especially gene-sequence matching. Most of them focus on gaps/transitions. One of them is Smith-Waterman-Gotoh – http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/gap-penalty-gotoh.pdf. For example, sequence similarity search – http://biochem218.stanford.edu/07Rapid.pdf, or http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/lecture3.pdf
A good reference, for when you have a lot of time: "A guided tour to approximate string matching" – Navarro (ACM paper).
Phonetic methods apart from Soundex include, for example, Metaphone, Double Metaphone, and NYSIIS.
Not all of them are applicable to all languages, or available in all databases.
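For context, the classic American Soundex – the algorithm behind SQL Server's SOUNDEX() – fits in a few lines. This is a simplified sketch of the standard rules, not SQL Server's code:

```python
# Digit codes for the standard American Soundex.
SOUNDEX_MAP = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(name: str) -> str:
    """Simplified American Soundex: first letter + three digits."""
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = SOUNDEX_MAP.get(name[0])
    for c in name[1:]:
        if c in "hw":             # h and w are ignored entirely
            continue
        d = SOUNDEX_MAP.get(c)    # vowels map to None and reset `prev`
        if d is not None and d != prev:
            digits.append(str(d))
        prev = d
    return (first + "".join(digits))[:4].ljust(4, "0")
```

Same-sounding names collapse to the same code – "Robert" and "Rupert" both come out as "R163" – which is how a phonetic pass can group candidates like the same-sounding father's names in the Kailash example above before a finer comparison runs.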
SSIS component for use
Look out for MelissaData, which matches on multiple columns, works from within SSIS, and has various algorithms implemented as part of it.
Need for reports/API
– DQS needs to support offline, API-based interaction to generate reports, without requiring a login into the application.
Need to have more transparency about algorithms/pluggability for different languages
– The MDS algorithms should be folded in here, with a choice available for picking the relevant one. Similarly, stemming algorithms should be pluggable and available for "choice". More phonetic algorithms should be added.