Data Quality Services(DQS) – deduplication/matching Hindi data with SQL Server

This is an attempt to do 101 of 101 (Vinod and pinal know the reason why we have to do this :()  .  I have been interested in de-duplication for longest time and wished DQS was released earlier so that with 2012 they could have had “fresh” start.

In any case here we are looking at data from http://fcs.up.nic.in/upfood/helpline/ReportRegidWise.aspx. Idea is to find potential duplicates using set of attributes like card holder name, father/husband name, mother name. This data is represented in excel here. (data could be present in SQL too).

Image

To do matching a dqs project of sql server (matching policy) needs to be created. Image

We need to choose the data source which is the excel in our case.

Image

Now we need to choose the source of data-sheet number where data resides. Row header option needs to be chosen to indicate first row has the names of the columns. Image

Important element of matching is defining the domain – which is essentially defining data about the data. In this case metadata about the column we are going to examine. It involves giving the name, datatype at minimum (data/integer/decimal/string). For name columns we are going to use string data with language as “other”. This ensures Hindi etc can be compared.

Image

Once the domain is defined we need to map the columns to the domains which is pretty straight forward process.

Image

Now we need to define the rules for matching. In our case we are going to compare the rows based on 3 attributes(card holder name ,father/husband name, mother name). One can do exact search but we will choose the similarity based lookup. We also need to give weight to the column which will be used to dominate the comparison. We have chosen 70,20,10 across cardholder,father,mother names.

Image

After this is simple start or profiling.

Once you look at the matched results – there is Mr. Kailash who has same sounding father in two villages  – potentially a duplicate.

Image

Then we have exact matches

Image

The profiler can show the need of data cleansing/messaging. Honestly every de-duplication exercise should first undergo normalization/cleansing process followed by physical verification process. As data can indicate only so much. It is also better to have more columns to compare to provide granular control.

Nearly 40% data for mother’s name is missing as is card # :(.Aadhar can simplify the enrollment by cleaning up the duplicates from here and comparing it with others – unfortunately they wanted “completely new” enrollment etc.

Image

Background - 

Dedupliction and cleansing is required while creating DW or just cleaning the data. Imagine creating a customer 360 view where data for customer is coming from multiple systems. Unfortunately as it can happen information about cusotmer in different systems can be represented in slightly different way. One needs to cleanse this data and de-duplicate before using it to make decision.

String comparision functions in SQL Server -
SQL Server has = operator which can help to compare similar values. Like operator goes a step further to match approximately similar strings but misspellings are not its strength. Soundex/Difference play a role to compare phonetically two strings. Master Data Services adds Similarity function. This function is unique in the sense that it adds Levenshtein edit distance, Jaccard similarity coefficient,Jaro-Winkler distance, Longest common subsequence algorithms.

Jaro-Winkler is usually the best algorithm but should be verified with own data.

Hopefully in future DQS/MDS/Phonetic pieces are integrated for simple usage.

Use Case Scenarios
Banks/Insurance agencies, Government programs (PDS/election card) all have data which needs cleansing and scalable de-duplication services. Sometimes data entry errors like reversal of first & last names results in not matching of names.
Generally data-duplication involves little bit more complexity – where other attributes like age, location address etc also play a role. If age is within months vs years is an expensive query. Location addresses have an issue where addresses are not standardized like US.

Name matching by itself can be applied to detect frauds or law-enforcement or counter-terrorism.

It is a hard problem because of
-misspelling, variations, cross language bias, nicknames, name ommission (english middle names are ommited, Spain-maternal name can before first name), Data Entry errors.

Comparison of methods

http://secondstring.sourceforge.net/doc/iiweb03.pdf

Advanced methods

Many of the algorithms/techniques come from BioMedical world – especially gene-sequencing-matching world. Most of them focus on gaps/transitions. One of them is  Smith waterson  Gotoh – http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/gap-penalty-gotoh.pdf. For example – Sequence similarity search – http://biochem218.stanford.edu/07Rapid.pdf , or http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/lecture3.pdf

Good Reference – When you have lot of time – A guided tour to approximate string matching – Navarro (ACM paper)

Phonetic methods apart from soundex are
NYSIIS
Phonix
Editex
Metaphone
Double Metaphone
Phonetex

Not all of them are applicable to all languages and available in all databases.

SSIS component for use

Lookout for MelissaData which does matching on multiple columns, works out of SSIS and has various algorithms implemented as part of it.

OSS

SecondString.  (sourceforge)

Desired Features

Need for reports/API
- DQS needs to support offline, api based interaction to generate reports. This will not require login into the application.

Need to have more transparency about algorithms/pluggability for different languages
- MDS algorithm should be combined here and choice should be available for choosing the relevent algorithm. Similarly stemming algorithms should be pluggable and available for “choice”. More phonetic algorithms should be added.

Performance guide

-http://www.microsoft.com/download/en/details.aspx?id=29075

One stop guide for DQS

-http://technet.microsoft.com/en-us/sqlserver/hh780961

 

 

About these ads

5 thoughts on “Data Quality Services(DQS) – deduplication/matching Hindi data with SQL Server

  1. Pingback: SQL SERVER – Installing Data Quality Services (DQS) on SQL Server 2012 « SQL Server Journey with SQL Authority

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s