Data Quality Services(DQS) – deduplication/matching Hindi data with SQL Server
Posted: May 16, 2012 Filed under: Database | Tags: de-duplication, DQS, hindi, matching, Unicode Leave a comment »This is an attempt to do 101 of 101 (Vinod and pinal know the reason why we have to do this
) . I have been interested in de-duplication for longest time and wished DQS was released earlier so that with 2012 they could have had “fresh” start.
In any case here we are looking at data from http://fcs.up.nic.in/upfood/helpline/ReportRegidWise.aspx. Idea is to find potential duplicates using set of attributes like card holder name, father/husband name, mother name. This data is represented in excel here. (data could be present in SQL too).
To do matching a dqs project of sql server (matching policy) needs to be created. 
We need to choose the data source which is the excel in our case.
Now we need to choose the source of data-sheet number where data resides. Row header option needs to be chosen to indicate first row has the names of the columns. 
Important element of matching is defining the domain – which is essentially defining data about the data. In this case metadata about the column we are going to examine. It involves giving the name, datatype at minimum (data/integer/decimal/string). For name columns we are going to use string data with language as “other”. This ensures Hindi etc can be compared.
Once the domain is defined we need to map the columns to the domains which is pretty straight forward process.
Now we need to define the rules for matching. In our case we are going to compare the rows based on 3 attributes(card holder name ,father/husband name, mother name). One can do exact search but we will choose the similarity based lookup. We also need to give weight to the column which will be used to dominate the comparison. We have chosen 70,20,10 across cardholder,father,mother names.
After this is simple start or profiling.
Once you look at the matched results – there is Mr. Kailash who has same sounding father in two villages – potentially a duplicate.
Then we have exact matches
The profiler can show the need of data cleansing/messaging. Honestly every de-duplication exercise should first undergo normalization/cleansing process followed by physical verification process. As data can indicate only so much. It is also better to have more columns to compare to provide granular control.
Nearly 40% data for mother’s name is missing as is card #
.Aadhar can simplify the enrollment by cleaning up the duplicates from here and comparing it with others – unfortunately they wanted “completely new” enrollment etc.
Background -
Dedupliction and cleansing is required while creating DW or just cleaning the data. Imagine creating a customer 360 view where data for customer is coming from multiple systems. Unfortunately as it can happen information about cusotmer in different systems can be represented in slightly different way. One needs to cleanse this data and de-duplicate before using it to make decision.
String comparision functions in SQL Server -
SQL Server has = operator which can help to compare similar values. Like operator goes a step further to match approximately similar strings but misspellings are not its strength. Soundex/Difference play a role to compare phonetically two strings. Master Data Services adds Similarity function. This function is unique in the sense that it adds Levenshtein edit distance, Jaccard similarity coefficient,Jaro-Winkler distance, Longest common subsequence algorithms.
Jaro-Winkler is usually the best algorithm but should be verified with own data.
Hopefully in future DQS/MDS/Phonetic pieces are integrated for simple usage.
Use Case Scenarios
Banks/Insurance agencies, Government programs (PDS/election card) all have data which needs cleansing and scalable de-duplication services. Sometimes data entry errors like reversal of first & last names results in not matching of names.
Generally data-duplication involves little bit more complexity – where other attributes like age, location address etc also play a role. If age is within months vs years is an expensive query. Location addresses have an issue where addresses are not standardized like US.
Name matching by itself can be applied to detect frauds or law-enforcement or counter-terrorism.
It is a hard problem because of
-misspelling, variations, cross language bias, nicknames, name ommission (english middle names are ommited, Spain-maternal name can before first name), Data Entry errors.
Comparison of methods
http://secondstring.sourceforge.net/doc/iiweb03.pdf
Advanced methods
Many of the algorithms/techniques come from BioMedical world – especially gene-sequencing-matching world. Most of them focus on gaps/transitions. One of them is Smith waterson Gotoh – http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/gap-penalty-gotoh.pdf. For example – Sequence similarity search – http://biochem218.stanford.edu/07Rapid.pdf , or http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/lecture3.pdf
Good Reference – When you have lot of time – A guided tour to approximate string matching – Navarro (ACM paper)
Phonetic methods apart from soundex are
NYSIIS
Phonix
Editex
Metaphone
Double Metaphone
Phonetex
Not all of them are applicable to all languages and available in all databases.
SSIS component for use
Lookout for MelissaData which does matching on multiple columns, works out of SSIS and has various algorithms implemented as part of it.
OSS
SecondString. (sourceforge)
Desired Features
Need for reports/API
- DQS needs to support offline, api based interaction to generate reports. This will not require login into the application.
Need to have more transparency about algorithms/pluggability for different languages
- MDS algorithm should be combined here and choice should be available for choosing the relevent algorithm. Similarly stemming algorithms should be pluggable and available for “choice”. More phonetic algorithms should be added.
-http://www.microsoft.com/download/en/details.aspx?id=29075
-http://technet.microsoft.com/en-us/sqlserver/hh780961
So you are a data scientist?
Posted: April 9, 2012 Filed under: Database | Tags: Actuary analysis, Data Scientist, R Leave a comment »This new term has become the aspirational job title. Person whom I would call as Data scientist is - http://blog.revolutionanalytics.com/2012/03/see-how-deloitte-uses-r-for-actuarial-analysis.html
I have no delusions – I am just a worker bee ready to learn.
Session summary at TechED 2012 – NoSql(Non Relational Store) for relational person
Posted: March 29, 2012 Filed under: Database | Tags: Document stores, Key value, Non-Relational Store, NoSql Leave a comment »Like every year I opted for session delivery. One of the session was picked up at last minute and the organizers (Harish/Saranya) approved it immediately. Unfortunately I was little late, anyway scrambled a bit and got the deck and demos up. But I did not pray to demo Gods. Right from display to connectivity to remote server everything played up – this after checking the display 2 days ago and connectivity everything working till last minute. I had backup connection too.
Big learning
Backup for network is another network, but backup for machine is needed. If possible locally. Just don’t remote into different server machines for demo – however comfortable it is at other times.
So anyway this session is 101 for people who are comfortable with relational databases and want to understand why/when/what to use in their scenarios. I chose Redis, Riak, MongoDB, Azure storage/SQL Azure to showcase for 2 min exploration each as this could not have been tutorial for them. I did not have time to explore the MPP/Columnar stores or get deep into how, idea was to convey – why/when and the possible impact. I also did not get into amazon store(s) or coherence, inmemory db etc.
I chose Redis – as it brings familiarity of memcached/velocity, Riak because it is sort of everything from kv,document to search store and ability to add/delete nodes is simple/powerful. I chose MongoDB to just de-mystify the storage – access via indexes/mr of javascript.
Whole idea was to help folks contrast these stores with Relational store from pov of what they are comfortable with (indexes, joins, acid, schema,
monitoring, management). One of the simplest way to skin is to ask question about range queries, index/updates.
Here is the link to the presentation - http://speakerdeck.com/u/govind/p/non-relational-storage-for-relational-person
SSIS continues to SUCK
Posted: December 23, 2008 Filed under: Database | Tags: i hate ssis, ssis sucks Leave a comment »IT forgets the settings on the datasources. It forgets just a sec back it could access the database without any problem. It forgets to remove old metadata. It is the messiest tool ever.
I worked with it on SQL 2005 – it matched the absurdity then, It continues to scale new heights even now in 2008. It is not worth spending time.
I am going back to bcp. It is a pain but I at least get reasonable answers.
On Windows 2008 – it requires you to execute the package as admin as it can’t access perfmon counters(no don’t ask which ones). The execute package utility is sham – it can’t remember it needs to set the /X86 option if it sees the jet oledb provider if either the source or target is on 64 bit machine. Now don’t even get me started on 64 bit provider option – it is a shame that most of the people get/accumulate data through mdb, but 64 bit provider does not exist.
The errors are thrown are gem – CANTACQUIRE…connection – but prey few mins ago – you did without sweat.
Accessing Jet DB on 64 bit windows – “Microsoft.jet.OLEDB .4.0 provider is not registered on the local machine”
Posted: November 29, 2008 Filed under: Database | Tags: 64 bit, dac 6, jet database, mdac 2.8, SSIS, wow, wow64, x64 5 Comments »Update(5th Feb 2010 19 Aug 2011) - There is beta driver download available Jet OLEDB provider – Microsoft Access Database Engine 2010 Redistributable – http://www.microsoft.com/downloads/details.aspx?familyid=C06B8369-60DD-4B64-A44B-84B371EDE16D&displaylang=en
Official statement(5th Feb 2010) – http://blogs.msdn.com/psssql/archive/2010/01/21/how-to-get-a-x64-version-of-jet.aspx
Scenarios
1. 64 bit SSIS/ETL tool package trying to access Access/Jet database natively using 64 bit drivers.
2. 64 bit native applications (forms/console apps) natively accessing the database.
3. 64 bit native open source frameworks (rails/python based stuff) trying to access the database through regular ado/odbc provider/driver. (lookup the favourite support dl – sorry I do not have answers to adodbapi, odbcmx etc etc…)
4. For pure asp.net applications follow advice of Ken tucker (MVP).
Workaround : One needs to create/use 32 bit application in wow64 mode to ensure 32 bit jet oledb provider can be used. In simple words – just change the compile target to x86 in Visual studio or make file.
Project Settings ->Build ->Platform target -> change the target to x86.(you can accomplish the same with command line too with /platform option)
[update]To deal with SSIS’s rude behavior on 64 bit – check out the /X86 option to execute package on commandline. Also ensure Run64BitRuntime is set to false in the the metadata.
Brute force alternative (Update) – Use Corflags.exe as last option in dire circumstances for “just trying it out”. This only works for .net images/assemblies. No hex editing please
, not supported.
Reason :
There is no 64 bit version of jet driver (hopefully they do as most of the machines will be 64 bit, it will look pretty odd to have to use this trampoline and not being able to do true migration – not everybody will like to migrate to sqlexpress).
Detailed – x64/x86/Wow explaination is available here.
Official word(well blogs are as official as they come) on mdac roadmap.
Update(5th Feb 2010) - There is beta driver download available – http://www.microsoft.com/downloads/details.aspx?familyid=C06B8369-60DD-4B64-A44B-84B371EDE16D&displaylang=en
Official statement(5th Feb 2010) – http://blogs.msdn.com/psssql/archive/2010/01/21/how-to-get-a-x64-version-of-jet.aspx
Microsoft Access Database Engine 2010 Redistributable
OpenID, OAuth and securing the citadel
Posted: January 9, 2008 Filed under: Database | Tags: Dataportability, OAuth, openId, protocol, security 1 Comment »Will web become closed in future where there will no way to access unauthenticated data alltogether? Sort of driving license for the of internet? All the swarm of protocols/agreements like OpenID, Oauth, Dataportability , apml will they not drive the content generators to enforce a “id” which can be tracked and information used everywhere?
Nope I do not have problems with all these nanoformats trying to get semantics of the “stuff on the page” , I am just afraid of the “market forces which will ensure access to information unauthenticated is not allowed”, I am just worried about access to the data/information without the need to identify yourself ever.
Update- I got some comment from lawyer type(? – did not leave email address/blog) – that Citadel is “Popular” open source email and groupware platform and it now supports OpenID authentication.
To be fair – I had never heard of citadel before ! and the context too was completely different. I love these subtle guerrilla ploys. This adds spice in ironic way to the current post.
When is ISCII better than UNICODE and vice versa
Posted: November 29, 2007 Filed under: Database, Multilanguage | Tags: ISCII, Multilanguage, Unicode Leave a comment »ISCII is great to store names(people/location etc) which do not vary across languages. Consider a 10 Million names database storing names of people which need to be picked up during reports across different languages. One row storage of ISCII can take care of the names and transliteration provided by .Net encoding classes (similar effort can be applied to Unicode too but without lot of success) help display in various indic languages. In case Unicode encoding you will need to store a language specific name( this too could be useful if you are hell bent on correcting names/matras to suit local language) -thus multiplying storage cost.
The cost of storing ISCII is offset by need for translation into Unicode for display(IE is quite ahead in terms of display of unicode data with appropriate font)/capture (with help of INSCRIPT or local variant of phonetic or web based entry).
Indexing/sorting – language specific sorting can be different (Tamil is very different from other languages). A topic for another post alltogether.
Understanding Indian Multilingual computing
Posted: November 12, 2007 Filed under: Database, Runtime | Tags: Indian Language and computer, Indic, Multilingual, Transliteration Leave a comment »After my colleague and respected friend Deepak Gulati implemented the transliteration from Kannada to english and back for one of the projects, my interest in understanding the challenges increased beyond the windows api coding of locale specific world and initial palindrome check for multibyte characters
.
I found a great resource via IITM- Achraya explaining the same with context of Indian languages.
In the present project I also gained new respect for southern languages which do not add to the confusion like Hindi’s devanagari or English does. Phonetic base of these languages helps in correct granular correct pronounciation and representation using a good script (brahmi origin).
Words are just C,V, CV,CCV or in extreme cases CCCV where C stands for consonant and V for vowel. Challenge lies in way the vowel is combined with consonant to provide that unique syllable representing the sound. Challenge with Devanagari representation of words like राष्ट्रिय, आत्मविश्वास,विश्व are pretty difficult to get right in the head, we just remember it by rote and as we did in childhood make fun of people south of Vindhyas for not getting matra, ling etc right. I was spelling nazi (remember soup nazi) and supporter of Hindi as the national language. But lately – must say pretty lately I have come to appreciate the Tamil language (why there is no need for the people residing there to learn hindi – why that is false patriotism, why it stokes the “rule” by the north and resistance feeling, heard even ravan killing is sort of white north wnning over the dark dravidian) and its bretheren. With help of Nudi keyboard which is phonetic in nature – I am finding it easier to learn/type spoken kannada. The default keyboard is inscript which is common for Indian languages but then it loses the nuances of each language.
MS has taken steps over number of years to support multiple languages except providing local language OS
– now I can appreciate it little better as it would be tough to get it right and then maintain it.
there are 4 parts
Entering ( keyboard, mapping to qwerty)
Displaying (font -glyph)
Storage (encoding – how many bytes to store a syllable)
working with data (sorting/searching/frequency count etc).
With this new information I have new found respect for Sanskrit which can pack so much of information in such short shlokas/stotras.







