Over the last 6+ years we have worked with various folks who wanted to learn more from data. This has been more of a learning experience for us.
1. Subsidized items beneficiaries – This is a very big initiative with potential for pilferage and multiple entitlements. We focused on multiple entitlements, using the available digital information. Issues we came across:
– missing addresses.
– straightforward same household address
– wrong/unverified addresses with missing documents
– same person's name spelled differently, with related-person information also spelled slightly differently
– having a “presence” across multiple locations far apart
– missing biometric information where it was required
– corrupted biometric data
– missing “supporting” documents
Most of the dubious-address and missing/questionable-document issues indicate
problems at various levels (acceptance, ingestion, approval).
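The "same name spelled slightly differently at the same address" case can be sketched with Python's standard difflib. This is only an illustration, not the exact process we used: the record fields (id, name, address), the sample rows and the 0.85 similarity threshold are all assumptions made up for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(text):
    # Lowercase and collapse whitespace so trivial differences do not matter.
    return " ".join(text.lower().split())

def similar_names(a, b, threshold=0.85):
    # ratio() gives a similarity between 0 and 1; the threshold is an assumption.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def possible_duplicates(records, threshold=0.85):
    """Return pairs of ids that share an address and have near-identical names."""
    by_address = {}
    for rec in records:
        by_address.setdefault(normalize(rec["address"]), []).append(rec)
    pairs = []
    for group in by_address.values():
        for r1, r2 in combinations(group, 2):
            if similar_names(r1["name"], r2["name"], threshold):
                pairs.append((r1["id"], r2["id"]))
    return pairs

# Made-up sample records (hypothetical field names).
records = [
    {"id": 1, "name": "Ramesh Kumar", "address": "12 Temple Road"},
    {"id": 2, "name": "Ramesh Kumaar", "address": "12 temple road"},
    {"id": 3, "name": "Sunita Devi", "address": "12 Temple Road"},
    {"id": 4, "name": "Ramesh Kumar", "address": "7 Market Street"},
]
duplicate_pairs = possible_duplicates(records)
print(duplicate_pairs)
```

Pairs flagged this way are only candidates for manual review; a lower threshold catches more spelling variants but also more genuinely different people.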
2. Subsidized healthcare data
This enables people to take care of critical health issues in subsidized
We found a lot of obvious data issues:
– plastic surgery repeated for different body parts for same folks over
– people delivering kids within a short period
– certain districts doing lot more claims overall for surgeries(u, burns,
– stay in ICU for neuro but medicines for something else
– stay for Whipple's (oncology surgery) of any kind, increased mastectomy of any kind without district data showing an increase. Maybe it is just a coincidence.
– Ureteric Reimplantations, Paediatric Acute Intestinal Obstruction larger
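A check like "deliveries within a short period" comes down to sorting each beneficiary's claims by date and looking at the gaps. The sketch below assumes hypothetical fields (beneficiary_id, procedure, date), made-up claims, and a 270-day minimum gap chosen purely for illustration.

```python
from collections import defaultdict
from datetime import date

def short_gap_repeats(claims, procedure, min_gap_days):
    """Map beneficiary id -> smallest gap (days) between repeat claims,
    keeping only gaps shorter than min_gap_days."""
    dates_by_person = defaultdict(list)
    for claim in claims:
        if claim["procedure"] == procedure:
            dates_by_person[claim["beneficiary_id"]].append(claim["date"])
    flagged = {}
    for person, dates in dates_by_person.items():
        dates.sort()
        # Gaps between consecutive claims of the same person.
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        short = [g for g in gaps if g < min_gap_days]
        if short:
            flagged[person] = min(short)
    return flagged

# Made-up claims (hypothetical field names and values).
claims = [
    {"beneficiary_id": "B1", "procedure": "delivery", "date": date(2015, 1, 10)},
    {"beneficiary_id": "B1", "procedure": "delivery", "date": date(2015, 5, 20)},
    {"beneficiary_id": "B2", "procedure": "delivery", "date": date(2014, 3, 1)},
    {"beneficiary_id": "B2", "procedure": "delivery", "date": date(2016, 2, 1)},
    {"beneficiary_id": "B3", "procedure": "burns", "date": date(2015, 1, 1)},
]
flagged = short_gap_repeats(claims, "delivery", min_gap_days=270)
print(flagged)
```

The same function works for any procedure name; only the threshold needs domain input.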
3. Elector Data
Challenges here range from missing supporting data to duplicate information.
The duplicates, or just the findings in general, were very interesting:
– people living in temples (sadhus are apparently exempt), schools
– multiple families living across various parts of state (labor on move)
– people thinking multiple voter-ID cards help to take advantage of some
govt schemes like ration/subsidized food, or just as a backup in case one is
– woman married to 4 people …(possible in certain tribal locations)
– people with various versions of their name (first, name, family) at the same
address, with a little variation in age thrown in too
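The last pattern (name parts reordered or dropped, same address, age off by a little) can be caught with a fairly crude comparison. This is a minimal sketch under assumptions: the field names (id, name, address, age), the sample voters and the 2-year age tolerance are invented for illustration.

```python
from itertools import combinations

def name_tokens(name):
    # Treat a name as a set of lowercase tokens, so word order does not matter.
    return frozenset(name.lower().replace(".", " ").split())

def likely_same_voter(a, b, max_age_diff=2):
    same_address = a["address"].strip().lower() == b["address"].strip().lower()
    ta, tb = name_tokens(a["name"]), name_tokens(b["name"])
    # One token set containing the other covers a dropped middle or family name.
    names_overlap = bool(ta and tb) and (ta <= tb or tb <= ta)
    close_age = abs(a["age"] - b["age"]) <= max_age_diff
    return same_address and names_overlap and close_age

def candidate_duplicates(voters, max_age_diff=2):
    return [
        (a["id"], b["id"])
        for a, b in combinations(voters, 2)
        if likely_same_voter(a, b, max_age_diff)
    ]

# Made-up voter rows (hypothetical field names).
voters = [
    {"id": "V1", "name": "Lakshmi Narayan Rao", "address": "4 School Lane", "age": 41},
    {"id": "V2", "name": "Rao Lakshmi", "address": "4 school lane", "age": 42},
    {"id": "V3", "name": "Lakshmi Rao", "address": "4 School Lane", "age": 67},
    {"id": "V4", "name": "Anil Rao", "address": "4 School Lane", "age": 41},
]
voter_pairs = candidate_duplicates(voters)
print(voter_pairs)
```

The age tolerance keeps a parent and child with similar names from being flagged, at the cost of missing records where the age is badly wrong.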
4. Non performing assets in lending firms
This sort of bubbled up when the core banking effort took place and a lot of
database “constraints” had to be loosened up to enable uploading in some
– This is reflected in a lot of accounts with very little substantiating documentation,
which turned into NPAs over time.
– Especially bad for the co-operative agencies where governance is very
This was the one case where we used simple classification/clustering
mechanisms to simplify our analysis.
5. Rental cab agency
This was unique in terms of “cost” control measures. One particular trip
always consumed more fuel than normal transport. It was
found that cab drivers congregate outside the expensive parking to avoid paying
for it, and thus end up using more fuel to come in and pick up customers.
Certain locations/times always had bad feedback in terms of
response, the reason being drivers located far away with cheaper/no parking
or having food/rest in a cheaper location.
At times I would have loved to throw data at a black box which could throw
back questions and beautiful answers. Honestly, more time was spent in
getting data, cleaning it, and re-entering missing data (e.g. surgery description different
from type). Later on, simple grouping/sum/avg/median (stats) kind of
exploration threw up a lot of the information that we found.
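To give a flavour of that grouping/sum/avg/median exploration, here is a small sketch using only the Python standard library. The district and amount fields, the sample rows, and the "more than twice the median" rule are assumptions for illustration, not the actual data or thresholds.

```python
from collections import defaultdict
from statistics import mean, median

def summarize(rows, group_field, value_field):
    """Group rows by one field and compute count/sum/avg/median of another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {
        key: {"count": len(v), "sum": sum(v), "avg": mean(v), "median": median(v)}
        for key, v in groups.items()
    }

def groups_above_median(summary, stat="sum", factor=2.0):
    # Flag groups whose statistic exceeds `factor` times the median across groups.
    typical = median(s[stat] for s in summary.values())
    return sorted(k for k, s in summary.items() if s[stat] > factor * typical)

# Made-up claim rows (hypothetical field names and amounts).
rows = [
    {"district": "A", "amount": 100}, {"district": "A", "amount": 300},
    {"district": "B", "amount": 200},
    {"district": "C", "amount": 150}, {"district": "C", "amount": 250},
    {"district": "D", "amount": 900}, {"district": "D", "amount": 700},
    {"district": "D", "amount": 800},
]
summary = summarize(rows, "district", "amount")
outliers = groups_above_median(summary, stat="sum", factor=2.0)
print(outliers)
```

Comparing against the median rather than the average keeps one extreme group from hiding itself by pulling the baseline up.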