No, Not Data Science, Just Data Analysis

Over the last 6+ years we have worked with various folks who wanted to learn more from their data. It has been just as much of a learning experience for us.

1. Subsidized items beneficiaries – This is a very big initiative with potential for pilfering and multiple entitlements. We focused on detecting multiple entitlements using the available digital information. Issues we found:
– missing addresses.
– straightforward duplicates at the same household address
– wrong/unverified addresses with missing documents
– the same person's name spelled differently, and related persons' information spelled slightly differently
– a “presence” across multiple locations far apart
– missing biometric information where it was required
– corrupted biometric data
– missing “supporting” documents

Most of the issues with dubious addresses and missing/questionable documents indicate problems at various levels (acceptance, ingestion, approval).

2. Subsidized healthcare data
This enables people to take care of critical health issues in a subsidized fashion.
We found a lot of obvious data issues:
– plastic surgery repeated on different body parts for the same people over the years
– patients delivering babies within implausibly short intervals
– certain districts filing a lot more claims overall for certain surgeries (u, burns, additional stents)
– ICU stays for neuro cases but medicines billed for something else
– stays for Whipple procedures (an oncology surgery) of any kind, and increased mastectomies of any kind, without district data showing an increase. Maybe it is just a coincidence.
– ureteric reimplantations and paediatric acute intestinal obstruction counts larger than in other districts


3. Elector Data
Challenges here range from missing supporting data to duplicate information.
The duplicates, and the findings in general, were very interesting:
– people registered as living in temples (sadhus are apparently exempt) and schools
– the same families registered across various parts of the state (labour on the move)
– people thinking multiple voter-ID cards help them take advantage of government schemes like ration/subsidized food, or just keeping one as a backup in case the other is lost
– a woman married to 4 people … (possible in certain tribal locations)
– people with various versions of their name (first name, family name) at the same address, with a little variation in age thrown in too

4. Non-performing assets in lending firms
This sort of bubbled up when the core-banking effort took place and a lot of database “constraints” had to be loosened to enable data uploads in some places.
– This shows up as a lot of accounts with very little substantiating documentation, which turn into NPAs over time.
– Especially bad for the co-operative agencies, where governance is weak.

This was the one case where we used simple classification/clustering mechanisms to simplify our analysis; a rough sketch of the idea follows.
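
A minimal sketch of what we mean by simple clustering, assuming a couple of hypothetical account features (documentation score, days overdue) rather than the actual fields we used:

import numpy as np
from sklearn.cluster import KMeans

# columns: [documentation_score, days_overdue] – illustrative features only
accounts = np.array([
    [0.9, 0], [0.8, 5], [0.2, 120], [0.1, 200], [0.3, 90], [0.85, 10],
])

# two rough groups: well-documented/current vs poorly documented/overdue
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(accounts)
print(labels)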

5. Rental cab agency
This was unique in terms of “cost” control measures. One particular trip always consumed more fuel than comparable routes. It turned out cab drivers congregate outside the expensive parking area to avoid paying for it, and thus end up using more fuel driving in to pick up customers.
Certain locations/times also consistently had bad feedback in terms of response time; the reason was drivers waiting far away where parking was cheaper or free, or having food/rest in a cheaper location.

At times I would have loved to throw the data at a black box that could throw back questions and beautiful answers. Honestly, more time was spent getting the data, cleaning it, and re-entering missing data (e.g. a surgery description differing from the surgery type). Later on, simple grouping/sum/avg/median (stats) kinds of exploration threw up most of the findings above.
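
As a rough illustration of that style of exploration in pandas (the file and column names below are hypothetical, not the actual schema):

import pandas as pd

claims = pd.read_csv("claims.csv")  # hypothetical extract

# count/sum/mean/median per district and surgery type – outliers stand out quickly
summary = (
    claims.groupby(["district", "surgery_type"])["claim_amount"]
          .agg(["count", "sum", "mean", "median"])
          .sort_values("count", ascending=False)
)
print(summary.head(20))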


The other “requirements” of the managed datastores in cloud

We (@Vinod, the author of extremexperts, and I) have supported migrations to managed SQL Azure stores for quite some time. Customers like the ease of manageability, the availability and the decent performance.

There is another class of customers who keep getting pushed to “consolidate” databases and manage them against SLAs (DR/HA, go-back-in-time-x backups, performance). These databases are not in TBs; they range from a few GBs to 100s of GBs.

1. There is a need for synchronization with on-premise databases and, gasp, sometimes it needs to be bidirectional.

2. There is a need to meet security SLAs by providing auditing views and encryption.

The promise of the cloud, easing management and availability, also needs to cover these scenarios. Hopefully we will get these in the future.

Nginx on Azure

Nginx works on Azure with absolutely no issues. It has very vast capabilities; I came to know of a few of them only when a customer requested that discussion.

1. Ability to control request processing – The customer wanted to throttle the number of requests coming from a particular IP address. This was easily done with the limit_req module directives. They allow an easy definition of the throttling behaviour and of what to do when limits are reached or crossed. Such requests are logged, and a specific HTTP error code can be returned (503 is enough). The module also stores the state of currently excessive requests. Another learning was to use $binary_remote_addr to pack a little bit more into the zone, though it does make the state harder to decipher in a simple way. So in the http block:

# 10 MB shared zone keyed by the client address, allowing 5 requests/second per address
limit_req_zone $binary_remote_addr zone=searchz:10m rate=5r/s;

followed by the location blocks for the endpoints which need this (login/search):

# apply the searchz zone to the search endpoint
location = /search.html { limit_req zone=searchz nodelay; }

This protects very nicely against HTTP-level abuse but does not protect against ping floods and the other ways people can DDoS your application; that is best prevented/controlled in some kind of appliance (hardware) or at least with iptables, though that is a different subject altogether. There is another directive worth calling out.

2. Splitting clients for testing – This too is very easily done in the configuration with the split_clients directive. It can also be used to set specific query-string parameters very easily. A rough sketch of the underlying idea is below.
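
For intuition, split_clients hashes a variable (such as the client address) and assigns it to a percentage bucket. A rough Python sketch of that idea, not nginx's actual implementation, could look like this:

import hashlib

def bucket(client_addr: str) -> str:
    # hash the address into 0-99 and split by percentage;
    # nginx's split_clients uses its own hash internally, this is just the idea
    h = int(hashlib.md5(client_addr.encode()).hexdigest(), 16) % 100
    return "variant_b" if h < 20 else "variant_a"  # ~20% / ~80% split

print(bucket("203.0.113.7"))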

Yes, there are dedicated services/apps to achieve the same functionality, but it is wonderful to learn every day. Customers/partners are king and, honestly, great teachers.

NoSql Stores and aggregations

In the normal DB world we are comfortable running queries that create statistics (sum/min/max/avg) and some percentiles. To achieve this efficiently we climb into the pre-aggregated world of OLAP across dimensions. If we need some kind of ranges/histograms/filters, we try to anticipate them, provide UI elements, and again push queries down to the datastore. With the bunch of in-memory columnar stores we try to do them on the fly, and with MPP systems we are comfortable doing it on demand.

Over time the need has come up to create aggregations in a declarative manner.

The aggregations approach in ElasticSearch is a pretty nice addition, and ES computes them across the cluster.

In a way, yes, ES aggregations are still search-style queries, but the declarative model makes them easier to fathom.

The top_hits aggregation is coming in 1.3; parent/child support is not there yet.

It is definitely a little bit more than facets, as the composability is key; a minimal example is below.
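
A minimal sketch of such a composable (nested) aggregation through the Python client, assuming a hypothetical claims index with district and claim_amount fields:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

query = {
    "size": 0,  # only the aggregations, no hits
    "aggs": {
        "by_district": {
            "terms": {"field": "district"},
            "aggs": {  # sub-aggregation nested inside each terms bucket
                "avg_claim": {"avg": {"field": "claim_amount"}}
            }
        }
    }
}

result = es.search(index="claims", body=query)
for bucket in result["aggregations"]["by_district"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["avg_claim"]["value"])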

With Cassandra you do the extra work while writing, but generally it is not composable and can only do numeric functions. This requirement is tracked at a high level here. A rough sketch of the write-time approach is below.
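
A rough sketch of that write-time approach using a counter table via the Python driver; the keyspace, table and column names are made up:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")  # hypothetical keyspace

# counter table keeps a pre-aggregated per-district, per-day total
session.execute("""
    CREATE TABLE IF NOT EXISTS claim_counts (
        district text,
        day text,
        total counter,
        PRIMARY KEY ((district), day)
    )
""")

# every incoming record increments the rollup at write time
session.execute(
    "UPDATE claim_counts SET total = total + 1 WHERE district = %s AND day = %s",
    ("district-42", "2014-07-01"),
)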

Database engines sort of gave away this field by quoting standards and performance/consistency issues rather than creating a decent enough declarative mechanism. 90% of queries in the DB world would become simpler. Materialized views were the last great innovation and we stopped there. Product designers/engineers should be made to look at the queries people need to write to get simple things done.


Data Ingestion and Store stories

In about the last 6 months we have had the good fortune to understand/implement 6 solutions for customers who need fast ingestion and then some kind of analytics on top of it. This is a gist of those interactions: what worked, what did not fly, and the workarounds.

These solutions pushed us to explore things not available out of the box on the platform. We were exposed to:
– 3 customers designing monitor/analyze/predict solutions. They had existing in-house platforms, but those require local storage, and changes involve changing the software/hardware. 2 of them did not even have “automated monitoring”: a person would go and “note down” the reading on paper or a smartphone web app, and this would then be aggregated and stored in a central location.
– All of them wanted to move to the public cloud, except one who wanted something they could also deploy on-premise.

Domains
– Electricity
– Pharma manufacturing
– Healthcare
– Chemical/heavy metal manufacturing

Data sources
– Sensors
– Linux/Windows embedded devices collecting and aggregating floor/section/machine-wise data
– Humans entering data

Latency
– almost everything except healthcare varied from 10s of minutes to hours.

Data size
– Since data could get buffered/batched/massaged depending on the situation, payloads were never more than a MB.
– Typically a few hundred KBs.

In-order/exactly-once delivery guarantees?
– Very pragmatic customers: they were OK with defining an acceptable error rate rather than insisting on strict guarantees.

Not even one of them wanted direct “sending” of data to the “store”. They all wanted local/web intermediate processing. This is why the Internet-of-Things picture where protocols are rigid and stores are fixed came as a surprise to us all. The questions that came up:

How to ingest fast
– what should the front end be
– does it make sense to have an intermediate queue

How to scale the store
– “always capture” is the key condition

How to query with low latency
– search/lookup of specific items – for logs/keywords and facets around them
– aggregates/trends
– detailed raw reports
– help during an “outage” or “demand” event – a constant need across electricity/manufacturing
– though the definitions of outage/demand change

What works as a store
Cassandra
– if you think of the read queries beforehand, since they dominate the design (CQL or otherwise); a sketch of a read-oriented layout follows this list
*** all facet-kind of stuff – which is sort of a group-by with no relevancy – depends on how the data is stored
– scales, scales and scales
– reads are pretty good, and many of the “aggregates” which do not require last-millisecond/second resolution can be done by simple running jobs which store these “higher latency” items in another store – k/v or relational – generally cached aggressively (and mostly flushed out to another store after x entries)

– Push out data for other kinds of analysis to your favourite store – HDFS – and absorb it into other places.
The challenge is monitoring (the infrastructure vs. the running of Cassandra and the impact of its parameters) and skill-set upgrades.

– At times customers have stored numeric/other data in Cassandra and pushed unstructured data (stack traces/messages/logs) out to a Lucene derivative – Solr/ElasticSearch. The challenge has been “consistency” at any given point in time, but it generally works.
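
A rough sketch of the read-first table design mentioned in the list above, assuming hypothetical sensor readings partitioned so that the dominant query (recent readings for a sensor on a day) hits a single partition:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("telemetry")  # hypothetical keyspace

# partition by (sensor, day) so the dominant query is a single-partition read,
# with the newest readings first
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        day text,
        ts timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

rows = session.execute(
    "SELECT ts, value FROM readings WHERE sensor_id = %s AND day = %s LIMIT 100",
    ("pump-7", "2014-07-01"),
)
for row in rows:
    print(row.ts, row.value)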

How to ingest/broker
– Web API front ends pushing data into a broker (RabbitMQ/MSMQ/Kafka), mostly chosen based on experience and comfort factor; a minimal sketch of this pattern follows the list
** To try – Akka/Orleans + Storm for near-real-time analytics
** Only one brave soul is still doing Kafka + Storm – painful to manage/monitor
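
A minimal sketch of that front-end-plus-broker pattern with Flask and RabbitMQ (via pika); the endpoint, queue name and payload shape are illustrative, not from any of the actual solutions:

import json

import pika
from flask import Flask, request

app = Flask(__name__)

@app.route("/readings", methods=["POST"])
def ingest():
    payload = request.get_json(force=True)
    # a real service would reuse the connection instead of opening one per request
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="readings", durable=True)
    channel.basic_publish(exchange="", routing_key="readings",
                          body=json.dumps(payload))
    conn.close()
    return "", 202  # accepted; a downstream worker moves it into the store

if __name__ == "__main__":
    app.run(port=8080)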

Need better “Monitoring across the stack” tools.

Multi-tenancy is another issue which blows up due to SKU differentiation, where sometimes data can be shared but updates/patching become an issue.
Data movement in Azure becomes a bigger issue, and we have implemented it as mentioned here.


What do ISVs trying to bring their solutions to the cloud want?

Easy to understand billing model

Make it easy to reason about the billing model, simpler than what “pay per use” exposes. I need to use it every day, and it should just work without surprises. Do not expose the “you looked at me – y$, you asked for that – z$” granularity.

Tell me about your maintenance cycles (please)

For end customers using a solution, downtime communication is essential. Ideally 24x7 operation is required, but we can craft a solution which delivers a minimum viable option at lower cost.

Business relationship
Active go-to-market without very informal partner-network requirements, highlighting where we are and how to move forward. Help us unseat the existing partner brokers who are deadweight, whose whole deployment models are a challenge. That air cover we talked about needs to be about partners, partners, partners. Help break the ice with the CIO's tech team. It is not about x% discounts.

Support
Real support in terms of what does not work, rather than “green my scorecard, so just use it” (shoved down my throat). Own up to support issues and help bring down my costs while increasing your spread. Get folks who understand both business and technology. Let us know what is coming down the line that could potentially turn us into a commodity. Be honest about it.

Win-win
I bring you x$, you provide me 0.20% of x. No, really: make the partnership work in a simple way. Right now this model is challenging, to say the least. At times the build/test/bill model is a challenge to agility. Let us find a way to make adoption faster.

Here is a shout-out to Vijay, who recently joined MongoDB and correctly points out the “lack of a lever” with both customer and seller – there is no complexity. http://andvijaysays.com/2014/03/25/are-we-there-yet-cant-wait-to-start-my-new-adventure/

In a cloud-based setup it is much more stark.

Forecasting – R and PowerBI references

The Power BI team announced a wonderful forecasting feature. It allows beautiful visual exploration of the forecast, hindcast and confidence interval. It also allows detection of seasonality.

You can play/read about it here.

On the other hand, last year we helped customers with forecasting using R and they were pretty happy. R's forecast package was written by Hyndman, who has kindly recorded a great video on it. His CRAN documentation too is very neat and detailed.

The forecast package by itself allows creation of a “period/season”. It allows analysis by different methods and plotting them to show the deviation.

References
Power BI –
For the first time I have seen a team describing the algorithms they have used and how they diverge: over a validation window, compute the sum of squares of prediction errors for the window as a whole, thus dampening the variation. The team is also generous with links, references and best practices.

R – the forecast package. The simple code is on GitHub.

As the fan chart indicates, the error definitely goes up as the prediction “period” becomes larger.

BTW, Excel itself has decent trending abilities in the form of trendlines (LINEST) and, lately, forecasting.

The challenge with exponential smoothing is the choice of base, which is mostly an average or, at worst, gives recent observations more weightage. Holt-Winters' multiplicative exponential smoothing can take care of seasonality and autocorrelation; a quick sketch of it is below.
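
A quick sketch of Holt-Winters multiplicative smoothing in Python with statsmodels, on made-up monthly data (in R the forecast package's hw()/ets() functions play the same role):

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# made-up monthly series with a trend and multiplicative yearly seasonality
months = pd.date_range("2010-01-01", periods=48, freq="MS")
values = (100 + np.arange(48)) * (1 + 0.2 * np.sin(np.arange(48) * 2 * np.pi / 12))
series = pd.Series(values, index=months)

model = ExponentialSmoothing(series, trend="add", seasonal="mul",
                             seasonal_periods=12).fit()
print(model.forecast(12))  # next 12 months; the error widens as the horizon grows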

Gotta do the same exercise as the R one in Power BI one day: I like the ease of use and visualization. R, on the other hand, provides a lot more ways to try things out.