10 things I wish my datastore would do

We generally use datastores to ingest data and try to make some meaning out of it through reports and analytics. Over the years we have had to make decisions about adopting different stores for “different” workloads.

The simplest case is analysis, which we offload to pre-aggregated values in either columnar or distributed engines to scale out over the volume of data. We have also seen the rise of stores whose layout is friendly to range queries, and of others that allow very fast lookups, maturing into doing aggregations on the fly. Then we have the data-structure stores – the hash-table-inspired designs versus the ones which don sophisticated avatars (gossip, vector clocks, Bloom filters, LSM trees).

That other store, the one which pushed compute to storage, is undergoing a massive transformation to adopt streaming and regular OLTP (hopefully), apart from its usual data-reservoir image. Then we have the framework-based plug-and-play systems doing all kinds of sophisticated streaming and other wizardry.

Many of the stores require extensive knowledge of their internals: how data is laid out, techniques for using the right data types, how data should be queried, issues of availability, and turning all of that into decisions which are “understandable” to the business stakeholders. When things go wrong, the tools range from a bare error log to the actual “path of execution” of the query. At present there is a lot of ceremony around capacity management, and around how data changes are logged and pushed to another location. This much detail is a great “permanent job guarantee” but does not add a lot of long-term value for the business.

  1. Take away my schema design issues as much as it can

What do I mean by it? Whether it is the traditional relational databases or the new generation of NoSQL stores, one has to think through either the ingestion pattern or the query pattern to design the store's representation of entities. This is by nature a productivity killer and creates an impedance mismatch between the storage representation and the application's representation of the entities.

  2. Take away my index planning issues

This is another of those areas where a lot of heartburn takes place, as so many innards of the store's implementation are exposed. If this were done completely automagically it would be a great time-saver: just look at the queries and either create the required indexes or drop them. A lot of performance regressions are introduced as small changes accumulate in the application and land at the database level.

  3. Make scale out/up easier

Again, this is exposed to the application designer in terms of which entities should be sharded vertically or horizontally. It ties back to point 1 via the ingestion and query patterns. It makes or breaks the application in terms of performance and has an impact on the evolution of the application.

  4. Make “adoption” easier by using existing declarative mechanisms for interaction. Today one has to learn the store's own way rather than good old DDL/DML, which is at least 90% the same across systems. This induces fatigue for ISVs and larger enterprises, who look at the cost of “migration back and forth”. Declarative mechanisms have a lullaby-like quality that calms the mind, and we indulge in scale-up first followed by scale-out (painful for the application).

Make sure the majority of the client libraries are on par with each other. We may not need something immediately for Rust, but at least ensure that PHP, Java, .NET native and the derived languages have robust enough interfaces.

Make it easier to “extract” my data in case I need to move out. Yes, I know this is the option least likely to have resources spent on it. But it is super-essential and provides the trust for the long term.

Lay out the roadmap in simple terms – where you are moving – so that I do not spend time on activities which will become part of the offering.

Lay out in simple terms where you have seen people run into issues or make wrong choices, and share the workarounds. Transparency is the key. If the store is not a good place for doing the latest “x/y” work, share that and we will move on.

  5. Do not make choosing the hardware a career-limiting move. We all know stores like memory, but persistence is key for trust. SSD/HDD, CPU/core counts, virtualization impact – way too many moving choices to make. Make the 70-90% scenarios simple to decide. I can understand that some workloads require a lot of memory, or only memory, but do not present a swarm of choices. Do not tie down to specific brands of storage or networking which we cannot expect to live to see after a few years.

  6. In the hosted world, pricing has become crazier. Lay out in simple-to-understand terms how costing is done. In a way, licensing by cores/CPU was great because I did not have to think much; I pretty much over-provisioned, or did a performance test and moved on.

  7. Resolve HA/DR in a reasonable manner. Provide a simple guide to understanding the hosted versus host-your-own worlds. Share clearly how clients should connect and fail over. We understand distributed systems are hard; if the store supports the distributed world, help us navigate the impact and the choices in simple layman's terms, or in terms of something we are already aware of.

If there is an impact on consistency, please let us know. Some of us care more about it than others. Eventual consistency is great, but the day I have to say “we are waiting for logs to get applied, so the reports are not factual” is not something I am gung-ho about.

  8. Share clearly how monitoring is done for the infrastructure, in both the hosted and host-your-own cases. Share a template of “monitor these always, and take these z actions” – a literal rulebook, which again makes adoption easier.
  9. Share how data at rest and data in transit can be secured and audited in a simple fashion. For the last piece, even if actions are merely tracked, we will have a simple life.
  10. Share a simple guide for operations and day-to-day maintenance. This will be a life-saver: the x things to look out for, how to do backups and checks, how to verify HA/DR, how to drill down into a performance issue – normally the datahead's responsibility. Do we look out for unbalanced usage of the environment? Is there some resource which is getting squeezed? What should we do in those cases?

Points 1-4 make adoption easier; the later ones help in continued use.

Dentist appointment calendar on paper

My wonderful dentist has had this wonderful form of calendar for a decade. He has moved on to acquiring the latest equipment for surgery and X-ray, but has refused to adopt a simple, effective system like Practo (congratulations on the mention in The Economist) which his other colleagues are using.

The dentist gave me many reasons:

- easy for the receptionist or him to modify with a pencil and eraser

- uses both sides of a very small piece of paper for the day

- easily records time/procedure – also providing inputs for inventory

- green – less power consumption than device/internet charges etc.

- he controls his own data – very crucial for him

HA, resilient, green – and cheap too.

AzureML – Zero to Hero talk at SQL User Group – Bangalore

Update – Presentation (2nd Aug 2014 – SQL UG meetup, Bangalore)

This weekend I have a slot – thanks to @vinodk_sql, @pinaldave, @banerjeeamit and @blakhani – for speaking on AzureML. I will assume most of the folks are there to “find out” what is new, and take them on a journey from that point of view. Hopefully they will leave excited about the tool and its ease of use, and become curious enough to take the journey themselves.

We will cover what AzureML can do by way of examples, and get a basic idea of what is available out of the box (data ingestion, model creation, validation and web publishing – request/response). We will cover at a high level the algorithms for various tasks, the need for data cleansing/feature selection, and the available tools for the same. We will not go deep into R integration or tuning of algorithms (sweeping, active/online learning). We will sidestep the gory details of what each algorithm means, but cover the evaluation metrics which are important for seeing the gains of using an algorithm.

It will be a demo-heavy session, using data from public sites.

Venue:
Microsoft Corporation, Signature Building,
Embassy Golf Links Business Park,
Intermediate Ring Road,
Domlur, Bangalore – 560071

Location and details are here.

And the title of the talk was chosen by Vinod.

No, Not Data Science – Just Data Analysis

Over the last 6+ years we have worked with various folks who wanted to learn more from their data. It has been as much a learning experience for us.

1. Subsidized-items beneficiaries
This is a very big initiative with potential for pilfering and multiple entitlements. We focused on multiple entitlements using the available digital information:
– missing addresses
– straightforward same-household addresses
– wrong/unverified addresses with missing documents
– the same person's name spelled differently, and related persons' information spelled slightly differently
– a “presence” across multiple locations far apart
– missing biometric information where it was required
– corrupted biometric data
– missing “supporting” documents

Most of the issues – dubious addresses and missing/questionable documents – indicate problems at various levels (acceptance, ingestion, approval).

2. Subsidized healthcare data
This enables people to take care of critical health issues in a subsidized fashion. We found a lot of obvious data issues:
– plastic surgery repeated on different body parts for the same folks over the years
– people delivering kids within implausibly short periods
– certain districts making far more claims overall for surgeries (u, burns, additional stents)
– stays in the neuro ICU, but medicines for something else
– stays for Whipples (an oncology surgery) of any kind, and increased mastectomies of any kind, without the district data showing an increase – maybe it is just a coincidence
– ureteric reimplantations and paediatric acute intestinal obstruction numbers larger than others'

3. Elector data
Challenges here ranged from missing supporting data to duplicate information. The duplicates – or just the findings – were very interesting:
– people living in temples (sadhus are apparently exempt) and schools
– multiple families living across various parts of the state (labour on the move)
– people thinking multiple voter-ID cards help them take advantage of some government schemes like ration/subsidized food, or just serve as a backup in case one is lost
– a woman married to 4 people … (possible in certain tribal locations)
– people with various versions of their name (first, middle, family) at the same address, with a little variation of age thrown in too

4. Non-performing assets in lending firms
This sort of bubbled up when the core-banking effort took place and a lot of database “constraints” had to be loosened to enable uploading in some places.
– This reflects in a lot of accounts with very little substantiating documentation, which turn into NPAs over time.
– It is especially bad for the co-operative agencies, where governance is very weak.

This was the one case where we used simple classification/clustering mechanisms to simplify our analysis.

5. Rental cab agency
This one was unique in terms of “cost” control measures. One particular trip always used to consume more fuel than comparable transport. It was found that cab drivers congregate outside the expensive parking to avoid paying for it, and thus end up using more fuel to come in and pick up customers. Certain locations/times also always had bad feedback in terms of response; the reason was drivers located far away, where parking was cheaper or free, or having food/rest in a cheaper location.

At times I would have loved to throw the data at a black box which could throw back questions and beautiful answers. Honestly, more time was spent getting the data, cleaning it, and re-entering missing data (e.g. surgery descriptions differing from surgery types). Later on, simple grouping/sum/avg/median (stats) kinds of exploration threw up a lot of the information that we found.
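That kind of grouping/sum/avg/median exploration can be sketched in a few lines of plain Python; the claims records and their fields below are entirely made up for illustration, not real data from any of the engagements above:

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical claims records: (district, surgery_type, amount) -- illustrative only
claims = [
    ("A", "burns", 12000), ("A", "burns", 15000), ("A", "stent", 90000),
    ("B", "burns", 11000), ("B", "stent", 95000), ("B", "stent", 99000),
]

# Group claim amounts by (district, surgery_type)
groups = defaultdict(list)
for district, surgery, amount in claims:
    groups[(district, surgery)].append(amount)

# Simple count/sum/avg/median per group -- the exploration described above
for key in sorted(groups):
    amounts = groups[key]
    print(key, len(amounts), sum(amounts), mean(amounts), median(amounts))
```

Even this much is often enough to surface the outlier districts and procedures before reaching for anything fancier.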


The other “requirements” of managed datastores in the cloud

We (me and @Vinod, author of extremexperts) have supported migrations to managed SQL Azure stores for quite some time. Customers like the ease of manageability, the availability and the decent performance.

There is another class of customers who keep getting pushed to “consolidate” databases and manage them to SLAs (DR/HA, backups with go-back-in-time-x, performance). These databases are not in TBs; they range from a few GBs to 100s of GBs.

1. There is a need for synchronization with on-premise databases – and, gasp, sometimes it needs to be bidirectional.

2. There is a need to meet security SLAs by providing auditing views and encryption.

The promise of the cloud, where it enables ease of management and availability, also needs to cover these scenarios. Hopefully in the future we will get them.

Nginx on Azure

Nginx works on Azure – absolutely no issues. It has very vast capabilities; I came to know of a few of them only when a customer requested that discussion.

1. Ability to control request processing – The customer wanted to throttle the number of requests coming from a particular IP address. This was easily done with the limit_req module directives. They allow an easy definition of the throttling behaviour, and of what to do when limits are reached or crossed. Such requests are logged, and a specific HTTP error can be sent back (503 is enough). The module also keeps the state of currently excess requests. Another learning was to use $binary_remote_addr to pack a little bit more into the zone – though it does make the stored keys harder to decipher. So, in the http block:

limit_req_zone $binary_remote_addr zone=searchz:10m rate=5r/s;

followed by, in the location blocks for the endpoints which need this (login/search):

location = /search.html { limit_req zone=searchz burst=10 nodelay; }

This protects very nicely against HTTP-level abuse, but does not protect against ping floods and other ways people can DDoS your application. That is best prevented/controlled in some kind of appliance (hardware) or at least with iptables – a different subject altogether.

2. Splitting clients for testing – This too is very easily done in the configuration, with the split_clients directive. It can also be used to set specific query-string parameters very easily.
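As a minimal sketch of an A/B split (the bucket values, the $variant variable and the backend name below are illustrative, not from the customer's setup): in the http block, split_clients hashes a string – here the client address – into percentage buckets, and the resulting variable can then be forwarded, for example, as a query-string parameter:

```nginx
# In the http block: hash the client address into named buckets.
split_clients "${remote_addr}AAA" $variant {
    20%     "b";      # 20% of clients get the test experience
    *       "a";      # everyone else gets the control
}

server {
    listen 80;
    location /search.html {
        # Pass the assigned bucket downstream as a query-string parameter
        # ("backend" is a hypothetical upstream)
        proxy_pass http://backend/search.html?variant=$variant;
    }
}
```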

Yes, there are dedicated services/apps to achieve the same functionality – but it is wonderful to learn something every day. Customers/partners are king and, honestly, great teachers.

NoSQL stores and aggregations

In the normal DB world we are comfortable running queries that produce statistics (sum/min/max/avg) and some percentiles. To achieve this efficiently we climb into the pre-aggregated world of OLAP across dimensions. If we need some kind of ranges/histograms/filters, we try to anticipate them, provide UI elements, and again push the queries to the datastore. With the bunch of in-memory columnar engines, we try to do them on the fly. With MPP systems we are comfortable doing it when required.

Over time the need has come up for aggregations to be created in a declarative manner.

The aggregations approach in Elasticsearch is a pretty nice addition, and ES runs them across the cluster.

One way to understand it: yes, ES aggregations too are queries of the search kind, but the declarative model makes them easier to fathom.

The top_hits aggregation is coming in 1.3; parent/child support is not there yet.

It definitely looks like a little bit more than facets – the composability is key.
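As a rough sketch of that composability (the field names here are made up, not from any real mapping), a terms bucket aggregation can nest a stats sub-aggregation inside one declarative request body:

```json
{
  "size": 0,
  "aggs": {
    "by_district": {
      "terms": { "field": "district" },
      "aggs": {
        "claim_stats": { "stats": { "field": "claim_amount" } }
      }
    }
  }
}
```

One request, and each district bucket comes back with its own count/min/max/avg/sum – the nesting is what facets could not do.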

With Cassandra, you do the extra work while writing. But in general it is not composable and can only do numeric functions. This requirement is tracked at a high level here.
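A minimal sketch of that write-time pattern using CQL counters (the table and column names are illustrative): a counter column can only be incremented or decremented in place at write time, which is exactly the “extra work while writing”:

```sql
-- CQL: a counter table (every non-key column must be a counter)
CREATE TABLE page_views (
    page  text PRIMARY KEY,
    views counter
);

-- Counters are never inserted, only updated in place on each write:
UPDATE page_views SET views = views + 1 WHERE page = '/search';
```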

Database engines sort of gave this field away by citing standards and performance/consistency issues rather than creating a decent declarative mechanism. 90% of the queries in the DB world would become simpler. Materialized views were the last great innovation, and we stopped there. Product designers/engineers should be made to look at the queries people need to write to get simple things done.