Data Ingestion and Store stories

In about last 6 months we have had good fortune to understand/implement 6 solutions for customers who need fast ingestion and then some kind of analytics on it. This is a gist of those interactions, what worked, what did not fly, workarounds.

These solutions pushed us to explore things not available out of box on the platform. We
were exposed to
– 3 customer designing monitor/analyze/predict solution. They had existing inhouse
platform but it requires local storage, changes involve changing the software/hardware.
2 of them did not even have “automated monitoring” – a person would go – “note down” the
reading on paper/smartphone web-app and then this would be aggregated and stored in
central location.
– All of them wanted to move to public cloud except one who wanted something they could
deploy on-premise too.

– Electricity
– Pharma manufacturing
– Healthcare
– Chemical/heavy metal manufacturing

Data sources
– Sensors
– linux/windows embedded devices collecting aggregating floor/section/machine wise data
– Humans entering data

– almost everything except healthcare varied from 10s of minutes to hours.

Data size
– Since data could get buffered/batched/massaged depending on situation. Never more than an MB.
– Few Hundred Kbs

In-Order/One time delivery guarantees?
– Very pragmatic customers – they were ok to define an error rate rather than insisting
on specifics.

Not even one wanted direct “sending” of data to “store”. They wanted local/web
intermediate processing. This why this internet of things where protocols are rigid and stores fixed was surprise for us all.
How to ingest fast
– what could be the front end
– does it make sense to have intermediate queue

How to scale the store 
– always capture- key condition

How to query with low latency
– search/lookup specific items – for logs/keywords/facet around them
– aggregates/trends
– detailed raw reports
– help in “outage”/”demand” – constant across electricity/manufacturing
– definition of outage/demand change

What works as store
– if you think of read queries before hand as it dominates the design (CQL or otherwise)
*** all facet kind of stuff – which is sort of groupby + no relevancy is dependent on
how data is stored.
– scales – scales and scales
– Reads are pretty good and many of “aggregates” which do not require as of last
millisecond/second resolution – can be done by simple running jobs which store these
“more latency” items in another store – k/v or relational and generally cached
aggressively. (mostly flushed out after x entries to another store)

– Push out data for other kind of analysis to fav store – hdfs and absorb into other
Challenge is monitoring(infra vs running of the Cassandra and parameter impact) and
skillset upgradation.

– At times customer have used store – numeric/other data in cassandra and push –
unstructured data(stack trace/messages/logs) out to lucene derivative –
Solr/ElasticSearch. Challenge has been “consistency” at given time but generally works.

How to ingest/Broker
– WebApi front ends pushing data into broker (rabbitmq/msmq/kafka) – mostly based on
experience and comfort factor
** To try – Akka/Orleans + Storm for Near real time analytics
** Only one brave soul still doing Kafka + Storm – painful to manage/monitor

Need better “Monitoring across the stack” tools.

Multi-tenancy is another issue which blows up due to sku differentiation where sometimes
data can shared/but updates-patching becomes an issue.
Data movement in Azure becomes a bigger issue and we have implemented as mentioned here.



Data Ingestion and Store stories