Over the past 3-4 years we have had multiple customers who:
- wanted to create a multi-tenant system for their customers to do pipelined text processing. The end result was generally a Lucene index or one of its derivatives (Solr or Elasticsearch), and the data was usually sourced from a relational store, MongoDB, Cassandra (exported data), or Azure tables or blobs.
- wanted to enable a hosted analytics system for their customers by picking up their data and executing aggregations across multiple sources. This involved data movement from various relational (6), MongoDB (2), and Cassandra (1) datastores.
- wanted to create backup solutions (not exactly block-level replication) at the user-mode file/directory level.
In general we have used the same basic concepts: an endpoint, multiple queues for breaking down the work internally, and worker roles that accomplish the required task.
We have worked with customers to create ad hoc systems which have sort of worked, but what we are really looking for is something like Chronos (Airbnb) running on the Apache Mesos framework, which would give us:
- Ability to scale out the system
- Isolation via Linux containers so that systems are “low impact”
- Multi-resource scheduling (memory/CPU aware), which is very important in a multi-tenant system
Challenges in creating a data pipeline
The following are some of the lessons learned. These systems were designed and deployed in a jiffy, and we never had the luxury of time.
- Diverse data sources across on-premises and public cloud with different behavior: bulk vs. streaming (we did not tackle them). For the text processing pipelines, the source was usually an HTML file.
- Cleanup of the data (transformations, addition of new data, lookups). What we have generally seen is the addition of data based on existing data.
- Simpler things, like picking up a directory of data vs. zipped data vs. a single large file
- Scheduling: pickup/drop/retry/recurring?
- Distributability of the task to parallelize the work. This requires thinking through the “graph” of the work and seeing what can be done independently. With a patient customer we could possibly do a little better than what we have done over time.
- Error handling vs. notification
Some kinds of errors are OK and can simply be “handled”; others require notification across multiple channels.
- What kinds of tasks we want to support
a) Out-of-the-box templated tasks against Azure services
b) Custom code
There were some customers who wanted to execute a “hosted Hadoop job” and track it via their own API endpoint (we never did this).
- Ability to define or infer dependencies between various tasks or individual schedules.
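The “graph” of the work and the dependencies between tasks mentioned above can be illustrated with a small sketch. This is not what we built; it only shows, under assumptions, how declared dependencies can be turned into groups of tasks that are safe to run in parallel. The task names are hypothetical, and it relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "export_db": set(),
    "fetch_html": set(),
    "clean": {"export_db", "fetch_html"},
    "index": {"clean"},
    "notify": {"index"},
}

def execution_waves(graph):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves

waves = execution_waves(pipeline)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Each wave could be handed to the workers at once; the next wave is released only after the previous one has finished.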
Service Endpoint + Orchestrator + specialized task queue + workers
An orchestrator creates the “pipeline” and persists the state of the tasks. A well-defined worker machine picks up a task and records the pickup and its progress in the task's status.
Unfortunately, something like copying, which is native to the OS, has to be repeated if a machine restarts. Many tools are not written to expose “progress”, so it either has to be inferred or custom tools have to be created to do the work.
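A minimal sketch of this state handling, assuming a single SQLite table as a stand-in for the persisted task state (our systems did not use SQLite; the table layout, status names, and function names are all hypothetical). It shows a worker picking up a task, reporting progress, and the orchestrator requeueing work that was interrupted by a restart.

```python
import sqlite3

def create_store(conn):
    # Hypothetical schema: one row per task with a status and a progress percentage.
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, status TEXT, progress INTEGER)"
    )

def submit(conn, name):
    """Orchestrator side: persist a new task in the 'queued' state."""
    cur = conn.execute(
        "INSERT INTO tasks (name, status, progress) VALUES (?, 'queued', 0)", (name,)
    )
    conn.commit()
    return cur.lastrowid

def pick(conn):
    """Worker side: take the oldest queued task and record the pickup."""
    row = conn.execute(
        "SELECT id FROM tasks WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'picked' WHERE id = ?", (row[0],))
    conn.commit()
    return row[0]

def report(conn, task_id, progress):
    """Worker side: record progress; 100 marks the task as done."""
    status = "done" if progress >= 100 else "picked"
    conn.execute(
        "UPDATE tasks SET status = ?, progress = ? WHERE id = ?",
        (status, progress, task_id),
    )
    conn.commit()

def requeue_interrupted(conn):
    """After a restart, unfinished work (such as a plain copy) starts over from zero."""
    cur = conn.execute(
        "UPDATE tasks SET status = 'queued', progress = 0 WHERE status = 'picked'"
    )
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
create_store(conn)
copy_id = submit(conn, "copy_blob")
submit(conn, "export_table")
picked_id = pick(conn)
report(conn, picked_id, 40)
requeued = requeue_interrupted(conn)  # simulate a machine restart mid-copy
state = conn.execute(
    "SELECT status, progress FROM tasks WHERE id = ?", (copy_id,)
).fetchone()
```

The pickup here is not safe against two workers racing for the same row; a real queue would need an atomic claim.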
Task queue with pre- and post-task conditions - this was an essential but rudimentary synchronization mechanism, and most of the issues we expected were here, in terms of what these conditions should be. Pre-conditions were limited to a table name or a folder/file.
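As a rough illustration of pre- and post-task conditions, here is a sketch in which a task runs only when its pre-conditions hold and counts as done only when its post-conditions hold afterwards. The `Task` class, the status strings, and the file names are hypothetical; the only condition shown is the existence of a file, in the spirit of the folder/file pre-conditions described above.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, List

Condition = Callable[[], bool]

@dataclass
class Task:
    name: str
    action: Callable[[], None]
    pre: List[Condition] = field(default_factory=list)   # must hold before running
    post: List[Condition] = field(default_factory=list)  # must hold after running

def run(task):
    """Return 'waiting' if a pre-condition fails, else run and verify post-conditions."""
    if not all(check() for check in task.pre):
        return "waiting"
    task.action()
    if not all(check() for check in task.post):
        return "failed"
    return "done"

def path_exists(path):
    # Condition factory: true once the given file or folder exists.
    return lambda: os.path.exists(path)

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "input.html")
target = os.path.join(workdir, "output.txt")

def copy_action():
    with open(source) as src, open(target, "w") as dst:
        dst.write(src.read())

task = Task("copy", copy_action, pre=[path_exists(source)], post=[path_exists(target)])

first = run(task)  # the source file does not exist yet
with open(source, "w") as handle:
    handle.write("<html></html>")
second = run(task)  # the source now exists, so the copy runs
```

A task that comes back as "waiting" would simply be left on the queue and retried later.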
Data sources and destinations - usually databases, Azure tables, or the blob store. (No, we did not do “Hadoop job integration”; we got lost in provisioning/confirming things, and our offering was not API-oriented at that time.)
Schedule - much simpler: cron-style jobs. We discussed an “event”-triggered job pipeline, but we never achieved it in either of the first implementations.
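For readers unfamiliar with cron-style schedules, here is a deliberately small sketch of matching a five-field cron expression against a timestamp. It is an assumption-laden illustration, not our scheduler: it supports only `*`, plain numbers, comma lists, and `*/n` steps, has no ranges or names, and it requires day-of-month and day-of-week to both match, which is simpler than what real cron does when both fields are restricted.

```python
from datetime import datetime

def field_matches(expr, value, low):
    """Match one cron field; `low` is the smallest legal value of that field."""
    for part in expr.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Steps count from the field's lowest value, e.g. */2 on days is 1, 3, 5...
            if (value - low) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expression, when):
    """True if `when` falls on a minute selected by the five-field expression."""
    minute, hour, day, month, weekday = expression.split()
    cron_weekday = (when.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
    return (
        field_matches(minute, when.minute, 0)
        and field_matches(hour, when.hour, 0)
        and field_matches(day, when.day, 1)
        and field_matches(month, when.month, 1)
        and field_matches(weekday, cron_weekday, 0)
    )

# Every 15 minutes during the 02:00 hour.
print(cron_matches("*/15 2 * * *", datetime(2024, 1, 1, 2, 30)))
```

A scheduler loop would evaluate each job's expression once per minute and enqueue the jobs that match.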
External notification service - which polls the “states” of the running/stopped/failed jobs and sends out notifications
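A sketch of the polling idea, assuming the job states are available as a simple mapping and that each notification channel is just a callable taking a message. The state names, the `NOTIFY_STATES` set, and the in-memory “channels” are hypothetical. It notifies only on state changes, and only for states that need attention.

```python
# Hypothetical: states that warrant a notification; 'running' and 'done' stay quiet.
NOTIFY_STATES = {"failed", "stopped"}

def poll_and_notify(jobs, seen, channels):
    """Compare current job states with the last seen ones and notify on changes.

    jobs:     mapping of job id -> current state
    seen:     mapping of job id -> state at the previous poll (updated in place)
    channels: callables that each deliver one message (email, SMS, ...)
    Returns the number of messages sent.
    """
    sent = 0
    for job_id, state in jobs.items():
        if seen.get(job_id) == state:
            continue  # nothing changed since the last poll
        seen[job_id] = state
        if state in NOTIFY_STATES:
            for channel in channels:
                channel(f"job {job_id} is {state}")
                sent += 1
    return sent

# In-memory stand-ins for real channels.
email, sms = [], []
channels = [email.append, sms.append]

jobs = {"copy-1": "running", "copy-2": "failed"}
seen = {}
first = poll_and_notify(jobs, seen, channels)   # copy-2 failed: one message per channel
second = poll_and_notify(jobs, seen, channels)  # no change: nothing is sent
jobs["copy-1"] = "stopped"
third = poll_and_notify(jobs, seen, channels)   # copy-1 stopped: one message per channel
```

Remembering the last seen state is what keeps a long-failed job from producing a notification on every poll.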
Majority use case
Copy data from one Azure hosted asset to another
What we were not good at or did not pursue
- Dynamic creation of resources (VMs from a predefined pool, table locations, or folders)
- We tried using a “state” persistence service but gave up once we migrated to a simpler option.
- Copying database backups (one of our recent tweets mentioned the trouble we have had in transferring the data). The Azure PowerShell interface has a little more capability than the Service Management API.