GIDS 2014 – Learning from Data – A busy SW professional’s guide to Machine learning

For the last couple of years we have been helping customers analyze data to find insights that are not normally surfaced through reports/KPIs. I must extend gratitude here – first to Anand S, whose simple visual analytics helped me move forward. For years I have read Tufte and other books, but mostly from a usability angle. With D3.js, ggplot2 and the wonderful pandas/scikit and R tool set, life has become much simpler for quickly cleaning and analyzing data. Thanks, Anand, for sharing your stories.

Another is my team-mate Vinod, who encouraged me to share rather than try to be perfect, as he saw me struggle and learn through the last 3 years of various customer engagements. We realized that active learning, reinforcement learning, bramble forest, NDCG, the GINI coefficient and the humongous maths/algebra/statistics are all important. Algorithm choice is important. But we spent more time cleaning and organizing the data, and more time understanding (and getting frustrated about) why the data was not telling us something. Also thanks to Pinal for extending the invite and pushing.

Others are the tools and discussion lists – the Wise.io, BigML, SkyTree, R, Scikit, Pandas, Numpy, Weka and vowpal wabbit folks; the folks at Cloudera trying to integrate that English firm (Myrrix?); that database which does all the approximation; and all the small startups (import.io, mortardata, sumologic and everybody else).

Our own toolset in SQL Server was very dense and difficult to adopt – it required way too much ceremony.

It is important to create a simpler way to understand the path into the field and de-mystify it, so that the rest of the travellers on this journey do not have the issues that we had. With that intention I am presenting at GIDS 2014 Bangalore.

I am as excited as ever because we get to meet a different audience, and the expectations are completely different.

Event Location: J. N. Tata Auditorium
National Science Symposium Complex (NSSC)
Sir C.V.Raman Avenue, Bangalore, India

The complete schedule is published here.

My talk is on 25th April 2014.

Time : 11:40-12:25 

This session is targeted at folks who are curious about machine learning and want to get the gist by looking at examples rather than dry theory. It will be a crisp presentation that takes various datasets and uses a bunch of tools. The intention here is to share a way to comprehend, at a high level, what is involved in machine learning. Since the ground is very vast, this session will focus on applied usage of machine learning, with demos using Excel, R, Scikit and others. You will walk out knowing what it means to create a model using simple algorithms and to evaluate that model. The idea is to simplify the topic and create enough interest so that attendees can follow up on the topic on their own using their favourite tool.
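
To give a taste of the kind of demo I mean, here is a minimal scikit sketch (the dataset and algorithm are just placeholders for whatever we use on stage):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Shuffle and hold out roughly 30% of the rows as a test set.
rng = np.random.RandomState(0)
idx = rng.permutation(len(X))
split = int(len(X) * 0.7)
train, test = idx[:split], idx[split:]

model = LogisticRegression()        # a simple, well-understood algorithm
model.fit(X[train], y[train])       # "create a model"
print("accuracy:", model.score(X[test], y[test]))  # "evaluate the model"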

I am also hoping I will get to bump into familiar friends (Pinal Dave, Amit Bahree, Sunil, Praveen, Balmukund) and hopefully Erick H and the whole Solr gang, Siva (qubole), Regunath (Aadhar fame), Venkat.

Data movement in Azure

Over the last 3-4 years we have had multiple customers who:

- wanted to create a multi-tenanted system for their customers for pipelined text processing. The end result was generally Lucene or one of its derivatives (Solr or Elasticsearch), with data usually sourced from a relational store, Mongo/Cassandra (exported data), or Azure tables/blobs.
- wanted to enable a hosted analytics system for their customers by picking up their data and executing aggregations across multiple sources. This involved data movement from various relational datastores (6), Mongo (2) and Cassandra (1).
- wanted to create backup solutions (not exactly block-level replication) at the user-mode file/directory level.

In general we have used the same basic concepts: an endpoint, multiple queues for breaking down the work internally, and worker roles to accomplish the required task.
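
A minimal, purely illustrative sketch of that shape in Python (an in-process queue standing in for the actual Azure endpoint/queue/worker roles):

import queue
import threading

# The "endpoint" drops work items into a queue; workers drain it independently.
work_queue = queue.Queue()

def endpoint_submit(task):
    """Stand-in for the service endpoint: accept a request and enqueue it."""
    work_queue.put(task)

def worker(worker_id):
    """Stand-in for a worker role: pull tasks off the queue and process them."""
    while True:
        task = work_queue.get()
        if task is None:              # sentinel to shut the worker down
            break
        print("worker %d processing %s" % (worker_id, task))
        work_queue.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()

for name in ["copy blob A", "index document B", "aggregate table C"]:
    endpoint_submit(name)

work_queue.join()                     # wait until all submitted tasks are done
for _ in threads:
    work_queue.put(None)              # stop the workers
for t in threads:
    t.join()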

We have worked with customers to create ad hoc systems which have sort of worked, but what we are really looking for is something like Chronos (Airbnb) running on the Apache Mesos framework.

Why
- Ability to scale out the system
- Isolation via Linux containers so that systems are “low impact”
- Multi-resource scheduling (memory/CPU aware) – in a multi-tenanted system this is very important

Challenges in creating a data pipeline

Following are some of the learnings as these systems were designed and deployed in a jiffy; they never provided the luxury of time.

- Diverse data sources across on-premise and public cloud with different behaviors – bulk vs streaming (we did not tackle streaming). For the text processing pipelines this was usually an HTML file.
- Cleanup of the data (transformations, addition of new data, doing lookups). In our experience it is generally the addition of data derived from existing data.
- Simpler things like picking up a directory of data vs zipped data vs a single large file.
- Scheduling – pickup/drop/retry/recurring?
- Distributing the task to parallelize the work. This requires thinking through the “graph” of the work and seeing what can be done independently (a small sketch of this appears after the list). With a patient customer we could possibly have done a little better than we have over time.
- Error handling vs notification. Some kinds of errors are OK and “handled”, while others require notification across multiple channels.
- What kind of tasks we want to support:
a) Out-of-the-box templated tasks against Azure services
b) Custom code
There were some customers who wanted to execute a “hosted Hadoop job” and track it via their own API endpoint (we never did this).
- Ability to define or infer dependencies between various tasks or individual schedules.
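
Here is the small sketch referred to above – a toy task graph (names are made up) grouped into levels that can run in parallel:

# Toy dependency graph: task -> tasks it depends on (names are illustrative only).
deps = {
    "extract": [],
    "cleanup": ["extract"],
    "lookup": ["extract"],
    "aggregate": ["cleanup", "lookup"],
    "notify": ["aggregate"],
}

def parallel_levels(deps):
    """Return tasks grouped into levels; tasks within a level have no unmet dependencies."""
    done, levels = set(), []
    while len(done) < len(deps):
        level = [t for t, d in deps.items() if t not in done and all(x in done for x in d)]
        if not level:
            raise ValueError("cycle detected in the task graph")
        levels.append(level)
        done.update(level)
    return levels

print(parallel_levels(deps))
# e.g. [['extract'], ['cleanup', 'lookup'], ['aggregate'], ['notify']]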

Components
Service Endpoint + Orchestrator + specialized task queue + workers
An orchestrator creates the “pipeline” and persists the state of the tasks. A well-defined worker machine picks up a task and records the pickup and its progress in the task’s status.
Unfortunately something like copying, which is native to the OS, has to be repeated if the machine restarts. Many tools are not written to expose “progress”, so this either needs to be inferred or custom tools need to be created to achieve it.
Task queue with pre- and post-task conditions – this was an essential but rudimentary synchronization mechanism, and most of the issues were expected here in terms of what those conditions are. Pre-conditions were limited to the presence of a table name or a folder/file location.
Data sources and destinations – usually databases, Azure table or blob store. (No, we did not do “Hadoop job integration” – we got lost in provisioning/confirming things and our offering was not API-oriented at that time.)
Schedule – much simpler – a cron-style job. We discussed an “event”-triggered job pipe, but we never achieved it in either of the first implementations.
External notification service – which polls the “states” of the running/stopped/failed jobs and shoots out notifications.
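
A rough sketch of how those pieces hang together (all names and the file-based pre-condition are illustrative, not the actual implementation):

import os

class Task:
    """A pipeline task with a simple pre-condition and a persisted status."""
    def __init__(self, name, precondition_path, action):
        self.name = name
        self.precondition_path = precondition_path   # e.g. a folder/file that must exist
        self.action = action
        self.status = "pending"                      # pending -> running -> done/failed

class Orchestrator:
    """Creates the pipeline and persists task state; workers poll it for runnable tasks."""
    def __init__(self, tasks):
        self.tasks = tasks

    def next_runnable(self):
        for task in self.tasks:
            if task.status == "pending" and os.path.exists(task.precondition_path):
                return task
        return None

def worker_run(orchestrator):
    task = orchestrator.next_runnable()
    if task is None:
        return
    task.status = "running"          # a real system would persist this to durable storage
    try:
        task.action()
        task.status = "done"
    except Exception:
        task.status = "failed"       # the notification service would pick this state up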

Majority use case

Copy data from one Azure hosted asset to another

What we were not good at or did not pursue
- Dynamic creation of resources (VMs from a predefined pool, table locations or folders)
- We tried using a “state” persistence service but gave up once we migrated to a simpler option.
- Copying database backups (a recent tweet mentioned the trouble we have had in transferring the data). The Azure PowerShell interface has a little more capability compared to the Service Management API.

Only John Foreman could pull this one off

Well, two of them.

1. An awesome, easy-to-grok book on data science. Thanks, John, for creating something so simple. I guess your book did not receive the As from the cool kids who are stumbling over the latest ML package. You chose the workhorse of them all to explain the underpinnings of every approach (clustering/regression/Bayes…) – Excel. Gasp.

But you know what, you just wrote the easiest-to-understand approach to data analysis. I wish this were included as applied ML in Coursera/edX etc.

From the days of dealing with the dealer to this – you just nailed it. Waiting for the next episode.

2. You taking on Tufte by defending PowerPoint. Glorious. Nobody messes with that god, and all this while you were getting your mana from him. Priceless.

I could go on about the wonderful, natural use of language and the humor in the book. I guess I will let readers go and get it.

If you are one of those people who prefer a book to learn from and refer to, then this is one of them. He throws in an R translation for good measure. For folks who have taken a deeper course in ML techniques this will be a shocker, as he does not delve into the lasso or Gibbs sampling or your favorite technique for reducing parameters.

It is a great companion to the R book An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani.

For me this book is in the category of learning-cum-reference books, in the likes of Kernighan, the Haskell book, The Little Schemer, Don Box’s Essential COM, the Java concurrency book, Clojure Programming (Chas Emerick), the Unix book, and Eric Sammer’s Hadoop Operations. I would dare put it in Skiena’s category. That is just me. This book goes over pitfalls and explains the approach in a crisp way.

Next book to wait for – Nathan Marz.

High Availability options for databases (Postgres/Oracle/SQL Server/MySQL) on Azure

This post was created after working with multiple customers requesting information about High Availability options on Azure for different databases. In general, high availability has to be thought through from the ground up for a system: power, gateway access, internal redundant network paths, hardware failures (disk/CPU/machine) – everything has to be considered. Since Azure takes care of the utilities (power, network paths, etc.), we need to focus on local HA within a datacenter and DR across datacenters to on-premise/another location.

Most folks are aware of, and have used, some kind of cluster-based service which provides local failover. For example, disk issues are taken care of by storage systems and their respective RAID setups. Some cluster services also provide load balancing of requests, and others redirect reads to secondaries. Their inner workings are not the focus of this discussion. We are also going to assume client-side XA transactions are generally not adopted, nor a great idea.

Traditionally, clustering technology and a good SAN were required to provide local HA for a database. Cloud platforms have created a level field, obsoleting the requirement for expensive SANs and heavy cluster configurations (earlier, for some databases, the machine configuration had to be exactly the same, etc.).

read-replica

Conceptually, most relational databases follow the above picture faithfully. There are various mechanisms to synchronize databases, but some databases also allow seamless failover to a secondary, read replicas and client connection management. In a pure cloud setup in one datacenter, one will try to synchronize transactions through a log push/pull mechanism between two local instances, and take the log and apply it to a secondary in another location for disaster recovery purposes. This is very similar to the on-premise setup of these databases.

With SQL Server, AlwaysOn on Azure is ready for use at present. Failover for clients is automatic and the secondary takes over from the primary, as the databases are in sync.

With Oracle, at present Active Data Guard, running in pretty much the same way as SQL Server, is the preferred path. The SCAN feature (which is part of RAC) with managed failover is not currently present in ADG; for SQL Server folks, SCAN is a concept similar to the Listener in AlwaysOn. This does mean there is an impact on RTO. There is also the GoldenGate option for folks requiring that kind of functionality (basically multi-master). There are a bunch of features in the latest Oracle databases to help automatic client failover, but applications need to take those into account.

PostgreSQL too has choices for HA (I have not tested streaming). Admittedly it is a little low-level in terms of precautions – e.g., requiring adjusting log file sizes (wal_keep_segments) for streaming. Edit – and no, there is no suggestion to go and try a custom cluster solution or pgpool/pgbouncer, as clouds are intolerant of holding on to IP addresses. Pgpool is very expansive and solves too many problems, which basically means one has many things to look at when there are issues. For .NET/SQL folks – think of Pgpool as akin to SQL AlwaysOn (minus the Postgres-specific features like load balancing, replication, parallel query), while PgBouncer is just an out-of-process connection pooling process.
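
For reference, the streaming-replication related settings on the primary look roughly like this (9.x-era settings with illustrative values; the standby additionally needs a recovery.conf with standby_mode and primary_conninfo):

# postgresql.conf on the primary (illustrative values)
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 128      # how much WAL to retain for a lagging standby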

MySQL has well-established HA/DR – master HA (the simplest being a master-slave combination). But anything requiring a proprietary cluster service/hardware is not suggested for a cloud environment. MHA (from DeNA) is closest to SQL Server AlwaysOn in terms of concepts. At present I have no idea about Tungsten, which also provides application failover.

The concept of providing availability is the same across all the relational databases:

1. Create a virtual network to isolate the workload (and, more importantly, retain an IP in Azure).

2. Create an availability set of two database servers to ensure the pair does not go down at once for maintenance.

3. Create the primary server and choose the option for pushing/pulling data to/from the secondary (push requires the secondary to be up). I am glossing over setting up WSFC for SQL Server, the listener, Active Directory config, setting up archive mode on the primary, and taking a backup and applying it to the secondary.

4. Create additional read replicas if the database technology allows it.

5. Create a remote location for DR to on-premise in an asynchronous way. Using another cloud location requires the cloud to provide inter-datacenter connectivity.

6. If a technology like the listener in SQL Server is supported, configure it to provide failover for the local setup.

Monitoring of the setup is required to detect network partitions and workload issues on either of the machines.
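
A bare-bones sketch of the kind of reachability check involved (hosts and port are placeholders; a real monitor would also watch replication lag and resource counters):

import socket

# Placeholder endpoints for the primary and secondary database servers.
NODES = {"primary": ("10.0.0.4", 1433), "secondary": ("10.0.0.5", 1433)}

def reachable(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, (host, port) in NODES.items():
    status = "up" if reachable(host, port) else "DOWN - raise a notification"
    print("%s (%s:%d): %s" % (name, host, port, status))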

For Azure, the availability set is the mechanism for ensuring machines within the set do not all go down for a planned maintenance job (by the Azure team). For cloud 2, use Multi-AZ at a minimum for local HA to host the machines, and push out to another region for database DR. For cloud 3 one again has to use the concept of zones to isolate “locally”, plus replicate to a region for the database. (For truly moving the application a lot more has to be done, but we will restrict ourselves here to the database.)

ElasticSearch on Azure – sure

Why

1. Get the event logs, error traces and exceptions in one location and enable powerful search which can scale out seamlessly. Ideally one could/should use logstash (a poor man’s *plunk alternative).
2. Create a search frontend for your application for frequently looked-up items, cached items, or just a regular search-based system – what you would do for a catalog of items, issues (customer pain points) or, gulp, even the primary data store for certain kinds of applications.

We have used and proposed Solr earlier; lately Elasticsearch’s monitoring and the simplicity of its scale-out/availability are what have made us push this Lucene-based alternative more for customers.

When you would not use this kind of search service
If there is a hosted native search service which offers cheaper storage and better query times (based on a faster backend), or you are ready to pay the $ for a given throughput and storage.

How

sudo apt-get install openjdk-7-jdk
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.tar.gz
tar -xzf elasticsearch-0.90.7.tar.gz

cd elasticsearch-0.90.7
./bin/elasticsearch -f (and you start dumping/querying the data), or put it in init.d
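
For example, dumping a document in and querying it back over the REST API looks roughly like this (the index/field names here are just placeholders):

import json
import urllib.request

ES = "http://localhost:9200"

def put_json(path, doc):
    """Index a JSON document with an HTTP PUT and return the parsed response."""
    req = urllib.request.Request(ES + path, data=json.dumps(doc).encode("utf-8"), method="PUT")
    return json.loads(urllib.request.urlopen(req).read().decode("utf-8"))

# Index one sample log event (refresh=true makes it searchable immediately), then query it back.
print(put_json("/logs/event/1?refresh=true", {"level": "ERROR", "message": "disk quota exceeded"}))
hits = urllib.request.urlopen(ES + "/logs/_search?q=message:disk").read().decode("utf-8")
print(json.loads(hits))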

Monitor
ElasticHq – http://www.elastichq.org/support_plugin.html (available as hosted version too)
Kopf – https://github.com/lmenezes/elasticsearch-kopf
BigDesk – https://github.com/lukas-vlcek/bigdesk/ (more comprehensive imho)
OOB stats – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
Paid – http://sematext.com/spm/elasticsearch-performance-monitoring/

Learn – http://www.elasticsearch.org/videos/bbuzz2013-getting-down-and-dirty-with-elasticsearch/
Tips
Use Oracle JDK
Use the G1 GC (http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All)

Kibana/logstash also work without any issues.

Caveat – Azure does not support multicast, so discovery becomes unicast-based and the hosts are pretty much hard-coded into the configuration file.
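
The relevant bits in elasticsearch.yml end up looking something like this (the IPs are placeholders for the other nodes’ addresses):

# elasticsearch.yml – turn off multicast and list the other nodes explicitly
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5"]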

Spark on Azure using Docker – works

For the past few weeks I have been trying out Docker and found it useful for conveying the value of lightweight containers for dev/test.

It is based on cgroups and LXC. LXC has an extremely simple CLI to use and run with (as a user I remember being excited by Solaris containers a long time ago). Docker makes it much more powerful by adding versioning and reusability, imho.

I used it on Azure without issues. When Spark’s Docker-friendly release was mentioned by Andre, it went onto my to-do list and stayed there for a long time. The intent was to run the perf benchmark using the memetracker dataset – I will get it onto a full-fledged cluster one of these days.

Everything mentioned in the repo worked without issues – I just cloned the docker scripts directly. The only change was for the cloning; I used the following statement:

git clone http://github.com/amplab/docker-scripts.git

The challenge with any new data system is learning import/export of data, easy querying, monitoring, and finding root causes. That will require some work in a real project somewhere down the road. I got distracted by the use of Go in Docker in between.