Disclaimer – I do not work for Qubole. I am also not an analyst. I am just fascinated by the database spring that we have been witnessing for the last couple of years. I maintain the hope of eventual cheaper/better/faster option(s). (update) Heck, I still remember what Data General used to offer – the whole hardware-to-software solution. The way we used to schedule jobs, ask for quotas and, damn, even lift the disks from the storage area to the compute area. That was compute/storage on demand too :). Sir, you want to analyze sales of quarter x – please bring your data, schedule your job and wait at the console for your turn.
There are three kinds of approaches vendors (new or old) have taken to Hadoop’s presence.
Traditional DB vendor with “integration story”
1. We will help you store/retrieve cold/processed data in/from HDFS. You can do your fancy jobs there, aggregate the data, and we can extract it back here. Our tools can help with dumping/extraction/cleaning up. Our existing engine can serve your workloads much better.
2. We can run a query across two stores – relational & HDFS (using our own query mechanism or one integrated into Hive).
Traditional DB or the NewSQL vendor with “Memory is cheap”
Here the traditional DB is the dominant usage scenario, and the messaging remains: not everybody is Facebook/Twitter/Amazon and needs these solutions.
1. We can add massive memory and still use the simple database without changes to the access/store models (will support Mohan, i.e. ARIES-style recovery, and the notions of buffer pool/locks). This will suffice for many without sophisticated hardware (InfiniBand/storage magic) underneath.
2. The columnar access pattern dominates the workload, so let us optimize for it and compress/store those maps in memory.
3. Let us take a leap of faith and do away with the “buffer pool” and related latches/locks, but maintain parity with SQL and ACID, which developers understand.
Traditional DB vendors have challenges with
– Horizontal scale-out based on data, as partitioned data requires awkward compromises on the columns/keys.
– On the other hand, as shared-nothing scale-out happens, maintaining developer calm by providing minimum consistency (pushing changes in sync to x replicas, pushing reads to replicas) becomes an issue.
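The "awkward compromise" on columns/keys can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the shard count, the `ORDERS` rows and the function names are all hypothetical and do not come from any vendor. Data is hash-partitioned on `customer_id`, so a lookup on that key touches one shard, while a lookup on any other column has to ask every shard.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str) -> int:
    """Stable hash of the partition key to a shard number."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


# Hypothetical rows, partitioned by customer_id.
ORDERS = [
    {"customer_id": "c1", "region": "east", "amount": 10},
    {"customer_id": "c2", "region": "west", "amount": 25},
    {"customer_id": "c3", "region": "east", "amount": 40},
    {"customer_id": "c1", "region": "west", "amount": 5},
]

shards = defaultdict(list)
for row in ORDERS:
    shards[shard_for(row["customer_id"])].append(row)


def orders_for_customer(customer_id):
    """Lookup on the partition key: exactly one shard is touched."""
    shard = shard_for(customer_id)
    rows = [r for r in shards[shard] if r["customer_id"] == customer_id]
    return rows, 1


def orders_for_region(region):
    """Lookup on any other column: every shard has to be asked."""
    rows = []
    for shard in range(NUM_SHARDS):
        rows.extend(r for r in shards[shard] if r["region"] == region)
    return rows, NUM_SHARDS
```

Whichever column is picked as the partition key, queries on the other columns pay the fan-out cost, and that is the compromise.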
Pure Hadoop based vendors
– Get a more efficient filesystem, add a memory-based cache, add something more than just the MR pattern, compression at storage, HA (fixing it in innovative ways), improving operations (overcoming accidental deletes/differential backups/replication across DCs?).
– Push changes which will benefit everybody into the main trunk in the public repository (YARN, for instance, or HBase).
Hosted Hadoop & services
1. Will help you create clusters from a UX or the command line, change settings, monitor conditions.
2. Will really go ahead and fix/add things which are missing and make the hosted platform more appealing.
3. Add security features (authentication/storage).
Qubole lies in the 2nd type of hosted vendors. Why did they attract so much love and respect from me personally? A vendor who goes and creates the following deserves all the kudos.
1. A way to bring quotas, Kill Mode and TestMode to the data extraction/massaging world – they know what happens in the real world (mistakes, learning on the job, bad data most of the time). Biggest of them all – automagically creating horsepower by adding nodes based on data in jobs(?) plus heuristics, and scaling it back down.
2. Missing features – upsert (how important is that for data movement), moving data out of Hive partitions (again solving practical issues).
3. Really taking advantage of the cloud vendor’s abilities – add/test hybrid/spot instances (bidding/timeout for the instances/percentage of spot instances).
- If they add an on-premise option to work with a traditional private cloud provider, this will end the search for other options.
- Working with ISVs to bundle it is an altogether different ballgame.
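To give a feel for what "horsepower based on data in jobs + heuristics" together with a spot-instance percentage could look like, here is a minimal Python sketch. It is purely my own guess at the shape of such a heuristic, not how Qubole does it: `ClusterPolicy`, `gb_per_node`, `spot_percent` and both functions are hypothetical names and numbers.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusterPolicy:
    min_nodes: int = 2          # never scale below this
    max_nodes: int = 20         # quota: never scale above this
    gb_per_node: float = 50.0   # hypothetical: data one node can chew through
    spot_percent: int = 50      # share of the extra nodes requested as spot


def target_nodes(pending_gb: float, policy: ClusterPolicy) -> int:
    """Size the cluster from the data waiting in jobs, clamped to the quota."""
    wanted = math.ceil(pending_gb / policy.gb_per_node)
    return max(policy.min_nodes, min(policy.max_nodes, wanted))


def split_spot(total: int, policy: ClusterPolicy) -> Tuple[int, int]:
    """Return (on_demand, spot); only nodes above the minimum may be spot."""
    extra = total - policy.min_nodes
    spot = extra * policy.spot_percent // 100
    return total - spot, spot
```

With the defaults above, 600 GB of pending data asks for 12 nodes, split into 7 on-demand and 5 spot; when the queue drains, the target falls back to the 2-node minimum with no spot nodes at all.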
Another disclaimer – These are my own opinions as a humble data person and do not reflect those of my employer. I just look at what is delivered/documented in the public domain.
(update) This does not mean pure Hadoop vendors are not ahead in fixing enterprise issues/meeting requests. Actually they are far ahead; it is the hosted platform which is the point of discussion in this 10 min post. Some of the pure Hadoop distribution vendors have the tougher task of thinking through what remains inside vs. what is available outside. Training/mentoring/competing with existing enterprise DB/app sales can’t be the long-term goal when people are focusing on “solutions” – http://www.theregister.co.uk/2012/11/11/police_ibm_analysis_crime_prevention/. This post also bypasses the excellent MPP systems and the innovative ways they integrate with HDFS or the Hadoop ecosystem, as I have never been able to look at them (forget access).