Attended a well-conducted event at InMobi on Hadoop. Kudos to the InMobi folks for opening up and sharing not just their work (Yoda) but also awesome food and drinks. Rarely have I seen a more generous platter than at yesterday's event.
Event – The hosts, Vinayak Hegde (data platform owner) and Sharad Agarwal (ex-Yahoo, Hadoop/YARN committer, present platform head at InMobi), were punctual, humble and kind. Vinayak and Sharad provided the needed time checks and context, and hoped to continue the effort with help from the community. The turnout was varied, from recent new hires to people with multiple decades in the industry. There were a lot of Yahoo folks (no surprise), Nokia (100-node cluster), Huawei (apparently built an HA setup and have deployed a cluster of x nodes), NetApp (Bejoy and team with a y-node cluster), and Mu Sigma (evaluating and using various pieces of Hadoop). Joydeep (ex-Facebook, Hive creator) came around to see and meet folks. I was looking for the Raptor folks from SunGard, though.
Sonal on Crux – This talk had two pieces: how Crux allows API-based interaction for mapping and reporting of data inside HBase, and her intention to get people to contribute and help build out the other moving pieces. Crux at present goes directly against HBase rather than translating from a SQL-like query language. She explained how one needs to design the backend carefully to ensure efficient, performant data access. Crux allows composite keys and filters, but at the end of the day secondary indexes and the like need to be thought through by the system designer. Since Crux is just a reporting tool, it can only do so much (the idea is to be nice to get/range operations – and how much a DB guy likes those operators: seek vs. scan (kill)). Kudos to her for getting something out and talking about it.
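The "design the backend carefully" point boils down to row-key design: in an HBase-style store, rows sort lexicographically by key, so a composite key can turn a report query into a cheap prefix range scan instead of a full scan. A minimal sketch of the idea (the `campaign42` entity, the separator byte, and the reverse-timestamp trick are my illustrative assumptions, not Crux's actual schema):

```python
import struct

def composite_key(entity_id: str, ts_millis: int) -> bytes:
    # Reverse the timestamp so the newest rows sort first within an entity's prefix.
    rev_ts = 2**63 - 1 - ts_millis
    return entity_id.encode() + b"\x00" + struct.pack(">q", rev_ts)

def prefix_range(entity_id: str):
    # Start/stop rows for a range scan over one entity's rows:
    # the stop key is the prefix with its last byte incremented.
    return entity_id.encode() + b"\x00", entity_id.encode() + b"\x01"

k_new = composite_key("campaign42", 2_000)
k_old = composite_key("campaign42", 1_000)
start, stop = prefix_range("campaign42")
# Newest row sorts first, and both rows fall inside the scan range.
assert start <= k_new < k_old < stop
```

This is why "seek vs. scan" matters: a query that fits the key layout touches only the rows under one prefix, while anything else (a secondary-index-style lookup, say) degenerates into scanning everything.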
Sharad's talk on next-generation Hadoop clarified the present constraints and hence the goals of the .23 world: HA (restart of NameNode vs. DataNode), scalability of the NameNode (the sheer footprint of everything it needs to keep track of and respond to), and the need to support alternative parallelizable algorithms without force-fitting them into MR. His talk was succinct and filled with great depth. The key idea is containers getting resource fulfillment from the ResourceManager, getting created via the NodeManager, and then spawning off an AppMaster to look after the application lifecycle independently. The application lifecycle of MR, iterative, or MPI jobs can thus be managed independently while resources are still managed centrally. Important takeaway for older installations: no change, their world remains the same. Data-affinity-based container spawning is possible – this looks interesting from the perspective of reducing network IO.
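The separation of concerns in that flow can be sketched as a toy model (this is not YARN's real API – class and method names here are mine – just an illustration of the roles: the ResourceManager grants resources centrally, the NodeManager launches containers, and a per-application AppMaster drives its own lifecycle):

```python
class ResourceManager:
    """Central resource arbiter: grants containers, knows nothing about app logic."""
    def __init__(self, total_containers: int):
        self.available = total_containers

    def allocate(self, requested: int) -> int:
        granted = min(requested, self.available)
        self.available -= granted
        return granted

class NodeManager:
    """Per-node agent: actually launches the granted containers (here, just names)."""
    def launch(self, n: int):
        return ["container-%d" % i for i in range(n)]

class AppMaster:
    """Per-application master: requests resources, then runs its own
    lifecycle (MR, iterative, MPI, ...) independently of the RM."""
    def __init__(self, rm: ResourceManager, nm: NodeManager):
        self.rm, self.nm = rm, nm

    def run(self, needed: int) -> int:
        granted = self.rm.allocate(needed)
        containers = self.nm.launch(granted)
        return len(containers)

rm = ResourceManager(total_containers=10)
nm = NodeManager()
mr_app, mpi_app = AppMaster(rm, nm), AppMaster(rm, nm)
assert mr_app.run(6) == 6
assert mpi_app.run(6) == 4  # only 4 left: resources stay centrally arbitrated
```

The point of the sketch is the last two lines: two very different application types share one central pool, and neither needs the framework to know anything about the other's lifecycle.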
The Yoda talk by Gaurav from InMobi was about the in-house data warehousing/reporting tool they built with few resources over a short period of time. It was slick. He explained the pragmatism of doing custom development compared to using Hive or other tools. Important pivots for the decision were documentation, community support, and the "inordinate" spawning of jobs without taking the metadata and layout of the data into the picture. It looks like a good solution to their issues and allows them to evolve it according to their needs. It is neither designed as a generic framework nor does it aim to be one. This honesty from the data framework team was refreshing: they were not trying to boil the ocean, and focused on their constraints (lack of massive clusters) and the needs of their analysts (in-house/publishers). It would have been great to see how they choose plans for execution (is it cost-based or ..), which operators they push up or down and on what basis – and if that is based on metadata of the data, how do they keep it updated?
From my perspective, it was also great meeting folks like Abhinasha from Bizosys – thanks, buddy, for the beer and for leading the assault on the food counter. For an old person like me, I am still looking for easier ways to adopt all this as an end user:
- A SQL DSL front end (for loading, Pig is OK, but the presence of Sqoop, Scribe, etc. is an explosion of choices – a lot of time is spent in evaluation)
- Debugging the performance of a given query: how many combiners and partitioners, which operator gets mapped to how many jobs, and how it takes care of affinity to data location (ideally, the less I need to know, the more productive I will be – relational DBs have made me lazier and biased), plus a way to extract only a given amount of data
- Monitoring, prioritizing, and concurrent access for read/write – these are what will ease us relational folks into that world
The day MapR or Hadapt combine these and provide Statspack-ish, DMV-ish monitoring support is the day the real revolution will begin. (For the record, I have not had time to look at the much-appreciated Cloudera distribution.)