Poetry and Coursera

I love poems and short stories, and I pick them up from time to time when I get some solitude. Below are two poems (one by Wallace Stevens and one in Hindi by Bodhisattva), with a link to a Coursera poetry course in between.

Continual Conversation With A Silent Man (Wallace Stevens)

The old brown hen and the old blue sky,
Between the two we live and die–
The broken cartwheel on the hill.

As if, in the presence of the sea,
We dried our nets and mended sail
And talked of never-ending things,

Of the never-ending storm of will,
One will and many wills, and the wind,
Of many meanings in the leaves,

Brought down to one below the eaves,
Link, of that tempest, to the farm,
The chain of the turquoise hen and sky

And the wheel that broke as the cart went by.
It is not a voice that is under the eaves.
It is not speech, the sound we hear

In this conversation, but the sound
Of things and their motion: the other man,
A turquoise monster moving round.

A poetry course is available on Coursera: http://blog.coursera.org/post/44385047250/my-coursera-experience-modern-poetry-for-a-physics

A poem about a poem by Bodhisattva (from jankipul)

कमजोर कविता
वह एक कमजोर कविता थी
कवि उसे छिपाता फिरता था
कुछ बहुत कमजोर विचार थे उसमें
कहने के लिए कुछ थोड़े शब्द थे घिसे पिटे
न अलंकार था न छंद न समास
नीरस सी जहाँ-तहाँ रूखे भाव उफनाए से दिखते थे उसमें।
उस कविता में न थे गंभीर तत्व
ऐसे ही थी जैसे बेस्वाद पापड़-चटनी-अंचार।
वह बहुत कमजोर कविता थी
जैसे असुंदर बीमार बेटी
जैसे कुरूप कर्कशा पत्नी
जैसे अपाहिज औलाद
जैसे मंद बुद्धि बैठोल पति
जैसे धूल धूसर जर्जर घर
जैसे फूटी ढिबरी
जैसे खेत ऊसर बंजर अनुर्वर
(read rest on the link above)

नको नको रे पावसा …!! – इंदिरा संत

There are many days when I long for rain. I believe I would be happiest in a cloudy place, one where clouds of different shapes and colors run riot in the sky. That also means I love the rains, the kind that listen to you. Here is a beautiful poem by Indira Sant. She converses with the rain: she asks it not to mess up her backyard and, actually, requests it to bring her lover back from a distance, with care, by using its lightning :).

नको नको रे पावसा …!!

 नको नको रे पावसा
असा अवेळी धिंगाणा
घर माझे चंदमौळी
आणि दारात सायली;
 
नको नाचू तडातडा
असा कौलारावरन,
तांब सतेलीपातेली
आणू भांडी मी कोठून?
 
नको करू झोंबाझोंबी
माझी नाजूक वेलण,
नको टाकू फुलमाळ
अशी मातीत लोटून;
 
आडदांडा नको येउं
झेपावत दारातून,
माझे नसेूचे जुनेरे
नको टांकू भिजवून;
 
किती सोसले मी तुझे
माझे एवढे ऐक ना,
वाटेवरी माझा सखा
तयाला माघारी आण ना;
 
वेशीपुढे आठ कोस
जा रे आडवा धावत,
विजेबा, कडाडून
मागे फिरव पंथस्थ;
 
आणि पावसा राजसा
नीट आण सांभाळून,
घाल कितीही धिंगाणा
मग मुळी न बोलेन;
 
पितळेची लोटीवाटी
तुझ्यासाठी मी मांडीन,
माझ्या सख्याच्या डोळयांत
तुझ्या विजेला पाजीन;
 
नको नको रे पावसा
असा अवेळी धिंगाणा
घर माझे चंदमौळी
आणि दारात सायली….
 
इंदिरा संत

Respecting a schemaless (dynamic schema) store's strength

Recently, several people using NoSQL/data-structure stores contacted me for guidance about daily operational issues, bad data, and a lot of finger-pointing.

In one case, the main product manages click impressions and tries to give "perspective" on views of an ad. Unfortunately, it is a "mixture of solutions", each with its own application life cycle and ownership.

Issues

1. Mismatched data types
Incident: Application 2 started storing arrays where a string was expected. The likely cause was a change in another application when a requirement moved from a single value to a multi-valued item. The error surfaced only intermittently, in a few filter queries and reports, and it was difficult to catch because it happened only for certain filters. Another impact is increased storage: if the information is small, a simpler type could have represented the data.

2. Subtle change in the name of a document element, resulting in missing data
Incident: Some applications started reporting incomplete data. The cause was an incorrect name change, which silently created a new field and led to a bigger bug.

3. Change in data type precision
Incident: Values suddenly did not match up in reconciliation because a float had been changed to an integer (gasp).
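To make the third incident concrete, here is a minimal sketch of how totals drift once fractional values are stored as integers. The amounts, the `reconciles` helper and its tolerance are all made up for illustration; they are not from the actual incident.

```python
# Illustrative amounts only; they are not taken from the incident.
amounts = [0.75, 1.25, 2.5, 0.5]

source_total = sum(amounts)                   # total the upstream system reports
stored_total = sum(int(a) for a in amounts)   # total after each value is stored as an integer


def reconciles(expected, actual, tolerance=0.01):
    """True when two totals agree within the allowed tolerance."""
    return abs(expected - actual) <= tolerance


print(source_total, stored_total, reconciles(source_total, stored_total))
```

Each individual value looks plausible on its own, which is why this kind of change tends to be noticed only when totals are compared.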

In a document store, by the very nature of it being a document database, it is very easy to make simple mistakes.

e.g.,

> db.test.save( { x: 1 } )

> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }

// Now store a string for the same key without an issue
> db.test.save( { x: "str" } )

> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }
{ "_id" : ObjectId("508b3d791a1d9f06773cb598"), "x" : "str" }

> db.campaigndetails.save(
{ _id: "xkcd", "details" : { campaign_id: 1, inventory_id: 2, audience_attr: 3 }, rev: 10 })

> db.campaigndetails.find( { "_id": "xkcd" } )
{ "_id" : "xkcd", "details" : { "campaign_id" : 1, "inventory_id" : 2, "audience_attr" : 3 }, "rev" : 10 }

// A mistyped key ("id" instead of "_id") matches nothing and raises no error
> db.campaigndetails.find( { "id": "xkcd" } )

Is that the store's issue? No. It is the user's responsibility (in this case, the developer's) to take the right measures so that these situations are avoided. What are the other repercussions? Say there is a "flow of data" to another place: any ETL or messaging-style processing has a greater chance of failing if a type changes or an extra field gets added.
How can one overcome it?

1. Use common data repository access code
2. Use common data repository validation code
3. Use common messaging transformation code
4. Use basic layout verification code
5. Review embedded structures. Since many document databases do not support joins, people create either embedded/hierarchical or linked structures. An embedded schema has implications when the "value" of a related item changes, and many issues show up here.

6. Reconcile the catalog (schema) with the data values. This can be run once in a while to look for any deviation from the agreed-upon schema.
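The validation and reconciliation ideas in the list above could be sketched roughly as follows. This is a minimal illustration in plain Python, not a feature of any particular store: the `EXPECTED_SCHEMA` catalog and the `find_deviations` helper are hypothetical names, and the documents simply mirror the campaign example shown earlier.

```python
# Hypothetical agreed-upon catalog for a campaign document: field name -> expected type.
EXPECTED_SCHEMA = {
    "_id": str,
    "details": dict,
    "rev": int,
}


def find_deviations(doc, schema=EXPECTED_SCHEMA):
    """Return human-readable deviations of one document from the catalog."""
    problems = []
    for field, expected in schema.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        elif not isinstance(doc[field], expected) or (
            expected is int and isinstance(doc[field], bool)
        ):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(doc[field]).__name__}"
            )
    for field in doc:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"_id": "xkcd", "details": {"campaign_id": 1}, "rev": 10}
renamed = {"id": "xkcd", "details": {"campaign_id": 1}, "rev": 10}     # key name drifted
retyped = {"_id": "xkcd", "details": {"campaign_id": 1}, "rev": 10.0}  # type drifted

for doc in (good, renamed, retyped):
    print(find_deviations(doc))
```

The same function can run in shared access code before every write, or as a periodic job over stored documents, which covers both the "validation code" and the "reconciliation" items.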

Another datastructure store example showing dynamic schema

redis 127.0.0.1:6379> set car.make "Ford"
OK
redis 127.0.0.1:6379> get car.make
"Ford"
redis 127.0.0.1:6379> get car.test
(nil)
redis 127.0.0.1:6379> get car
(nil)
redis 127.0.0.1:6379> get car.Make
(nil)
redis 127.0.0.1:6379> set u '{ "n":"f_name"}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\":\"f_name\"}"
redis 127.0.0.1:6379> set u '{ "n": 1}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\": 1}"

Sadly, people are throwing the baby out with the bathwater in their hurry, forgetting that somebody will need to maintain the data and derive value from it. Data is sacred, and its sanctity should always be maintained.

There is also the issue of schema design, which influences some of the issues above. Rather than blindly "embedding" documents, a simple check for "immutability" should be done. If there is a possibility of change, a lot of data gets duplicated as it grows, and it is better to keep it separate. Another suggestion would be to keep field names extra small.
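A rough sketch of why the "immutability" check matters, using plain Python dictionaries in place of stored documents. The advertiser/campaign layout and both helper functions are invented for illustration only.

```python
# Embedded: the advertiser name is copied into every campaign document.
embedded_campaigns = [
    {"_id": "c1", "advertiser": {"id": "a1", "name": "Acme"}},
    {"_id": "c2", "advertiser": {"id": "a1", "name": "Acme"}},
]

# Linked: campaigns hold only the advertiser id; the name lives in one place.
advertisers = {"a1": {"name": "Acme"}}
linked_campaigns = [
    {"_id": "c1", "advertiser_id": "a1"},
    {"_id": "c2", "advertiser_id": "a1"},
]


def rename_embedded(campaigns, advertiser_id, new_name):
    """Rewrite every embedded copy; return how many documents were touched."""
    touched = 0
    for campaign in campaigns:
        if campaign["advertiser"]["id"] == advertiser_id:
            campaign["advertiser"]["name"] = new_name
            touched += 1
    return touched


def rename_linked(advertiser_index, advertiser_id, new_name):
    """Update the single source of truth; always exactly one write."""
    advertiser_index[advertiser_id]["name"] = new_name
    return 1
```

With the embedded layout, the cost of a rename grows with the number of campaigns, and any copy that is missed becomes bad data. With the linked layout it stays at one write, at the price of an extra lookup on read.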

 

Flipkart & 5th Elephant – BigData event

Flipkart's talk submissions for 5th Elephant, a BigData event, put them way above the rest. I have never seen any firm bare and share so much about its technology stack. At the last HasGeek event I attended, they shared how they scale; the next time, they talked about front-end work. The sheer number and spectrum of talks is immense this time around.

If I were a startup, this would be like a crash course in getting the right thinking. I have not seen its famed bigger international lookalike (and potential local competition) ever share this many details. The closest comparable firms are Netflix (@adrianco) or, to an extent, LinkedIn. Twitter is ahead in terms of sharing its tools with everybody else.

This goes very well with Kiran's well-written article, in which he talked about sharing the source when there could be potential for misuse or outright piracy. Flipkart, on the other hand, is sharing its moving pieces, at least by talking about them. Very, very bold, IMHO.

Where is monetization/relevance for Social (search via friends)

Update – 13 Nov 2012 – Sometime in October 2012, one of my friends reached out about a job in the Java world. I tried the LinkedIn echo chamber. Only a few friends responded, but surprisingly LinkedIn could not infer the question in the post and redirect interested folks to me. I would have expected LinkedIn to show how my question/ask for help had reached x people who might be interested and with whom I should connect, or to give them the option to send back automatic requests for information or job advertisements. I personally think LinkedIn is the most data-driven company for a very specific need, and they can and will overcome that barrier.

Update – 23 Oct 2012 – One more thread on the Facebook Bangalore foodies group indicates the "need for social site monitoring and effective monetization". The possibility is twofold. First, sense whether a contributor's statement is a question or a comment, and push relevant stuff to him/her. Second, when a mere viewer comes in and spends active time on a thread, push relevant stuff to him/her in near real time: "Refrigerator with deal x", located y km from where you are :).

Would a refrigerator ad not be relevant here for the "Reader"?

Update – 6 July 2012 – There is SearchBuddy, an MSR research project, which talks about an "embedded bot kind of service" to identify and answer questions. It is a little sparse on the how, the time span limit, and relevance quality.

I follow a group for my interests (what else but food) on Facebook: http://www.facebook.com/groups/BLRfoodies/. It is a pretty active, honest community, and I respect the folks who spend time there.

First is a specific food suggestion -

Next is the generic ask for a good location for large bunch of people

Following is a very specific ask about getting herb(seeds/plants) from specific locations

Now the generic advertisements which are thrown when I am looking at that page

One has to respect FB (Mark) and Twitter (revenue guy) for accepting that they do not yet know what to do. There are folks like BloomReach/RecLabs who are trying to bring relevance for merchants. But just how does one surface a context-sensitive piece based on who the person is? Search is easy: you are looking for information.

But when just browsing a wall or timeline, there is a chance that you explore some of these items and ignore others. The main issue is the kind of information one shares or looks for in each of the networks. Then of course there is LinkedIn (which should buy Stack Overflow), which many folks keep segregated while others aggressively integrate it and send the same information everywhere. Quora is an extremely high-value site where people who are part of the network add that "dash" by providing real answers. Each one of them serves a different need.

Apart from segregation of "interest", the major issue is trust. Search engines are explicit: they sell our intent probability. But the moment FB/Twitter start doing something that violates that trust, people might switch back to closed networks such as phone calls and email.

Personally I have never clicked on ads. Ever.

Hadoop Meetup at InMobi

Attended a well-conducted event on Hadoop at InMobi. All kudos to the InMobi folks for opening up and sharing not just their work (Yoda) but also awesome food and drinks. Rarely have I seen a platter more generous than at yesterday's event.

Event: The hosts, Vinayak Hegde (data platform owner) and Sharad Agarwal (ex-Yahoo, Hadoop/YARN committer, presently platform head at InMobi), were punctual, humble and kind. Vinayak and Sharad provided the needed time checks and context, and hoped to continue the effort with help from the community. The turnout was varied, from recent new hires to people with multiple decades in the industry. There were a lot of Yahoo folks (no surprise), plus Nokia (100-node cluster), Huawei (apparently built an HA solution and have deployed a cluster of x nodes), NetApp (Bejoy and team, with a y-node cluster) and Mu Sigma (evaluating and using various pieces of Hadoop). Joydeep (ex-Facebook, Hive creator) came around to meet folks. I was looking for the Raptor folks from SunGard, though.

Sessions

Sonal on Crux: This talk had two pieces. She covered how Crux allows API-based interaction for mapping and reporting on data inside HBase; her intention was to get people to contribute and help build the other moving pieces. At present, Crux goes to HBase directly rather than through an in-between translation from a SQL-like query language. She explained how one needs to design the backend carefully to ensure efficient, performant data access. Crux allows composite keys and filters, but at the end of the day secondary indexes and the like need to be thought through by the system designer. Since Crux is just a reporting tool, it can only do so much (the idea is to be nice to get/range operations; a DB guy knows how much he likes these operators: seek vs. scan (kill)). Kudos to her for getting something out and talking about it.
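To illustrate why composite keys and range operations go together, here is a small sketch. It is not Crux's API or an HBase client call; it only relies on the fact that HBase keeps rows sorted by raw key bytes. The `row_key` layout (campaign id followed by a timestamp) and the `scan_bounds` helper are assumptions made up for this example.

```python
import struct


def row_key(campaign_id, epoch_seconds):
    """Pack both parts as fixed-width big-endian integers (4 + 8 bytes)."""
    return struct.pack(">IQ", campaign_id, epoch_seconds)


def scan_bounds(campaign_id, start_seconds, stop_seconds):
    """Start (inclusive) and stop (exclusive) row keys for one campaign's time window."""
    return row_key(campaign_id, start_seconds), row_key(campaign_id, stop_seconds)


# Stand-in for a table: a store that keeps rows sorted by raw key bytes.
rows = sorted(row_key(c, t) for c, t in [(2, 100), (1, 300), (1, 100), (10, 50)])

start, stop = scan_bounds(1, 0, 200)
hits = [struct.unpack(">IQ", key) for key in rows if start <= key < stop]
print(hits)

# Naive string keys sort lexicographically, so campaign 10 lands before campaign 2.
print(sorted(["2:100", "10:50"]))
```

Because the leading part of the key is the campaign, one campaign's rows sit next to each other and a time window becomes a short range read (seek) instead of a full scan. A query by time alone, across campaigns, would still need a different key layout or a secondary index, which is exactly the design work left to the system designer.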

Sharad's talk on next-generation Hadoop clarified the present constraints and hence the goals of the 0.23 world: HA (restart of NameNode vs. DataNode); scalability of the NameNode (the sheer footprint of everything it needs to keep track of and respond to); and the need to support alternative parallelizable algorithms without force-fitting them into MR. His talk was succinct and filled with great depth. The key idea is that containers get their resources granted by the ResourceManager and are created via the NodeManager, while a per-application AppMaster is spawned to look after the application life cycle independently. The life cycle of MR, iterative or MPI applications can thus be managed independently while resources are still managed centrally. The important takeaway for older installations: no change, their world remains the same. Data-affinity-based container spawning is possible, which looks interesting from the perspective of reducing network IO.

The Yoda talk by Gaurav from InMobi was about the in-house data warehousing/reporting tool they built with few resources over a short period of time. It was slick. He explained the pragmatism behind doing custom development rather than using Hive or other tools. Among the important factors in the decision were documentation, community support and the "inordinate" spawning of jobs without taking into account the metadata and layout of the data. It looks like a good solution to their issues and allows them to evolve it according to their needs. It is neither designed as a generic framework nor does it aim to be one. This honesty from the data framework team was refreshing, as they were not trying to boil the ocean and focused on their constraints (lack of massive clusters) and the needs of analysts (in-house/publishers). It would have been great to see how they choose execution plans (is it cost based or something else?), which operators they push up or down and on what basis, and, if that is based on metadata about the data, how they keep the metadata updated.

From my perspective, it was also great meeting folks like Abhinasha from Bizosys. Thanks, buddy, for the beer and for leading the assault on the food counter. As an old person, I am still looking for easier ways to adopt Hadoop as an end user:

- SQL DSL front end (for loading, PIG is OK, but the presence of sqoop and scribe is an explosion of choices; a lot of time is spent in evaluation)

- Debugging the performance of a given query: how many combiners and partitioners, which operator gets mapped to how many jobs, and how it takes care of affinity to data location (ideally, the less I need to know, the more productive I will be; relational DBs have made me lazier and biased). Also, a way to extract only a given amount of data.

- Monitoring/prioritizing and concurrent access for reads/writes are what will ease us relational folks into that world.

The day MapR and Hadapt combine and provide statspack(ish), DMV-ish monitoring support is the day the real revolution will begin. (For the record, I have not had time to look at the much-appreciated Cloudera distribution.)

Phpcloud event roundup – personal view

After a long time, I attended a well-conceived event.

[updated : 11th July 2011] : Urls of site/people/projects

[updated : 11th July 2011] : Renaming and few corrections/disclaimer.

This is not a complete view, as I could not attend the other fabulous sessions around membase, memcache, choosing the right PHP framework, and gearman. But I did hear great things about them and am hoping to catch them on video.

Good things-

1. Event website content, payment process, organizers

The conceptualization of the event was validated with potential speakers, their talks, and votes of interest. Kiran took pains to look at the public profiles and GitHub accounts of speakers. A very nicely designed website for potential attendees was the first temptation, a real contrast to vendor-backed, agency-made cookie-cutter output. Although not all talks were focused on PHP or deep PHP, the theme of cloud-hosted PHP websites connected them.

Smooth checkout/payment integration was another plus. I wished there were a "come back later, details will be preserved" button, as I entered my details multiple times while struggling to justify staying out for a full day away from my little kiddo and other stuff.

Another good thing was that there was no overhyped "changing the world" kind of talk, just a bunch of professionals who wanted to provide a reasonable venue for the exchange of thoughts. More importantly, there were no marketing-only types or all-talk-and-no-work types projecting themselves as a gift to humanity. This was apparent in most of the speakers, especially those from Zynga, Facebook, Capillary and MobStac, and some Yahoo folks. (After the event, my impression was that it met or exceeded expectations.)

2. Event venue – Dharmaram College, inside Christ College, with good parking and plenty of trees, was an idyllic setup. The college had really good quality projectors in place. Food, tea and biscuits were good; how can I say anything bad about food which includes pineapple gojju :). Restrooms were plentiful and clean. Water was available when needed. The only challenge was the Internet, whose absence I personally thought was good riddance. Then there was the event photographer, Kushal Das, omnipresent with his gadgetry.

3. Content & people

My expectations were around learning best practices across technologies. It was wonderful to see Pythonistas around, prepping for PyCon. There were a lot of offline discussions around Chef, Puppet, Ganglia and Capistrano. I saw folks coding in Django, in a template system I could not make out, on a pure LAMP stack, or in just plain CGI.

Flipkart continuous deployment

I personally liked Flipkart's sessions: real people sharing real insights and best practices with a humility not previously seen elsewhere. Flipkart shared how they do continuous deployment, with 6+ major releases every day (I thought I saw 30). The speaker could have done with a double shot of coffee ;) ; he is a person who knows his stuff but is sort of shy about speaking up. There are people with less than an ounce of his depth who unfortunately create a much greater perception of depth; they can talk without a single shipping application.

Anyway, coming back to the content: the main takeaway was the use of Debian packages for deployment. They provision a bunch of new machines via netboot to deploy the right OS, and use Puppet for configuration. Finally, Debian packages are used to deploy the code base, dependencies, etc. This makes it easy to roll back and audit. Sometimes some things are still done by hand (schema changes). They can do rolling deployments most of the time. They use very heavy caching, and Solr for search. This session focused on how every commit results in a build and a deployment. They replicate as much of the production environment as possible into dev, something we push many customers to do, at least for DB size.

Forty-five minutes were too short to get into the details of what tests they run, how they do perf testing, how they profile, or what monitoring they use (Gomez/Keynote vs. Pingdom, etc.). Nor was there time to dig into the use of logs to raise alerts, or dynamic monitoring of resource usage across PHP, JVM, nginx, HAProxy or Varnish, and network components. As they shared later in another session, they have a home-made load balancer running Linux in HA mode with multiple network cards to tolerate 3 networks being down. Amod (thanks, Mekin) clarified a lot of questions about deployment; again, very humble, down-to-earth folks.

Flipkart performance tips

This talk by Siddarth was worth its time in gold. Flipkart uses PHP without Apache, front-ended by HAProxy, Varnish and nginx. I am assuming they use memcache. The explanation of the various options, and why they chose what they did, was well detailed.

I think/assume the use of Solr takes off a whole lot of load, and Thrift services are invoked only for real work with the DB. Since they do not store any state anywhere, near-horizontal scaling is possible at the web and services layers. They have an interesting component in Java, a threaded service/daemon that runs locally on every web server. This component is the one that makes async, parallel calls, does logging, etc. IMHO this is really the guts and the choke point of the app, and I could not really get the point of this abstraction, as they could talk from PHP to the Thrift service directly (except for the parallel calls). The availability and failover of this component was something time did not permit us to get into. Caching and performance tools across proxy, cache (static/dynamic), web server, PHP, JVM (any particular GC settings?), Solr and DB were not delved into deeply, as time was short. It would have been awesome to get into discussions around the why of templates vs. no templates, and models in PHP vs. abstracted behind a service layer. Similar discussions around the use of jQuery vs. other frameworks, and the use of a CDN, would have been great. One thing I did not see was any discussion of queues or analytics (not really part of the topic, so okay), or of NoSQL or functional languages (I was expecting some Scala vs. Clojure adoption). Very, very pragmatic and generous, almost too generous in sharing pretty much their core. Even today it is difficult to find out how Amazon scales its website, in spite of all the talk of openness. Another good thing about Flipkart was that they did not boast about using a particular "tech/lang".
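The one thing the local daemon clearly buys is the parallel fan-out, so here is a minimal sketch of that idea alone. It is in Python rather than their Java, the three `fetch_*` functions are hypothetical stand-ins for backend service calls, and nothing here reflects how Flipkart actually implemented it.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for backend service calls; each one just sleeps briefly.
def fetch_price(product_id):
    time.sleep(0.05)
    return {"price": 499}


def fetch_stock(product_id):
    time.sleep(0.05)
    return {"in_stock": True}


def fetch_reviews(product_id):
    time.sleep(0.05)
    return {"reviews": 12}


def render_data(product_id):
    """Issue all backend calls at once and merge the results for one page."""
    calls = (fetch_price, fetch_stock, fetch_reviews)
    page = {}
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(call, product_id) for call in calls]
        for future in futures:
            # A per-call timeout keeps one slow backend from stalling the page forever.
            page.update(future.result(timeout=1.0))
    return page


print(render_data("p1"))
```

Run sequentially, the three calls would take roughly the sum of their latencies; fanned out, the page waits for about the slowest one. A request-per-process PHP setup of that era could not do this easily on its own, which may be why the work was pushed into a separate threaded component.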

Practo and usage of beanstalkd

This talk by Practo co-founder Abhinav Lal went into how he made the choice in favor of beanstalkd over RabbitMQ or expensive message buses. He rightly described his usage as a job queue, where he pushes priority-based messages into a tube and has workers process those messages. He uses TTR and persistence (it is not clear whether jobs are automatically purged after some time) to pull and process messages. Using supervisord to scale out and monitor the worker processes was a neat trick he shared. It looked like he used PHP on the worker side as well as the client side, so serialization of data was no big deal. Overall, beanstalkd came across as a decent job queue that excels at being one. There is no fancy publish/subscribe, ESB or JMS infrastructure on top of it. Again, a great speaker. I had bumped into beanstalkd via delayedjob earlier, so this was good validation of "scale" vs. "usage". Again, a great value-for-money discussion.
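For readers new to the priority and TTR (time-to-run) ideas, here is a toy in-memory imitation of one tube. It is not the beanstalkd protocol or any client library: the `TinyTube` class, its method names and its defaults are invented for this sketch, and time is passed in explicitly so the behavior is easy to follow. The only things borrowed from beanstalkd are that a lower priority number means more urgent, and that a reserved job which is not finished within its TTR goes back to the ready queue.

```python
import heapq
import itertools


class TinyTube:
    """In-memory imitation of a single tube. Not the beanstalkd protocol or a client."""

    def __init__(self):
        self._ready = []        # min-heap of (priority, job_id); lower priority runs first
        self._reserved = {}     # job_id -> (deadline, priority)
        self._jobs = {}         # job_id -> (body, ttr)
        self._ids = itertools.count(1)

    def put(self, body, priority=1024, ttr=60):
        job_id = next(self._ids)
        self._jobs[job_id] = (body, ttr)
        heapq.heappush(self._ready, (priority, job_id))
        return job_id

    def reserve(self, now):
        """Hand the most urgent ready job to a worker, or None if nothing is ready."""
        self._requeue_expired(now)
        if not self._ready:
            return None
        priority, job_id = heapq.heappop(self._ready)
        body, ttr = self._jobs[job_id]
        self._reserved[job_id] = (now + ttr, priority)
        return job_id, body

    def delete(self, job_id):
        """Called by a worker once a reserved job has been processed."""
        if self._reserved.pop(job_id, None) is not None:
            del self._jobs[job_id]

    def _requeue_expired(self, now):
        # A worker that does not delete its job within TTR loses it to the ready queue.
        for job_id, (deadline, priority) in list(self._reserved.items()):
            if now >= deadline:
                del self._reserved[job_id]
                heapq.heappush(self._ready, (priority, job_id))


tube = TinyTube()
tube.put("send reminder", priority=100, ttr=5)
tube.put("send otp", priority=10, ttr=5)

first = tube.reserve(now=0)    # most urgent job; imagine this worker then crashes
second = tube.reserve(now=1)   # next job, processed normally
tube.delete(second[0])
idle = tube.reserve(now=2)     # nothing ready: the first job is still within its TTR
retried = tube.reserve(now=6)  # TTR expired, so the first job is handed out again
```

The TTR is what makes crashed workers harmless: nobody has to notice the crash, the job simply becomes available again. Pairing this with a process supervisor that restarts dead workers gives the scale-out setup described in the talk.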

Failure as an option by Vijay from zynga

This talk was delivered by a person who had the guts to say things as they are, specifically around the cloud: unbounded resources, elasticity, provisioning, etc. He basically burst a lot of myths. Importantly, he had one key piece of advice: it is not possible to know or control all the unknowns, so at least control your destiny where you can. He advised keeping a few reserved or passive instances to overcome the tyranny of the "avalanche": when a cloud provider faces downtime, all its clients try to provision more resources at once. He added other ideas around why and how to fail over completely for x of the user base rather than give a half-baked response to everybody. Now, this requires a lot of work and thinking at a lot of layers; it would be ideal to apologize only to the people whose service went down rather than show a fail whale to everybody. He suggested using HAProxy intelligently: do not fail over the entire load of 5 machines to the remaining 5 (out of 10) if those 5 machines are already at 50%+ usage. That would start a cycle of provisioning and put more load on already stressed infrastructure.
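The 5-out-of-10 example is simple arithmetic, sketched below. The function names and the 80% ceiling are my own assumptions for illustration, and the model is deliberately naive: every machine carries the same load, and load moves evenly to the survivors.

```python
def load_after_failover(total_nodes, failed_nodes, utilization):
    """Per-node utilization of the survivors if they absorb all the load evenly."""
    survivors = total_nodes - failed_nodes
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return utilization * total_nodes / survivors


def safe_to_fail_over(total_nodes, failed_nodes, utilization, ceiling=0.8):
    """True if survivors stay at or below the (assumed) utilization ceiling."""
    return load_after_failover(total_nodes, failed_nodes, utilization) <= ceiling


def servable_fraction(total_nodes, failed_nodes, utilization, ceiling=0.8):
    """Share of current demand the survivors can take without crossing the ceiling."""
    demand = utilization * total_nodes
    capacity = ceiling * (total_nodes - failed_nodes)
    return min(1.0, capacity / demand)


print(load_after_failover(10, 5, 0.5))   # survivors would need to run flat out
print(safe_to_fail_over(10, 5, 0.5))
print(servable_fraction(10, 5, 0.5))
```

With 10 machines at 50% and 5 lost, the survivors would have to run at 100%, so a blind failover is unsafe. Under the assumed ceiling they can serve about 80% of users properly, which matches the advice: serve that share well and apologize to the rest, instead of degrading everybody.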

Ravi Pratap's talk turned out to be EC2 101 rather than being true to its title of scaling. I am sure he could have done justice to that content too, as he is a great speaker with a lot of hands-on work. His explanations were crisp, and he shared one important thing: they have their own AMI, as expected. Again, time was too short for him to get into the details of scaling, as covering the basics took longer. I would expect that he can share his tips at some other venue.

The Eucalyptus talk was unfortunately a big letdown; it would have been easier to start off with a demo rather than explain the cloud again. It is certainly not cool to run down valid competition as malware.

The Python tool Fabric as a deployment option, presented by Capillary (Nigel Babu), was a completely new subject for me.

The Thrift talk by Capillary (Piyush Goel), on how they use it for communication from PHP to a Java service, was a good tutorial. The main advantage cited over Protocol Buffers was built-in RPC, which protobuf lacks.

Again, thanks to @hasgeek for conjuring up a good meet. http://phpcloud.hasgeek.in

Conducted on 9th July 2011