Poetry and Coursera

I love poems and short stories, and pick them up from time to time when I get some solitude. Here are two poems (one by Wallace Stevens and another, in Hindi, by Bodhisattva), along with a link to a poetry course on Coursera.

Continual Conversation With A Silent Man (Wallace Stevens)

The old brown hen and the old blue sky,
Between the two we live and die–
The broken cartwheel on the hill.

As if, in the presence of the sea,
We dried our nets and mended sail
And talked of never-ending things,

Of the never-ending storm of will,
One will and many wills, and the wind,
Of many meanings in the leaves,

Brought down to one below the eaves,
Link, of that tempest, to the farm,
The chain of the turquoise hen and sky

And the wheel that broke as the cart went by.
It is not a voice that is under the eaves.
It is not speech, the sound we hear

In this conversation, but the sound
Of things and their motion: the other man,
A turquoise monster moving round.

A course on poetry is available on Coursera - http://blog.coursera.org/post/44385047250/my-coursera-experience-modern-poetry-for-a-physics

A poem about a poem by Bodhisattva (from jankipul)

कमजोर कविता
वह एक कमजोर कविता थी
कवि उसे छिपाता फिरता था
कुछ बहुत कमजोर विचार थे उसमें
कहने के लिए कुछ थोड़े शब्द थे घिसे पिटे
न अलंकार था न छंद न समास
नीरस सी जहाँ-तहाँ रूखे भाव उफनाए से दिखते थे उसमें।
उस कविता में न थे गंभीर तत्व
ऐसे ही थी जैसे बेस्वाद पापड़-चटनी-अंचार।
वह बहुत कमजोर कविता थी
जैसे असुंदर बीमार बेटी
जैसे कुरूप कर्कशा पत्नी
जैसे अपाहिज औलाद
जैसे मंद बुद्धि बैठोल पति
जैसे धूल धूसर जर्जर घर
जैसे फूटी ढिबरी
जैसे खेत ऊसर बंजर अनुर्वर
(read rest on the link above)

TechEd 2013 – Bengaluru & Pune

TechEd 2013 is here. Wow, ten years on, and I have been involved with TechEd India as a speaker and organizer. It has been a great ride, and for the last few years we have tried different things. The main theme has been inviting speakers from outside the MS background.
This year again we have
1. Rajat (ex-Vertica dev) and Siva (Greenplum query optimization guru) talking about the hosted data platform Qubole: how their customers are using the platform and finding it useful, and the changes they are making to provide scale and performance. (Yes, that is Big Data coverage; we are covering HDInsight in the data track.)
2. S. Anand, a very well-known speaker and data scientist, sharing his wisdom on how data with the right kind of visualization drives pragmatic insights in compressed time.
3. Saurabh Gupta from HFI, who takes on UX and explains the misconceptions he has seen over the years.
4. Amit Bahree & Ramkumar from Avanade, who explain their take on how to develop multi-device applications using various toolsets. (We tried really hard to get Xamarin.)
5. Mandar Kulkarni from Netmagic, who will share his experience of managing and provisioning datacenters.
6. Pete Brown, who is travelling all the way from the US to share tips on building LOB apps on the Windows 8 platform.

We have a bunch of SharePoint/O365 sessions lined up, as we have new releases on both fronts.
1. Abhisek & Aniruddh will share what has changed in the new platform and what should/can be migrated.
2. Abhisek will also be doing a lap around O365, explaining new feature sets relevant to the SI/ISV crowd.
3. Amartya from Infosys plans to share best practices in development/deployment for SharePoint, culled from years of sweat and blood. (only in Bangalore)

Then we have an Azure-related session by Sudhanshu Hate from Infosys on creating hybrid applications on the Azure platform. (only in Pune)

How can we forget the data platform?

Vinod Kumar delves into availability options in SQL Server 2012. He will go deep, as I know he was planning to write a book on the subject.

Dates -
Bengaluru – 18/19th March
Pune – 25/26th March

Website – https://india.msteched.com

What I could not achieve 

Functional language coverage :( – sadly, one of the most experienced people is in Pune, but we could not cover T&E. Clojure rocks! (BigML/Cascalog is elegant)
JavaScript – again, CSS/JS is the future for the long term, and I wish we could have covered it.
Fun stuff – Kinect-based fun stuff – again, lack of T&E.

Machine learning too was dropped, as the key person was offline :( .

Hosted Hadoop platform – why Qubole is setting the pace

Disclaimer – I do not work for Qubole. I am also not an analyst. I am just fascinated by the database spring that we have been witnessing for the last couple of years. I maintain the hope of eventual cheaper/better/faster option(s). (update) Heck, I still remember what Data General used to offer: the hardware-to-software solution. The way we used to schedule jobs, ask for quotas and, damn, even lift the disks from the storage area to the compute area. That was compute/storage on demand too :) . Sir, you want to analyze sales of quarter x? Please bring your data, schedule your job and wait at the console for your turn.

There are three kinds of approaches that vendors (new or old) have taken to Hadoop’s presence.

Traditional DB vendor  with “integration story” 

1. We will help you store/retrieve cold/processed data in/from HDFS. You can do your fancy jobs there, aggregate the data, and we can extract it back here. Our tools can help with dumping/extraction/cleaning up. Our existing engine can serve your workloads much better.

2. We can do a query across two stores – relational & HDFS (using our own query mechanism or integrated into Hive).

Traditional DB or the NewSQL vendor with “Memory is cheap”

Here the traditional DB is the dominant usage scenario, and the messaging remains: not everybody is FB/Twitter/Amazon to require these solutions.

1. We can add massive memory and still use the simple database without changes to access/store models (will support Mohan, will support notions of buffer pool/locks). This will suffice for many without sophisticated hardware (InfiniBand/storage magic) underneath.

2. The columnar access pattern dominates the workload, so let us optimize for it and compress/store those maps in memory.

3. Let us take a leap of faith and do away with the “buffer pool” and related latches/locks, but maintain parity with SQL and ACID, which developers understand.

Traditional DB vendors have challenges with

- Horizontal scale-out based on data, as partitioned data requires awkward compromises on the columns/keys.

- On the other hand, as the shared-nothing scale-out happens, maintaining developer calm by providing minimum consistency becomes an issue: pushing changes in sync to x replicas and pushing reads to replicas.
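
The partitioning compromise can be made concrete with a small Python sketch. This is only an illustration with made-up names (`shard_for`, `build_shards`, a shard count of 4), not any vendor's implementation: a lookup on the partition key touches one shard, while a filter on any other column has to visit every shard.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a partition key to a shard number with a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def build_shards(rows, partition_key, num_shards=NUM_SHARDS):
    """Distribute rows (dicts) across shards by the partition key."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row[partition_key], num_shards)].append(row)
    return shards


def lookup_by_partition_key(shards, partition_key, value):
    """Only the shard that owns the key is scanned."""
    shard = shards[shard_for(value, len(shards))]
    hits = [r for r in shard if r[partition_key] == value]
    return hits, 1  # (matching rows, number of shards touched)


def lookup_by_other_column(shards, column, value):
    """Scatter-gather: every shard has to be scanned."""
    hits = [r for shard in shards for r in shard if r[column] == value]
    return hits, len(shards)


rows = [
    {"customer_id": "c1", "region": "south", "amount": 10},
    {"customer_id": "c2", "region": "north", "amount": 20},
    {"customer_id": "c3", "region": "south", "amount": 30},
    {"customer_id": "c2", "region": "south", "amount": 5},
]
shards = build_shards(rows, "customer_id")
by_key, key_shards_touched = lookup_by_partition_key(shards, "customer_id", "c2")
by_region, region_shards_touched = lookup_by_other_column(shards, "region", "south")
```

Whichever column is chosen as the partition key, queries on the other columns pay the scatter-gather cost. That is the awkward compromise.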

Pure Hadoop based vendors 

- Get a more efficient filesystem, add a memory-based cache, add something more than just the MR pattern, compression at storage, HA (fixing it in innovative ways), improving operations (overcoming accidental deletes/differential backups/replication across DCs?)

- Push changes which will benefit everybody into the main trunk in the public repository (YARN, for instance, or HBase)

Hosted Hadoop & services

1. Will help you create clusters via UX/command line, change settings, and monitor conditions, based on a distribution X.

2. Will really go ahead and fix/add things which are missing and make the hosted platform more appealing.

3. Add security features (authentication/storage).

Qubole belongs to the 2nd type of hosted vendors. Why did they attract so much love and respect from me personally? A vendor who goes and creates the following deserves all the kudos.

1. A way to create quotas, Kill Mode and TestMode for the data extraction/massaging world. They know what happens in the real world. (Mistakes/learning on the job/bad data most of the time)

2. Missing features – upsert (how important is that for data movement), moving data out of Hive partitions (again solving practical issues)

3. Really taking advantage of the cloud vendor’s abilities – add/test hybrid/spot instances (bidding/timeout for the instances/percentage of spot instances)
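
Why upsert matters for data movement can be shown with a small Python sketch. This is a plain in-memory illustration with made-up names (`upsert`, a dict acting as the table), not Qubole's or Hive's implementation. Incoming rows either update the existing row with the same key or get inserted, so replaying a load does not create duplicates.

```python
def upsert(table, incoming, key):
    """Merge incoming rows into table (a dict indexed by the key column).

    Returns a tuple of (inserted, updated) counts.
    """
    inserted = updated = 0
    for row in incoming:
        k = row[key]
        if k in table:
            # Existing key: overwrite the changed columns, keep the rest.
            table[k] = {**table[k], **row}
            updated += 1
        else:
            table[k] = dict(row)
            inserted += 1
    return inserted, updated


table = {}
batch = [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 7}]

first_load = upsert(table, batch, "id")
replayed_load = upsert(table, batch, "id")  # same batch again, no duplicates
late_data = upsert(table, [{"id": 2, "clicks": 9}, {"id": 3, "clicks": 1}], "id")
```

Without upsert, the replayed batch would need a delete-and-reload of the whole partition or a separate de-duplication pass.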

Unsolicited advice

-If they add an on-premise option to work with a traditional private cloud provider, this will end the search for other options.

-Working with ISVs to bundle it is an altogether different ballgame.

Another disclaimer – These are my own opinions as a humble data person and do not reflect those of my employer. I just look at what is delivered/documented in the public domain.

(update) This does not mean pure Hadoop vendors are not ahead in fixing enterprise issues/meeting requests; actually they are far ahead. It is the hosted platform which is the point of discussion in this 10 min post. Some of the pure Hadoop distribution vendors have the tougher task of thinking through what remains inside vs what is available outside. Training/mentoring/competing with existing enterprise DB/app sales can’t be a long-term goal when people are focusing on “solutions” - http://www.theregister.co.uk/2012/11/11/police_ibm_analysis_crime_prevention/. This post also bypasses excellent MPP systems and the innovative ways they integrate with HDFS or the Hadoop ecosystem, as I have never been able to look at them (forget access).

नको नको रे पावसा …!! – इंदिरा संत

There are many days when I long for rain. I believe I can be happiest in a clouded place, one where clouds of different shapes and colors run riot in the sky. That also means I love rains, the ones who listen to you. Here’s a beautiful poem by Indira Sant. She converses with the rain: she asks it not to dirty her backyard, and to bring her lover back from a distance with care, using lightning :) .

नको नको रे पावसा …!!

 नको नको रे पावसा
असा अवेळी धिंगाणा
घर माझे चंदमौळी
आणि दारात सायली;
 
नको नाचू तडातडा
असा कौलारावरन,
तांब सतेलीपातेली
आणू भांडी मी कोठून?
 
नको करू झोंबाझोंबी
माझी नाजूक वेलण,
नको टाकू फुलमाळ
अशी मातीत लोटून;
 
आडदांडा नको येउं
झेपावत दारातून,
माझे नेसूचे जुनेरे
नको टांकू भिजवून;
 
किती सोसले मी तुझे
माझे एवढे ऐक ना,
वाटेवरी माझा सखा
तयाला माघारी आण ना;
 
वेशीपुढे आठ कोस
जा रे आडवा धावत,
विजेबा, कडाडून
मागे फिरव पंथस्थ;
 
आणि पावसा राजसा
नीट आण सांभाळून,
घाल कितीही धिंगाणा
मग मुळी न बोलेन;
 
पितळेची लोटीवाटी
तुझ्यासाठी मी मांडीन,
माझ्या सख्याच्या डोळयांत
तुझ्या विजेला पाजीन;
 
नको नको रे पावसा
असा अवेळी धिंगाणा
घर माझे चंदमौळी
आणि दारात सायली….
 
इंदिरा संत

Respecting a schemaless(dynamic schema) store’s strength

Recently I was contacted by multiple folks using a NoSQL/data-structure store for guidance about daily operational issues, bad data and a lot of finger-pointing.

In one of the cases, the main product manages click impressions and tries to give “perspective” on views of the ad. Unfortunately this is a “mixture of solutions”, each with its own application life cycle and ownership.

Issues

1. Mismatched data types
Incident – Application 2 started storing arrays where a string was expected. It must have been because of a change in another app when the requirement changed from a “single value” to a multi-valued item. This error was caught intermittently in a few filter queries and in reporting. It was difficult to catch because it would happen only for certain filters. The other impact is increased storage: if this information was/is small, a simpler type could have represented the data.

2. Subtle change in the name of a document element, resulting in missing data
Incident – Some application started reporting incomplete data. The cause was a wrong name change, which created a new field and resulted in a bigger bug.

3. Change in data type precision
Incident – suddenly values did not match up in reconciliation; a float had been changed to an integer (gasp).

In a document store, by its very nature, it is very easy to make simple mistakes

e.g.,

> db.test.save( { x: 1 } )
> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }

// Now store a string for the same key without an issue
> db.test.save( { x: "str" } )
> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }
{ "_id" : ObjectId("508b3d791a1d9f06773cb598"), "x" : "str" }

// A mistyped key ("id" instead of "_id") silently matches nothing
> db.campaigndetails.save( { _id: "xkcd", "details" : { campaign_id: 1, inventory_id: 2, audience_attr: 3 }, rev: 10 } )
> db.campaigndetails.find( { "_id" : "xkcd" } )
{ "_id" : "xkcd", "details" : { "campaign_id" : 1, "inventory_id" : 2, "audience_attr" : 3 }, "rev" : 10 }
> db.campaigndetails.find( { "id" : "xkcd" } )
Is that the store’s issue? Nope. It is the user’s (in this case, the developer’s) responsibility to take the right measures so that these situations are avoided. What are the other repercussions? Let us say there is a “flow of data” to another place. All the ETL/messaging kinds of processing have a greater chance of failing if a type changes or an extra field gets added.
How can one overcome it?

1. Use common data repository access code
2. Use common data repository validation code
3. Use common messaging transformation code
4. Use basic layout verification code
5. Since many of the document databases do not support joins, folks create either embedded/hierarchical or linked structures. An embedded schema has implications when the “value” of a related item changes, and many issues show up here.

6. Reconciliation of the catalog (schema) with data values – this can be run once in a while to look for any deviation from the agreed-upon schema.
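
The validation and reconciliation ideas in the list above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `EXPECTED_SCHEMA`, its field names and the sample documents are hypothetical, and the documents are plain dicts rather than results fetched from a real store. It flags the three kinds of incidents described earlier: a changed type, a renamed field, and a float that became an integer.

```python
# Hypothetical agreed-upon layout: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "_id": str,
    "campaign_id": int,
    "inventory_id": int,
    "bid": float,
    "keyword": str,
}


def check_document(doc, schema=EXPECTED_SCHEMA):
    """Return a list of deviations from the schema for one document."""
    problems = []
    for field, expected in schema.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
            continue
        actual = type(doc[field])
        # Strict comparison, so an int stored where a float is expected
        # (a precision change) is reported as well.
        if actual is not expected:
            problems.append(
                f"{field}: expected {expected.__name__}, got {actual.__name__}"
            )
    for field in doc:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def reconcile(docs, schema=EXPECTED_SCHEMA):
    """Map document _id -> list of deviations, skipping clean documents."""
    report = {}
    for doc in docs:
        problems = check_document(doc, schema)
        if problems:
            report[doc.get("_id", "<no _id>")] = problems
    return report


docs = [
    {"_id": "a1", "campaign_id": 1, "inventory_id": 2, "bid": 0.25, "keyword": "shoes"},
    {"_id": "a2", "campaign_id": 1, "inventory_id": 2, "bid": 0.25, "keyword": ["shoes", "boots"]},
    {"_id": "a3", "campaign_id": 1, "inventoryId": 2, "bid": 0.25, "keyword": "shoes"},
    {"_id": "a4", "campaign_id": 1, "inventory_id": 2, "bid": 0, "keyword": "shoes"},
]
report = reconcile(docs)
```

The same `check_document` function could sit in the common data access code (checking on write) and in a periodic job (checking what is already stored), so every application shares one definition of the layout.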

Another datastructure store example showing dynamic schema

redis 127.0.0.1:6379> set car.make "Ford"
OK
redis 127.0.0.1:6379> get car.make
"Ford"
redis 127.0.0.1:6379> get car.test
(nil)
redis 127.0.0.1:6379> get car
(nil)
redis 127.0.0.1:6379> get car.Make
(nil)
redis 127.0.0.1:6379> set u '{ "n":"f_name"}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\":\"f_name\"}"
redis 127.0.0.1:6379> set u '{ "n": 1}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\": 1}"

Sadly, people are throwing the baby out with the bathwater in their hurry, forgetting that somebody will need to maintain the data and derive value from it. Data is sacred and its sanctity should always be maintained.

There is an issue of schema design which has an influence on some of the above issues. Rather than blindly “embedding” documents, a simple check should be done for “immutability”. If there is a possibility of change, a lot of data gets duplicated as it grows, and it is better to keep it separate. Another suggestion would be to keep the names extra-small.
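
A rough Python sketch of the “immutability” check, with made-up data (an advertiser name embedded in 1000 impression documents). It assumes plain dicts stand in for stored documents, and that the store keeps field names inside every document, which is why short names save space. When an embedded value changes, every copy has to be rewritten; a linked value is changed once.

```python
import json

# Embedded: the advertiser's name is copied into every impression document.
embedded = [
    {"_id": i, "advertiser": {"id": "adv1", "name": "Acme"}}
    for i in range(1000)
]

# Linked: impressions keep only a reference; the name lives in one place.
advertisers = {"adv1": {"name": "Acme"}}
linked = [{"_id": i, "advertiser_id": "adv1"} for i in range(1000)]


def rename_embedded(docs, advertiser_id, new_name):
    """Rename by rewriting every document that embeds the advertiser."""
    touched = 0
    for doc in docs:
        if doc["advertiser"]["id"] == advertiser_id:
            doc["advertiser"]["name"] = new_name
            touched += 1
    return touched


def rename_linked(catalog, advertiser_id, new_name):
    """Rename by updating the single catalog entry."""
    catalog[advertiser_id]["name"] = new_name
    return 1


embedded_writes = rename_embedded(embedded, "adv1", "Acme Corp")
linked_writes = rename_linked(advertisers, "adv1", "Acme Corp")

# Field names travel with every document, so long names add up.
long_doc = json.dumps({"audience_attribute": 3})
short_doc = json.dumps({"aa": 3})
bytes_saved_per_doc = len(long_doc) - len(short_doc)
```

If one of the embedded rewrites is missed or fails halfway, the copies disagree with each other, which is exactly the kind of bad data described above.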

 

Update – Article on Cloud and Hive

Suprotim and Sumit pushed me to publish an article on “Decision Making Pivots for adoption of Cloud”. This is basically the gist of the guiding principles we use to help customers migrate to the cloud. We have a few variations for enterprise strategy where workloads like Exchange (email), SharePoint (collaboration) or CRM need to move to the cloud. We have a colleague, MS Anand, who helps customers on the private cloud adoption front to create efficiencies out of existing infrastructure. Here is the document, which focuses on Azure and was part of the magazine: Azure Adoption – pivots to help make right decision

I just completed something else I promised Suprotim: an article on Hive for people who are used to SQL as the dialect for interacting with a database. A comparison of a database and Hive is not strictly an apples-to-apples comparison, but I wanted to take an approach where understanding Big Data does not become a burden of learning MapReduce/HDFS and the overall Hadoop ecosystem. It is much easier to start with something very simple that we do with a regular data store, try to do it with Hive, and then start looking at the differences. It also helps to understand why HDFS and MapReduce help in addressing scale and availability for very large amounts of data. Although there are tools like Pig/Cascalog/Scalding/Cascading, I decided to focus on HiveQL as it is the closest to the SQL dialect, with the simple intention of not introducing many new things simultaneously. Once the article has been out for a month in the magazine, I plan to share it here again, or you can pick it up from http://www.dotnetcurry.com/magazine/dnc-magazine-issue2.aspx (updated – 1st Sep 2012) once it comes online.

And if everything goes all right, with help and a push from Vinod & Pinal, I will devote my energies toward something more useful.
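
To give a feel for what sits underneath a SQL-like query, here is a conceptual, single-process Python sketch of how something like `SELECT page, COUNT(*) FROM clicks GROUP BY page` breaks down into map, shuffle and reduce steps. The `clicks` data and the function names are made up for illustration, and this is not how Hive actually plans or executes a query. The point is only that each step works on independent pieces of data, which is what lets the real thing spread across many nodes.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, 1) for each record, like a mapper scanning one file split."""
    for record in records:
        yield record["page"], 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, like a reducer handling its keys."""
    return {key: sum(values) for key, values in grouped.items()}


clicks = [
    {"page": "/home", "user": "u1"},
    {"page": "/ads", "user": "u2"},
    {"page": "/home", "user": "u3"},
]
counts = reduce_phase(shuffle_phase(map_phase(clicks)))
```

With HiveQL the person writing the query only states the GROUP BY; the map and reduce steps are generated for them, which is why it is a gentler starting point than writing MapReduce jobs by hand.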

Update – 1st Sep 2012 – Things I have not covered in the Hive article -

-There is a tab for DisasterRecover on hosted Azure. When it is switched on, the following is not clear:
-----WRT NameNode
-----------Whether FsImage and EditLogs are backed up every x minutes to another “managed” location.
-----Is there a secondary NameNode where the log/checkpoint gets shared?
-----------There is a secondary NameNode, but executing a command against it over RDP simply hangs.
-----WRT HDFS data
-----------Is it snapshotted/backed up to Azure storage, taking advantage of the inherent replication there?
-----WRT Hive metadata
-----------Is it backed up to a “managed” location every x hours/minutes?

-If the NameNode crashes – not clear now (WRT HadoopOnAzure)
-----------Whether AppFabric services inherent in Azure are utilized to identify the crash and bring it up, using the earlier “managed” location (using the -importCheckpoint option)

-Upgrade & rollback of the underlying version will be part of HOA’s lifecycle management. The assumption here is that at present one version will be prevalent across tenants. Upgrading individual clusters to different versions is not supported.

- Addition/deletion of nodes into existing cluster

- Adding incremental data

- Adding a Fair scheduler

- Monitoring the job progress/cancellation

- Identifying bottlenecks  in JVM/hdfs settings

- Dealing with hadoop fsck identifying bad/missing blocks and related issues

- Rebalancing

 

Updated – 20th Sep 2012

Cloudera posted a wonderful article on using Flume, Oozie and Hive to analyze tweets. http://www.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

Hive article & Azure adoption article

I recently wrote an article for DotNetCurry, putting down my experiences with Azure migration pivots.

I will be jotting down my experience and the basics of HadoopOnAzure in the same way.

I intend to cover how to look at Hadoop from a database user’s point of view. It will cover storage, querying and loading of data using Apache projects such as Pig/Hive. We will try not to get down to MapReduce jobs as a starting point, as they tend to cloud the judgement on adoption for administrators/developers familiar with the SQL dialect, who depend on it to define schemas and to query. We will cover availability (the strongest point of HDFS), scalability (HDFS makes adding nodes easy) and querying (Pig/Hive). This article will not cover machine learning, performance tuning of HDFS/MR jobs, installation, or management/monitoring.

I will try to publish a link to the Word document with errata for the Azure article here one of these days.