Suprotim and Sumit pushed me to publish an article on “Decision Making Pivots for adoption of Cloud” – this is basically the gist of the guiding principles we use to help customers migrate to the cloud. We have a few variations of the enterprise strategy for workloads like Exchange (email), SharePoint (collaboration) or CRM that need to move to the cloud. My colleague MS Anand helps customers on the private cloud adoption front, creating efficiencies out of existing infrastructure. Here is the document, which focuses on Azure and was part of the magazine: Azure Adoption – pivots to help make right decision
I just completed something else I promised Suprotim – an article comparing Hive, aimed at people who are used to SQL as the dialect for interacting with a database. Although comparing a database and Hive is not strictly an apples-to-apples comparison, I wanted to take an approach where understanding Big Data does not become a burden of learning MapReduce/HDFS and the overall Hadoop ecosystem. It is much easier to start with something very simple that we already do with a regular data store, try to do it with Hive, and then start looking at the differences. It also helps in understanding how HDFS and MapReduce address scale and availability for very large amounts of data. Although there are tools like Pig/Cascalog/Scalding/Cascading, I decided to focus on HiveQL as it is the closest to the SQL dialect, with the simple intention of not introducing too many new things simultaneously. Once the article has been out for a month in the magazine, I plan to share it here again – or you can pick it up from http://www.dotnetcurry.com/magazine/dnc-magazine-issue2.aspx (updated – 1st Sep 2012) once it comes online.
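To give a flavour of that “start with something familiar” approach before the article comes online, here is a minimal sketch of the kind of statement the comparison builds on. The table name, columns and HDFS path below are hypothetical examples, not taken from the article; the point is that the query itself reads almost exactly like SQL, while Hive compiles it down to MapReduce jobs behind the scenes.

```shell
# Run a familiar-looking DDL + aggregate query through the Hive CLI.
# Table name (page_views) and the HDFS input path are hypothetical.
hive -e "
CREATE TABLE IF NOT EXISTS page_views (user_id INT, url STRING, view_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
"
```

The differences only start to show up after this point – no UPDATE/DELETE on rows, latency measured in job time rather than milliseconds – which is exactly where the article picks up.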
And if everything goes all right – with help and a push from Vinod & Pinal – I will devote my energies toward something more useful.
Update – 1st Sep 2012 – Things I have not covered in the Hive article:
* There is a tab for Disaster Recovery on hosted Azure which can be switched on, but it is not clear, WRT the NameNode, whether the FsImage and EditLog are backed up every x minutes to another “managed” location.
* Is there a secondary namenode that the log/checkpoint gets shipped to?
* There is a secondary namenode, but executing commands against it over RDP simply hangs.
* WRT HDFS data
* Is the data snapshotted/backed up to Azure storage, taking advantage of the inherent replication there? (updated – the preferred storage is Azure storage rather than local nodes)
* WRT Hive metadata – is it backed up to a “managed” location every x hours/minutes? (no clear idea)
* If the NameNode crashes – it is not clear at present (WRT HadoopOnAzure) whether the AppFabric services inherent in Azure are utilized to identify the failure, bring the NameNode back up and use the earlier “managed” location (via the –importCheckpoint option)
* Upgrade & rollback of the underlying version will be part of HOA’s lifecycle management. The assumption here is that, at present, one version will be prevalent across tenants; upgrading individual clusters to a different version is not supported. (updated – December 2013 – upgrade to a new version is supported)
* Addition/deletion of nodes in an existing cluster (still a manual job)
* Adding incremental data (updated – the normal import process)
* Adding a Fair scheduler
* Monitoring job progress/cancellation (updated Dec 2013 – PowerShell based)
* Identifying bottlenecks in JVM/HDFS settings (completely roll your own)
* Dealing with hadoop fsck identifying bad/missing blocks and related issues (roll your own)
* Rebalancing the data (roll your own)
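For the “roll your own” items above – fsck, the importCheckpoint recovery path and rebalancing – the stock Hadoop command line is the starting point. A sketch, using Hadoop 1.x-era syntax (you would run these from the head node of the cluster; whether and how they behave on HadoopOnAzure is exactly the open question):

```shell
# Check HDFS health: walk the namespace and report bad/missing blocks,
# listing each file's blocks and their datanode locations.
hadoop fsck / -files -blocks -locations

# Rebalance data across datanodes: move block replicas until no node's
# utilization deviates from the cluster average by more than 10%.
hadoop balancer -threshold 10

# After a NameNode loss, start a replacement NameNode from the checkpoint
# held by the secondary namenode (the -importCheckpoint option mentioned
# above); requires an empty dfs.name.dir on the new node.
hadoop namenode -importCheckpoint
```

None of these require anything beyond a stock Hadoop install, which is why “roll your own” is feasible – the gap on the hosted side is scheduling, alerting and acting on their output automatically.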
Updated – 20th Sep 2012
Cloudera posted a wonderful article on using Flume, Oozie and Hive to analyze tweets: http://www.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/