10 things I wished my datastore would do (updated: Is DocumentDB my savior?)

We generally use datastores to ingest data and try to make some meaning out of it by means of reports and analytics. Over the years we have had to make decisions about adopting different stores for “different” workloads.

The simplest is analysis – where we offload to pre-aggregated values with either columnar or distributed engines to scale out over the volume of data. We have also seen the rise of stores whose storage layout is friendly to range queries. Then we have some which allow very fast lookups, maturing into doing aggregations on the fly. We have also seen the use of data-structure stores – the hash-table-inspired designs versus the ones which don sophisticated avatars (gossip, vector clocks, bloom filters, LSM trees).

That other store which pushed compute to storage is undergoing a massive transformation to adopt streaming and regular OLTP (hopefully), apart from its usual data-reservoir image. Then we have the framework-based plug-and-play systems doing all kinds of sophisticated streaming and other wizardry.

Many of the stores require extensive knowledge of their internals – how data is laid out, techniques for using the right data types, how data should be queried, issues of availability – and force decisions which are generally not “understandable” to the business stakeholders. When things go wrong, the tools range from a bare log error to the actual “path of the execution” of the query. At present there is a lot of ceremony around capacity management and around how data changes are logged and pushed to another location. This much detail is a great “permanent job guarantee” but does not add a lot of long-term value for the business.

2014-22nd Aug Update – DocumentDB seems to take away most of the pain – http://azure.microsoft.com/en-us/documentation/services/documentdb/

  1. Take away my schema design issues as much as it can

What do I mean by it? Whether it is a traditional relational database or a new-generation NoSQL store, one has to think through either the ingestion pattern or the query pattern to design the store’s representation of entities. This is by nature a productivity killer and creates an impedance mismatch between the storage representation and the application’s representation of the entities.

Update (2014-22nd Aug) – DocumentDB – I need to test with a good amount of data and query patterns, but it looks like – with auto-indexing and SSDs – we are on our way here.

  2. Take away my index planning issues

This is another of those areas where a lot of heartburn takes place, as a lot of the store’s innards are exposed through its implementation. If done completely automagically, this would be a great time-saver: just look at the queries and either create the required indexes or drop them. A lot of performance regressions are introduced as small changes accumulate in the application and get introduced at the database level.

Update (2014-22nd Aug) – DocumentDB does it automatically; it has indexes on everything. It only requires me to drop what I do not need. Thank you.

  3. Make scale out/up easier

Again, this is exposed to the application designer in terms of which entities should be sharded vertically or horizontally. This ties back to point 1 in terms of ingestion or query patterns. It makes or breaks the application’s performance and has an impact on the evolution of the application.

Update (2014-22nd Aug) – DocumentDB makes it a no-brainer again. Scale-out is done in capacity units (CUs). I need to understand how the sharding is done.

  4. Make “adoption” easier by using existing declarative mechanisms for interaction. Today one has to choose the store’s way rather than good old DDL/DML, which is at least 90% the same across systems. This induces fatigue for ISVs and larger enterprises who look at the cost of “migration back and forth”. Declarative mechanisms have this sense of lullaby to calm the mind, and we indulge in scale-up first followed by scale-out (painful for the application).

Make sure the majority of the clients are on par with each other. We may not need something immediately for Rust, but at least ensure PHP, Java, .NET native and derived languages have robust enough interfaces.

Make it easier to “extract” my data in case I need to move out. Yes, I know this is the option least likely to get resources spent on it, but it is super-essential and provides trust for the long term.

Lay out the roadmap in simple terms – where you are moving – so that I do not spend time on activities which will become part of the offering.

Lay out in simple terms where you have seen people having issues or making wrong choices, and share the workarounds. Transparency is the key. If the store is not a good place for doing the latest “x/y” kind of work – share that and we will move on.

Update (2014-22nd Aug) – DocumentDB provides a SQL interface!
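
Just to make the point concrete, here is roughly what that schema-free-plus-SQL combination looks like from Python. This is a sketch, assuming the early pydocumentdb client (DocumentClient/CreateDocument/QueryDocuments); the endpoint, key and collection link are placeholders, so treat the exact calls as illustrative rather than official guidance.

# Sketch only - assumes the early pydocumentdb client; endpoint, key and links are placeholders.
import pydocumentdb.document_client as document_client

client = document_client.DocumentClient(
    'https://myaccount.documents.azure.com:443/',   # hypothetical account endpoint
    {'masterKey': '<master-key>'})

coll_link = 'dbs/addb/colls/campaigns'              # hypothetical collection link

# No schema or index DDL before storing a document.
client.CreateDocument(coll_link, {'id': 'c1', 'city': 'Pune', 'clicks': 42})

# A plain declarative SQL-ish query over the JSON documents.
for doc in client.QueryDocuments(
        coll_link, 'SELECT c.id, c.clicks FROM c WHERE c.city = "Pune"'):
    print(doc)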

  5. Do not make choosing the hardware a career-limiting move. We all know stores like memory, but persistence is key for trust. SSD/HDD, CPU/core, virtualization impact – way too many moving choices to make. Make the 70-90% scenarios simple to decide. I can understand some workloads require a lot of memory, or only memory – but do not present a swarm of choices. Do not tie down to specific brands of storage or networking which we cannot live to see after a few years.

In the hosted world pricing has become crazier – lay out in simple-to-understand terms how costing is done. In a way, licensing by cores/CPU was great because I did not have to think much and pretty much over-provisioned, or did a performance test and moved on.

Update (2014-22nd Aug) – DocumentDB again simplifies the discussion: it is SSD backed and pricing is very straightforward – requests, not separate charges for reads, writes or indexed collections.

  6. Resolve HA/DR in a reasonable manner. Provide a simple guide to understanding the hosted vs host-your-own worlds. Share in a clear manner how clients should connect and fail over. We understand distributed systems are hard, and if the store supports the distributed world – help us navigate the impact and choices in simple layman’s terms, or in terms we are already aware of.

If there is an impact on consistency – please let us know. Some of us care more about it than others. Eventual consistency is great, but having to say “we are waiting for logs to get applied, so the reports are not factual yet” is not something I am gung-ho about.

Update (2014-22nd Aug) – DocumentDB – it looks like within the local DC it is highly available. Assuming cross-DC DR is on the radar. DocumentDB shares the available consistency levels clearly.

  7. Share clearly how monitoring is done for the infrastructure in both the hosted and host-your-own cases. Share a template for “monitor these always, and take these z actions” – a sort of literal rulebook which again makes adoption easier.

Update (2014-22nd Aug) – DocumentDB provides out-of-the-box monitoring; I need to see the template or the two things to monitor – I am guessing operation latency is one and size is the other. I need to think through the scale-out unit. I am sure as more people push, we will be in a better place.

  8. Share how data at rest and data in transport can be secured and audited in a simple fashion. For the last piece – even if actions are just tracked – we will have a simpler life.

Update (2014-22nd Aug) – DocumentDB – it looks like admin/user permissions are separate. Securing stored data is still the end developer’s responsibility.

  9. Share a simple guide for operations and day-to-day maintenance. This would be a life saver in terms of the x things to look out for, doing backups, doing checks, how to do HA/DR checks, and performance issue drilldowns – normally part of the data head’s responsibility. Do we look out for unbalanced usage of the environment? Is there some resource which is getting squeezed? What should we do in those cases?

Update (2014-22nd Aug) – DocumentDB – it looks like cases where you need older data, because a user deleted something inadvertently, are something users can push for.

Points 1-4 make adoption easier and the latter ones help in continued use.


No Not DataScience just Data Analysis

Over the last 6+ years we have worked with various folks who wanted to learn more from data. This has been more of a learning experience for us.

1. Subsidized items beneficiaries – This is a very big initiative with the potential for pilfering and multiple entitlements. We focused on multiple entitlements using the available digital information.
– missing addresses
– straightforward same-household addresses
– wrong/unverified addresses with missing documents
– the same person’s name spelled differently, and related persons’ information spelled slightly differently
– having a “presence” across multiple locations far apart
– missing biometric information where it was required
– corrupted biometric data
– missing “supporting” documents

Most of the issues of dubious/missing/questionable addresses and documents indicate issues at various levels (acceptance, ingestion, approval).

2. Subsidized healthcare data
This enables people to take care of critical health issues in a subsidized fashion. We found a lot of obvious data issues:
– plastic surgery repeated for different body parts for the same folks over the years
– people delivering kids in short periods
– certain districts filing a lot more claims overall for surgeries (u, burns, additional stents)
– stays in the ICU for neuro but medicines for something else
– stays for Whipples (an oncology surgery) of any kind, and increased mastectomies of any kind, without district data showing an increase. Maybe it is just a coincidence.
– ureteric reimplantations and paediatric acute intestinal obstruction larger than others

3. Elector data
Challenges here range from missing supporting data to duplicate information. The duplicates, or just the findings, were very interesting:
– people living in temples (sadhus are apparently exempt) and schools
– multiple families living across various parts of the state (labour on the move)
– people thinking multiple voter-id cards help to take advantage of some government schemes like ration/subsidized food, or just as a backup in case one is lost
– a woman married to 4 people … (possible in certain tribal locations)
– people with various versions of a name (first, name, family) at the same address, with a little variation of age thrown in too

4. Non-performing assets in lending firms
This sort of bubbled up when the core-banking effort took place and a lot of database “constraints” had to be loosened to enable uploading in some places.
– This is reflected in a lot of accounts with very little substantiating documentation, which turn into NPAs over time.
– It is especially bad for the co-operative agencies, where governance is very weak.

This was the one case where we used simple classification/clustering mechanisms to simplify our analysis.

5. Rental cab agency
This one was unique in terms of “cost” control measures. One particular trip always used to consume more fuel than normal transport. It was found that cab drivers congregate outside the expensive parking to avoid paying for it, and thus end up using more fuel to come in and pick up customers. Certain locations/times also always had bad feedback in terms of response time – the reason being drivers located far away with cheaper/no parking, or having food/rest in a cheaper location.

At times I would have loved to throw the data at a black box which could throw back questions and beautiful answers. Honestly, more time was spent getting the data, cleaning it, and re-entering missing data (e.g., surgery description differing from type). Later on, simple grouping/sum/avg/median (stats) kinds of exploration threw up most of the information that we found.
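
To give a flavour of that exploration, here is the kind of pandas grouping that surfaced most of the findings above; the claims file and column names are made up for the illustration.

import pandas as pd

# Hypothetical claims extract; the file and column names are illustrative only.
claims = pd.read_csv('claims.csv', parse_dates=['admission_date'])

# Simple grouping/sum/avg/median style exploration per district.
by_district = claims.groupby('district')['claim_amount'].agg(['count', 'sum', 'mean', 'median'])
print(by_district.sort_values('count', ascending=False).head(10))

# The same idea flags repeated procedures for the same beneficiary.
repeats = claims.groupby(['beneficiary_id', 'procedure'])['claim_id'].count()
print(repeats[repeats > 1].sort_values(ascending=False).head(10))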


NoSql Stores and aggregations

In the normal DB world we are comfortable running queries to create statistics (sum/min/max/avg) and some percentiles. To achieve this efficiently we climb into the pre-aggregated world of OLAP across dimensions. If we need some kind of ranges/histograms/filters, we try to anticipate them, provide UI elements, and again push the queries down to the datastore. With the bunch of in-memory columnar storages, we try to do them on the fly. With MPP systems we are comfortable doing it when required.

Over time the need has come up to create aggregations in a declarative manner.

The aggregations approach in ElasticSearch is a pretty nice addition. ES runs them across the cluster.

In a way, yes, ES aggregations too are queries of the search kind, but the declarative model makes them easier to fathom.

The top_hits agg is coming in 1.3. Parent/child support is not there yet.

It definitely looks like a little more than facets – composability is the key.
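
A rough sketch of that composability using the Python elasticsearch client (the index and field names are invented for the example): a bucket aggregation with metric sub-aggregations is assembled declaratively and ES fans it out across the cluster.

from elasticsearch import Elasticsearch

es = Elasticsearch()   # assumes a local node; index and fields below are illustrative

body = {
    "size": 0,
    "aggs": {
        "by_region": {                              # bucket aggregation...
            "terms": {"field": "region"},
            "aggs": {                               # ...composed with metrics per bucket
                "avg_amount": {"avg": {"field": "amount"}},
                "amount_pct": {"percentiles": {"field": "amount"}}
            }
        }
    }
}

result = es.search(index="sales", body=body)
for bucket in result["aggregations"]["by_region"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["avg_amount"]["value"])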

With Cassandra you do the extra work while writing. But generally it is not composable and can only do numeric functions. This requirement is tracked at a high level here.
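
For contrast, a sketch of the “do the work at write time” style with the DataStax Python driver and a counter column (keyspace and table names are made up, and the keyspace is assumed to exist): the aggregate is maintained on every write instead of being composed at query time.

from cassandra.cluster import Cluster

# Assumes a local node and an existing 'analytics' keyspace (both illustrative).
session = Cluster(['127.0.0.1']).connect('analytics')

# Counter table: the "aggregation" is baked into the schema up front.
session.execute("""
    CREATE TABLE IF NOT EXISTS page_views_by_day (
        page text, day text, views counter,
        PRIMARY KEY (page, day))""")

# Every write updates the pre-aggregated numeric value...
session.execute(
    "UPDATE page_views_by_day SET views = views + 1 WHERE page = %s AND day = %s",
    ('/home', '2014-07-01'))

# ...and reads can only hand back what was pre-computed; no ad-hoc composition.
for row in session.execute(
        "SELECT views FROM page_views_by_day WHERE page = %s AND day = %s",
        ('/home', '2014-07-01')):
    print(row.views)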

Database engines sort of gave this field away by quoting standards and issues of performance/consistency rather than creating a decent enough declarative mechanism. 90% of queries in the DB world would become simpler. The materialized view was the last great innovation and we stopped there. Product designers/engineers should be made to look at the queries people need to write to get simple things done.

 


GIDS 2014 – Learning from Data – A busy SW professional’s guide to Machine learning

Talk slide can be found here – slideshare, speakerdeck [updated 25th Apr 2014]

For the last couple of years we have been helping customers analyze data to find insights which are not normally found through reports/KPIs. I must extend gratitude here – first to Anand S, whose simple visual analytics helped me move forward. For years I have read Tufte and other books, but mostly from a usability angle. With D3.js, ggplot2 and the wonderful pandas/scikit and R tool sets, life has become simpler for quickly cleaning and analyzing data. Thanks, Anand, for sharing your stories.

Another is my team-mate Vinod, who encouraged me to share rather than try to be perfect, as he saw me struggle and learn through the last 3 years with various engagements as a cohort for customers. We realized active learning, reinforcement learning, bramble forest, NDCG, the GINI coefficient and the humongous maths/algebra/statistics are all important. Algorithm choice is important. But we spent more time cleaning/organizing the data, and more time understanding (and getting frustrated about) why the data was not telling us something. Also thanks to Pinal for extending the invite and pushing.

Others are the tools/discussion lists – Wise.io, BigML, SkyTree, R, scikit, pandas, NumPy, Weka and the Vowpal Wabbit folks. The folks at Cloudera trying to integrate that English firm Myrrix(?), that database which does all the approximation, and all the small startups (import.io, mortardata, sumologic, to everybody else).

Our own toolset in SQL Server was very dense and difficult to adopt – required way too much ceremony.

It is important to create a simpler way to understand the path into the field, to de-mystify it so that the rest of the travellers on this journey do not have the issues that we had. With that intention I am presenting at GIDS 2014 Bangalore.

I am excited as ever because we get to meet a different audience whose expectations are completely different.

Event Location: J. N. Tata Auditorium
National Science Symposium Complex (NSSC)
Sir C.V.Raman Avenue, Bangalore, India

The complete schedule is published here.

My talk is on 25th April 2014.

Time: 11:40-12:25

This session is targeted at folks who are curious about machine learning and want to get the gist by looking at examples rather than dry theory. It will be a crisp presentation which takes various datasets and uses a bunch of tools. The intention here is to share a way to comprehend, at a high level, what is involved in machine learning. Since the ground is very vast, this session will focus on applied usage of machine learning with demos using Excel, R, scikit and others. You will walk out knowing what it means to create a model using simple algorithms and to evaluate that model. The idea is to simplify the topic and create enough interest so that attendees can go and follow up on the topic on their own using their favourite tool.
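
For anyone who wants a preview of the “create a model and evaluate it” part, the core of the demo boils down to a few lines of scikit-learn along these lines (using a bundled toy dataset rather than the talk’s datasets):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A bundled toy dataset; in practice most of the effort goes into cleaning your own data.
X, y = load_iris(return_X_y=True)

# Hold out a test set so the evaluation is honest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a simple, explainable model...
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# ...and evaluate it on data the model has never seen.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))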

I am also hoping I will get to bump into familiar friends (Pinal Dave, Amit Bahree, Sunil, Praveen, Balmukund) and hopefully Erick H and the whole Solr gang, Siva (qubole), Regunath (Aadhar fame), Venkat.


High Availability options for Databases(postgres/oracle/sqlserver/mysql) on Azure

This post was created after having worked with multiple customers requesting information about high availability options on Azure for different databases. In general, high availability has to be thought through from the ground up for a system: power, gateway access, internal redundant network paths, hardware (disk/CPU/machine failures) – everything has to be considered. Since Azure takes care of the utilities (power, network paths, etc.), we need to focus on local HA within the datacenter and DR across datacenters to on-premise or another location.

Most folks are aware of, and have used, some kind of cluster-based service which provides local failover. For example, disk issues are taken care of by storage systems and their respective RAID levels. Some of the cluster services also provide load balancing of requests, while others redirect reads to secondaries. Their inner workings are not the focus of this discussion. We are also going to assume client-side XA transactions are generally not adopted, nor a great idea.

Traditionally, clustering technology and a good SAN were required to provide local HA for a database. Cloud platforms have created a level field, obsoleting the requirement for expensive SANs and heavy cluster requirements (earlier, in some databases, the machine configurations had to be exactly the same, etc.).

(Figure: read-replica setup)

Conceptually, most of the relational databases faithfully follow the picture above. There are various mechanisms to synchronize databases, but some databases allow seamless failover to a secondary, read replicas, and client connection management. In a pure cloud setup within one data-center, one will try to synchronize transactions through a log push/pull mechanism between two local instances, and take the log and apply it to a secondary in another location for disaster recovery purposes. This is very similar to the on-premise setup of these databases.

With SQL Server, AlwaysOn on Azure is ready for use at present. Failover for clients is automatic, and the secondary takes over from the primary since the databases are in sync.

With Oracle, at present Active Data Guard, running in pretty much the same way as SQL Server, is the preferred path. At present the SCAN feature (which is part of RAC) with managed failover is not present in ADG. For SQL Server folks, SCAN is a concept similar to the Listener in AlwaysOn. This does mean there is an impact on RTO. There is also the GoldenGate option for folks requiring that kind of functionality (basically multi-master). There are a bunch of features in the latest Oracle databases to help automatic client failover, but applications need to take those into account.

PostgreSQL too has choices for HA (I have not tested streaming). Admittedly it is a little low level in terms of precautions – e.g., requiring adjustment of log file sizes (wal_keep_segments) for streaming. Edit – And no, there is no suggestion to go off and try a custom cluster solution or pgpool/pgbouncer, as clouds have intolerance for holding IP addresses. PgPool is very expansive and solves too many problems; this basically means one has many things to look at when there are issues. For .NET/SQL folks – think of PgPool as akin to SQL AlwaysOn (minus the Postgres-specific features like load balancing, replication, parallel query), while PgBouncer is just an out-of-process connection pooler.
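
To show the kind of low-level care that comes with that territory, here is a minimal sketch of checking streaming-replication lag from the primary via psycopg2 and pg_stat_replication (connection details are placeholders, and the column/function names are the pre-9.6 ones that match this era):

import psycopg2

# Connect to the primary; the DSN values are placeholders.
conn = psycopg2.connect(host='pg-primary', dbname='postgres', user='monitor', password='...')
cur = conn.cursor()

# pg_stat_replication lists each standby streaming WAL from this primary (pre-9.6 names).
cur.execute("""
    SELECT application_name, state,
           pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
    FROM pg_stat_replication""")

for name, state, lag in cur.fetchall():
    print(name, state, lag)
    if lag is not None and lag > 64 * 1024 * 1024:   # arbitrary 64 MB threshold for the example
        print("WARNING: standby", name, "is lagging; check wal_keep_segments and the network")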

MySQL has well-established HA/DR – master HA (the simplest being a master-slave combination). But anything requiring a proprietary cluster service/hardware etc. is not suggested for a cloud environment. MHA (from DeNA) is the closest to SQL Server AlwaysOn in terms of concepts. At present I do not have an idea about Tungsten, which also provides application failover. Another approach is corosync & pacemaker.

The concept of providing availability is the same across all the relational databases:

1. Create a virtual network to isolate the workload (and, more importantly, retain an IP in Azure).

2. Create an availability set of the two database servers to ensure the pair does not go down at once for maintenance.

3. Create the primary server and choose the option for pushing/pulling data from the secondary (push requires the secondary to be up). I am glossing over setting up WSFC for SQL Server, the listener, Active Directory config, setting up archive mode on the primary, and taking a backup and applying it to the secondary.

4. Create additional read replicas if the database technology allows it.

5. Create a remote location for DR to on-premise in an asynchronous way. Another cloud location requires the cloud to provide inter-datacenter connectivity.

6. If a technology like the listener in SQL Server is supported, configure it to provide failover for the local setup.

Monitoring of the setup is required to detect network partitions and workload issues on either of the machines.

For Azure, the availability set provides the concept of ensuring machines within the set do not all go down for a planned maintenance job (by the Azure team). For cloud 2, use Multi-AZ at a minimum for local HA to host the machines, and push out to another region for database DR. For cloud 3, one again has to use a concept of zones to isolate “locally”, plus replicate to a region for the database. (For truly moving an application a lot more has to be done, but we will restrict ourselves here to the database.)

Update – This post does not contain a lot of material about application continuity. Unfortunately it depends on the client driver and the intelligence built into it for identifying failure of the primary. Oracle has something called SCAN/FAN+ONS, the SQL Server client can also detect failure and try for HA, and app servers like JBoss/WebLogic have a multi-pool HA facility.
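
Where the driver does not do this for you, the application ends up carrying a little of the logic itself. A database-agnostic sketch (the endpoints and the connect function are placeholders for whatever driver and listener you actually use):

import time

# Hypothetical ordered list: listener/primary first, then known secondaries.
CANDIDATES = ['sql-listener.contoso.local', 'sql-secondary.contoso.local']

def connect_with_failover(connect, retries=3, backoff=2.0):
    """Try each candidate endpoint; on total failure back off and sweep again.

    `connect` is whatever your driver exposes (pyodbc.connect, psycopg2.connect, ...)."""
    for attempt in range(retries):
        for host in CANDIDATES:
            try:
                return connect(host)             # success: hand back a live connection
            except Exception as exc:             # real code should catch the driver's error types
                print("connect to %s failed: %s" % (host, exc))
        time.sleep(backoff * (attempt + 1))      # simple linear backoff before the next sweep
    raise RuntimeError("no database endpoint reachable")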


Spark on Azure using Docker – works

For the past few weeks I have been trying out Docker and found it useful for conveying the need for lightweight containers for dev/test.

Although it works like git, it presents nice extensions on/around LXC. LXC has an extremely simple CLI to use and run with (as a user I remember being excited by Solaris containers a long time ago). Docker makes it much more powerful by adding versioning and reusability, IMHO.

I used it on Azure without issues. When Spark’s Docker-friendly release was mentioned by Andre, it had been on my to-do list for a long time. The intent was to run the perf benchmark using the memetracker dataset – I will get it onto a full-fledged cluster one of these days.

Update – 2014-10th-June – MSOpen technologies announces support for docker natively on Azure – http://msopentech.com/blog/2014/06/09/docker-on-microsoft-azure/

Everything mentioned at the repo worked without issues – I just cloned the docker scripts directly. The only change was for the cloning; I used the following statement:

git clone http://github.com/amplab/docker-scripts.git

The challenge with any new data system is to learn import/export of data, easy querying, monitoring, and finding out root causes. That will require some work in a real project – somewhere down the road. I got distracted by the use of Go in Docker in between.
 

Hekaton – aka in-memory OLTP engine tips

A long-pending blog post – for almost half a year.

In the case of the Hekaton engine in SQL Server 2014, one has to plan the number of hash buckets to ensure good performance of selects/inserts. Hash data structures and their collisions/chaining are familiar to the regular software person, but in the database world these are new things. Most of the information about sizing is shared at http://msdn.microsoft.com/en-us/library/dn205318(v=sql.120).aspx (plan for 1-2x the size of the index – always better to over-provision). With respect to range queries, one has to create an index (same old syntax) at table-creation time. The Bw-tree data structure used underneath is a good reference read when you have time. The simple way to understand it: a hash structure is a pointer to a linked list of rows, has no order, and is good for the equality operator (=); range queries (<, >) on the other hand may require ordered traversal, so they need a different structure to support that requirement.
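
As a sketch of what that planning looks like, here is the 2014-era DDL driven from Python via pyodbc. The connection string, table and bucket count are illustrative, and it assumes a database that already has a memory-optimized filegroup; verify the syntax against the documentation before using it.

import pyodbc

# Placeholder connection string; assumes SQL Server 2014 with a memory-optimized filegroup.
conn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};SERVER=tcp:myserver;'
                      'DATABASE=Sessions;UID=app;PWD=...', autocommit=True)

conn.execute("""
CREATE TABLE dbo.SessionState (
    SessionId INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),  -- over-provision the buckets
    UserName  NVARCHAR(100) NOT NULL,
    LastTouch DATETIME2 NOT NULL,
    INDEX ix_LastTouch NONCLUSTERED (LastTouch)   -- Bw-tree backed range index for <, > queries
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
""")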

The DMV sys.dm_db_xtp_hash_index_stats provides information about the hash buckets – how full they are, etc.
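
Continuing the sketch above, the DMV can be polled to sanity-check the bucket count guess: mostly empty buckets suggest over-provisioning, long chains suggest too few buckets.

# Reuses the pyodbc connection from the previous sketch.
rows = conn.execute("""
    SELECT OBJECT_NAME(object_id) AS table_name,
           total_bucket_count, empty_bucket_count,
           avg_chain_length, max_chain_length
    FROM sys.dm_db_xtp_hash_index_stats""").fetchall()

for r in rows:
    # Long average chains suggest BUCKET_COUNT was set too low for the key cardinality.
    print(r.table_name, r.total_bucket_count, r.empty_bucket_count,
          r.avg_chain_length, r.max_chain_length)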

Another issue associated with the present release of Hekaton, which I am hoping will go away in the final release, is the tight binding to the generated code representing an in-memory table/procedure – right now, purging older versions is not possible without a restart of the SQL Server process itself.

The DMV sys.dm_os_loaded_modules opens the lid on which module (the representation of the native table/procedure) is loaded to help the engine.

A few gaps in this CTP release: foreign keys, DML triggers, CHECK constraints, and DDL commands to change an index – more are documented end to end at http://msdn.microsoft.com/en-us/library/dn133181(v=sql.120).aspx.

For now we suggest this technology for folks who are challenged by heavy locking, read-intensive tables or CPU usage, or who want to use it for session persistence. And in the near future, when SSIS integration becomes friendlier (say, support for MERGE), for ETL purposes; right now staging-table usage is okay (as we might want to push in a lot of data).

We also request proper planning for the growth of the data, and that it not be used for unbounded data.

Is adoption of Hekaton without changes possible, as some folks would like to claim? I would never say that. Presently foreign keys, check constraints and identity are not present – that by itself, in the majority of cases, requires changes to existing systems.

Since an MVCC mechanism is used in the background, applications have to handle errors when an update fails due to a value change, just like deadlock errors. So respectfully listen to the spiel, hope the “no change” day comes eventually, and for now look at the workloads which can take advantage of this feature.
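
A minimal sketch of that application-side handling, again via pyodbc; the error number check reflects my recollection of the in-memory OLTP write-conflict error (41302) and should be verified against the documentation.

import time
import pyodbc

def update_with_retry(conn, sql, params, attempts=3):
    """Retry optimistic-concurrency failures the same way deadlocks are usually retried."""
    for attempt in range(attempts):
        try:
            conn.execute(sql, params)
            return
        except pyodbc.Error as exc:
            message = str(exc)
            # 41302 is (to my recollection) the in-memory OLTP update-conflict error;
            # treat it like error 1205 (deadlock) and retry after a short pause.
            if '41302' in message or '1205' in message:
                time.sleep(0.05 * (attempt + 1))
                continue
            raise
    raise RuntimeError("update kept conflicting after %d attempts" % attempts)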

 

Is there a tool which can help in adopting Hekaton? Yes – http://blogs.technet.com/b/dataplatforminsider/archive/2013/09/17/new-amr-tool-simplifying-the-migration-to-in-memory-oltp.aspx

Other in-memory technologies present in the SQL Server stack are the Tabular option in SSAS and Excel PowerPivot. Non-Microsoft technologies which have done this: BerkeleyDB/TimesTen, Coherence, solidDB, GigaSpaces.

For folks like us who saw IMDB and then pulled back, this is sweet revenge. Finally, changes have been made at the engine level to minimize log I/O, reduce latches by using a data structure which prevents contention, and integrate well with existing technology pillars like availability.

For a good “from the horse’s mouth” discussion, look for Sunil Agarwal’s DBI-B307 TechEd session on your favourite search engine.


Respecting a schemaless (dynamic schema) store’s strength

Recently I was contacted by multiple folks using a NoSQL/data-structure store for guidance about daily operational issues, bad data and a lot of finger pointing.

In one of the cases, the main product manages click impressions and tries to give “perspective” on views of the ad. Unfortunately it is a “mixture of solutions”, each with its own application life cycle and ownership.

Issues

1. Mismatched data types
Incident – Application 2 started storing arrays where a string was expected. It must have been because of a change in the other app when the requirement changed from a “single value” to a multi-valued item. This error was caught intermittently in a few filter queries and reports. It was difficult to catch the issue as it would happen only for certain filters. The other impact is increased storage: if this information was/is small, a simpler type could have represented the data.

2. Subtle change in the name of a document element resulting in missing data
Incident – Some application started reporting incomplete data. The cause was a wrong name change, resulting in a new field and a bigger bug.

3. Change in data type precision
Incident – Suddenly values did not match up in reconciliation; a float had been changed to an integer (gasp).

In a document store, by the nature of it being a document database, it is very easy to make simple mistakes,

e.g.,

> db.test.save( { x: 1 } )
> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }

// Now store a string for the same key without an issue
> db.test.save( { x: "str" } )
> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb598"), "x" : "str" }

> db.campaigndetails.save(
    { _id: "xkcd", "details" : { campaign_id: 1, inventory_id: 2, audience_attr: 3 }, rev: 10 } )
> db.campaigndetails.find( { "_id": "xkcd" } )
{ "_id" : "xkcd", "details" : { "campaign_id" : 1, "inventory_id" : 2, "audience_attr" : 3 }, "rev" : 10 }
> db.campaigndetails.find( { "id": "xkcd" } )
// returns nothing – the field name is subtly wrong ("id" vs "_id")
Is that the store’s issue? Nope – it is the user’s (in this case, the developer’s) responsibility to take the right measures so that these situations are avoided. What are the other repercussions? Let us say there is a “flow of data” to another place: all the ETL/messaging kinds of processing have a greater chance of failing if a type changes or an extra field gets added.
How can one overcome it?

1. Use common data repository access code
2. Use common data repository validation code
3. Use common messaging transformation code
4. Use basic layout verification code (a minimal sketch follows this list)
5. Since many of the document databases do not support joins, folks create either embedded/hierarchical or linked structures. An embedded schema has implications for changing the “value” of a related item, and many issues get reflected here.

6. Reconciliation of the catalog (schema) with the data values – this can be run once in a while to look for any deviation from the agreed-upon schema
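
A minimal sketch of point 4 above (basic layout verification); the expected fields echo the campaign example and are purely illustrative.

# Expected layout agreed upon by all the writing applications (fields are illustrative).
EXPECTED = {
    'campaign_id': int,
    'inventory_id': int,
    'audience_attr': int,
    'rev': int,
}

def verify_layout(doc):
    """Return a list of problems instead of silently saving a mistyped or misnamed field."""
    problems = []
    for field, expected_type in EXPECTED.items():
        if field not in doc:
            problems.append('missing field: %s' % field)
        elif not isinstance(doc[field], expected_type):
            problems.append('field %s expected %s, got %s'
                            % (field, expected_type.__name__, type(doc[field]).__name__))
    for field in doc:
        if field not in EXPECTED and field != '_id':
            problems.append('unexpected field: %s' % field)   # catches subtle rename bugs
    return problems

# The renamed-field and float-to-int incidents above would be caught before the save.
print(verify_layout({'_id': 'xkcd', 'campaign_id': 1, 'inventory_id': 2,
                     'audience_atr': 3, 'rev': 10.0}))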

Another data-structure store example showing dynamic schema:

redis 127.0.0.1:6379> set car.make "Ford"
OK
redis 127.0.0.1:6379> get car.make
"Ford"
redis 127.0.0.1:6379> get car.test
(nil)
redis 127.0.0.1:6379> get car
(nil)
redis 127.0.0.1:6379> get car.Make
(nil)
redis 127.0.0.1:6379> set u '{ "n":"f_name"}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\":\"f_name\"}"
redis 127.0.0.1:6379> set u '{ "n": 1}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\": 1}"

Sadly, people are throwing the baby out with the bathwater in their hurry, forgetting that somebody will need to maintain the data and derive value from it. Data is sacred and its sanctity should always be maintained.

There is an issue of schema design which has an influence on some of the above problems. Rather than blindly “embedding” documents, a simple check should be done for “immutability”. If there is a possibility of change, a lot of data gets duplicated as it grows; it is better to keep it separate. Another suggestion would be to keep the names extra small.


Update – Article on Cloud and Hive

Suprotim and Sumit pushed me to publish an article on “Decision Making Pivots for adoption of Cloud”. This is basically the gist of the guiding principles we use to help customers migrate to the cloud. We have a few variations for enterprise strategy where workloads like Exchange (email), SharePoint (collaboration) or CRM need to move to the cloud. We have a colleague, MS Anand, who helps customers on the private cloud adoption front to create efficiencies out of existing infrastructure. Here is the document which focuses on Azure and was part of the magazine: Azure Adoption – pivots to help make the right decision.

I just completed something else I promised Suprotim – an article comparing Hive for people who are used to SQL as the dialect to interact with a database. Although comparing a database and Hive is not strictly an apples-to-apples comparison, I wanted to take an approach where understanding BigData does not become a burden of learning MapReduce/HDFS and the overall Hadoop ecosystem. It is much easier to start by doing something very simple that we do with a regular data store, try to do it with Hive, and then start looking at the differences. It also helps to understand why HDFS and MapReduce help in addressing scale and availability for very large amounts of data. Although there are tools like Pig/Cascalog/Scalding/Cascading, I decided to focus on HiveQL as it is closest to the SQL dialect, with the simple intention of not introducing many new things simultaneously. Once the article has been out for a month in the magazine, I plan to share it here again, or you can pick it up from http://www.dotnetcurry.com/magazine/dnc-magazine-issue2.aspx (updated – 1st Sep 2012) once it comes online.

And if everything goes all right, with help and a push from Vinod & Pinal, I will devote energies toward something more useful.

Update – 1st Sep 2012 – Things I have not covered in the Hive article:

* There is a tab for DisasterRecovery on hosted Azure which can be switched on – it is not clear WRT the NameNode whether FsImage and EditLogs are backed up every x minutes to another “managed” location.

* Is there a secondary namenode where the log/checkpoint gets shipped to? There is a secondary namenode, but executing commands against it over RDP simply hangs.
* WRT HDFS data – is it snapshotted/backed up to Azure storage, taking advantage of the inherent replication there? (updated – the preferred storage is Azure storage rather than local nodes)
* WRT Hive metadata – is it backed up to a “managed” location every x hours/minutes? (no clear idea)

* If the NameNode crashes – not clear now (WRT HadoopOnAzure) whether the AppFabric services inherent in Azure are utilized to identify it and bring it up, using the earlier “managed” location (via the –importCheckpoint option).

* Upgrade & rollback of the underlying version will be part of HOA’s lifecycle management. The assumption here is that at present one version will be prevalent across tenants; upgrading individual clusters to a different version is not supported. (updated – December 2013 – upgrade to a new version is supported)

* Addition/deletion of nodes into an existing cluster (still a manual job)

* Adding incremental data (updated – normal import process)

* Adding a Fair scheduler

* Monitoring of job progress/cancellation (updated Dec 2013 – PowerShell based)

* Identifying bottlenecks in JVM/HDFS settings (completely roll your own)

* Dealing with hadoop fsck identifying bad/missing blocks and related issues (do your own)

* Rebalancing the data (do your own)

Updated – 20th Sep 2012

Cloudera posted a wonderful article on using flume, oozie and hive to analyze the tweets. http://www.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/


Hive article & Azure adoption article

I recently wrote an article for DotNetCurry – putting down experiences with Azure migration pivots.

I will be jotting down experiences and the basics of HadoopOnAzure in the same way.

I intend to cover how to look at Hadoop from a database user’s point of view. It will cover storage, querying and loading of data using Apache projects such as Pig/Hive. We will try not to get down to map-reduce jobs as a starting point, as they tend to cloud the judgement on adoption for administrators/developers familiar with the SQL dialect, who depend on it to define schemas and queries. We will cover availability (the strongest point of HDFS), scalability (HDFS – easy adding of nodes) and querying (Pig/Hive). This article will not cover machine learning, performance tuning of HDFS/MR jobs, installation, or management/monitoring.

I will try to publish a link to the Word document with errata for the Azure article here one of these days.

Hive article & Azure adoption article