Azure Resource Manager – a journey to understand basics

As of BUILD 2015 we have declarative and procedural deployment control over assets, or groups of assets, for Network and Compute resources. Back in October/November 2014 the excitement was focused on the ability to download and apply gallery templates or work with website assets. This release of the ARM API (2015) brings in a lot more capabilities.

This enables sophisticated scenarios. For example, Corey Sanders has an example of using it to deploy containers on a virtual machine with the famous three together (nginx, redis and mongo). Or look at the Chef integration, Network Security Groups, or RBAC (role-based access control).

You can interact with assets via the Portal UX, but it is easier to understand what is happening using PowerShell commands in Debug/Verbose mode.
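For example, the standard -Verbose and -Debug common parameters on the cmdlets used later in this post surface most of that detail (in Debug mode the cmdlets print the underlying REST requests and responses):

New-AzureResourceGroup -Name "GSK-ARM-HelloARM" -Location "South Central US" -Verbose -Debug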

[Image: ARM-Build2015]

What are Resource Groups
Applications deployed in Microsoft Azure are usually composed of a combination of different cloud assets (e.g., VMs, storage accounts, a SQL database, a virtual network, etc.). A resource group (RG) lets these resources, which are exposed by resource providers, be provisioned and managed together in a location.

Ref – https://msdn.microsoft.com/en-us/library/azure/dn948464.aspx

Let us get started. Verify that in your Azure PowerShell

$PSVersionTable 

reports version 3.0 or 4.0+, and that

 Get-Module AzureResourceManager 

reports at least 0.9. Otherwise you need to get the latest Azure PowerShell.
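A quick way to check both at once (a minimal sketch, assuming the module was installed by the Azure PowerShell installer):

# PowerShell engine version – needs to be 3.0 or later
$PSVersionTable.PSVersion

# AzureResourceManager module version – needs to be 0.9 or later
(Get-Module -ListAvailable AzureResourceManager).Version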

Let us start with a simple command to deploy a resource group.

Switch-AzureMode AzureResourceManager

#do not execute following, maybe do get-help on it to get an idea of what it requires
get-help New-AzureResourceGroup -detailed
get-help New-AzureResourceGroupDeployment -detailed

The help for New-AzureResourceGroup implies we need a simple JSON template file. How much simpler can we make it? If you look at the GitHub repository hosting all the example templates, they are pretty daunting if you are starting out for the first time. So let us start from the basics.

What are Azure Resource Manager Templates?
Azure Resource Manager templates allow us to deploy and manage these different resources together by using a JSON description of the resources and the associated configuration and deployment parameters. After creating a JSON-based resource template, you pass it to the PowerShell cmdlet, which executes its directives and ensures that the resources defined within it are deployed in Azure.

Create a simple file helloarm.json; this is the template file that describes resources, their properties and the ways to populate them via parameters. We will go on a simple journey to accomplish that. By the end of the session we will have a single parameter file which pushes data into the template file, which in turn is used by ARM to create resources.

helloarm.json – we will explore the possible content of each collection or array as required later.

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {},
    "variables": {},
    "resources": [],
    "outputs": {
        "result": {
            "value": "Hello World",
            "type": "string"
        }
    }
}

Let us see how we can test it out; execute the following:

New-AzureResourceGroup -Name GSK-ARM-HelloARM -Location "South Central US" -Tag @{Name="GSK-ARM-RG";Value="TEST"} 

This results in an empty resource group, which can later be populated via the New-AzureResource or New-AzureResourceGroupDeployment cmdlets to add resources and deployments to it.

What about the parameters of the PowerShell command?

  • Name is the name of the resource group.
  • Location is the data center we want to deploy to.
  • Tag is a new feature which allows us to tag our resources with company-approved labels to classify assets.

[Image: bare-min-rg]

    For now let us remove the newly created RG by

     Remove-AzureResourceGroup -Name "GSK-ARM-HelloARM" 

    – this command does not return any output by default. Use the -PassThru parameter to get that information, and use the -Force parameter to suppress the confirmation prompts. At present you cannot immediately see the operational log populated with these actions in your portal. We will come to that in a minute.
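    For example, combining both switches:

     Remove-AzureResourceGroup -Name "GSK-ARM-HelloARM" -Force -PassThru 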

    Now let us provision using our empty template

     
    New-AzureResourceGroup –Name "GSK-ARM-HelloARM" –Location "South Central US" -Tag @{Name="GSK-ARM-RG";Value="TEST"}
    ## This creates a RG for us to add resources to
    New-AzureResourceGroupDeployment -Name "GSK-ARM-DEP-HelloARM" -ResourceGroupName "GSK-ARM-HelloARM" -TemplateFile .\helloarm.json
    
    DeploymentName    : GSK-ARM-DEP-HelloARM
    ResourceGroupName : GSK-ARM-HelloARM
    ProvisioningState : Succeeded
    Timestamp         : 5/3/2015 7:58:52 PM
    Mode              : Incremental
    TemplateLink      :
    Parameters        :
    Outputs           :
                        Name             Type                       Value
                        ===============  =========================  ==========
                        result           String                     Hello World
    

    Can you execute the first command without Location or Name? Or do you wish Test-AzureResourceGroup existed?

    You can see the resource group and associated tags

     
    Get-AzureResourceGroup -Name GSK-ARM-HelloARM
    
    ResourceGroupName : GSK-ARM-HelloARM
    Location          : southcentralus
    ProvisioningState : Succeeded
    Tags              :
                        Name        Value
                        ==========  =====
                        GSK-ARM-RG  TEST
    
    Permissions       :
                        Actions  NotActions
                        =======  ==========
                        *
    
    ResourceId        : /subscriptions/XXXXXXXXXXXXXXXXX/resourceGroups/GSK-ARM-HelloARM2
    
    

    Let us get the tags

     
    (Get-AzureResourceGroup -Name GSK-ARM-HelloARM2).Tags 
    
    Name                           Value
    ----                           -----
    Value                          TEST
    Name                           GSK-ARM-RG
    

    What happened to the deployment?

     
    Get-AzureResourceGroupLog -ResourceGroup GSK-ARM-HelloARM2 # docs says to provide -Name - but that is wrong.
    
     
    Get-AzureResourceGroupDeployment -ResourceGroupName GSK-ARM-HelloARM2
    
    DeploymentName    : hell-arm
    ResourceGroupName : GSK-ARM-HelloARM2
    ProvisioningState : Succeeded
    Timestamp         : 5/3/2015 4:37:14 AM
    Mode              : Incremental
    TemplateLink      :
    Parameters        :
    Outputs           :
                        Name             Type                       Value
                        ===============  =========================  ==========
                        result           String                     Hello World
    

    aha ! there you see the output information 🙂

     
    Get-AzureResourceLog -ResourceId /subscriptions/XXXXXXXXXXX/resourceGroups/GSK-ARM-HelloARM 
    

    does not provide anything if you run this command after an hour of inactivity on that resource. You can add a -StartTime 2015-05-01T00:30 parameter (varying the date) to get an idea of everything that has happened on that particular resource. In this case you should see a write operation like

     
    EventSource : Microsoft.Resources
    OperationName : Microsoft.Resources/subscriptions/resourcegroups/write
    

    OK, can we enable tags for a subscription via the template file, as is possible through this REST API https://msdn.microsoft.com/en-us/library/azure/dn848364.aspx ? Nope, not that I am able to find right now.

    So let us create a storage account with a tag in a new resource group.

    OK, so what are these resources we keep talking about?

     
    Get-AzureResource | Group-object ResourceType | Sort-Object count -descending | select Name 
    

    The command above will mostly have the following resources in its top 10 output if you have used Azure the classic way for some time (it will not list all possible resources; it only lists resources in use):
    Microsoft.ClassicStorage/storageAccounts
    Microsoft.ClassicCompute/domainNames
    Microsoft.ClassicCompute/virtualMachines
    Microsoft.Web/sites
    Microsoft.ClassicNetwork/virtualNetworks
    Microsoft.Sql/servers/databases

    Let us explore creating a storage account resource in a resource group

     
    $RGName= "GSK-RG-STORAGE-TEST"
    $dName = "GSK-RG-DEP-STORAGE-TEST"
    $lName ="East US"
    $folderLocation = "YourDirectory"
    $templateFile= $folderLocation + "\azuredeploy.json"
    

    Create the following .json file and save it in the folderLocation mentioned above.

    azuredeploy.json

     
    {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {},
        "variables": {},
        "resources": [{
            "type": "Microsoft.Storage/storageAccounts",
            "name": "gskteststoragesc",
            "location": "East US",
            "apiVersion": "2015-05-01-preview",
            "tags": {
                "dept": "test"
            },
            "properties": {
                "accountType": "Standard_LRS"
            }
        }],
        "outputs": {
            "result": {
                "value": "Hello World Tags & storage",
                "type": "string"
            }
        }
    }
    

    It is worth taking a closer look at the azuredeploy.json file. The outputs directive is almost the same as before; its value can be retrieved via the Get-AzureResourceGroupDeployment command.

  • Type represents the resource provider and type of the storage account (v2) – the new resource, compared to Microsoft.ClassicStorage/storageAccounts.
  • Properties belong to the asset/resource – in this case, what kind of storage we want: locally redundant, geo-redundant, etc. We have hardcoded it for now.
  • location represents the datacenter location; in this case we have hardcoded it to "East US".
  • Tags represent the tags we want to apply.
  • apiVersion is used by ARM to ensure the latest compliant bits are used to provision the resource in the right location.

Let us deploy this resource

 
New-AzureResourceGroup –Name $RGName –Location $lName -Tag @{Name="GSK-ARM-RG";Value="TEST"}
## This creates a RG for us to add resources to
 
New-AzureResourceGroupDeployment -Name $dname -ResourceGroupName $RGName -TemplateFile $templateFile
## dname provides deployment label for the resources specified in the template file to be deployed to RGName. 
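The outputs defined in the template can be read back from the deployment afterwards – a minimal sketch, assuming the Outputs section shown in the cmdlet output earlier is also exposed as a property on the object returned by Get-AzureResourceGroupDeployment:

(Get-AzureResourceGroupDeployment -ResourceGroupName $RGName -Name $dName).Outputs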

Your new portal might show a visualization like the one below.
[Image: success-storage-test]

 
Get-AzureResourceLog -ResourceId /subscriptions/#####################/resourceGroups/GSK-RG-STORAGE-TEST -StartTime 2015-05-02T00:30 #modify this for your time and subscription

Now let us modify this template to accept parameters from another file. Make the following changes to azuredeploy.json and save it as azuredeploy2.json. Notice that everything else remains the same.

 
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {

        "newStorageAccountName": {
            "type": "string",
            "metadata": {
                "description": "Unique DNS Name for the Storage Account where the Virtual Machine's disks will be placed."
            }
        },
        "location": {
            "type": "string",
            "defaultValue": "West US",
            "allowedValues": [
                "West US",
                "East US",
                "West Europe",
                "East Asia",
                "Southeast Asia"
            ],
            "metadata": {
                "description": "Location of resources"
            }

        }
    },
    "variables": {
        "storageAccountType": "Standard_LRS"
    },
    "resources": [{
        "type": "Microsoft.Storage/storageAccounts",
        "name": "[parameters('newStorageAccountName')]",
        "location": "[parameters('location')]",
        "apiVersion": "2015-05-01-preview",
        "tags": {
            "dept": "test"
        },
        "properties": {
            "accountType": "[variables('storageAccountType')]"
        }
    }],
    "outputs": {
        "result": {
            "value": "Hello World Tags & storage",
            "type": "string"
        }
    }
}

What has changed? Focus on the resources section first.

  • Look at the name – it will be picked from the parameters section and is of type string.
  • Location is interesting in the sense that it has a default value of West US; you will see the impact of that later on if you do not provide a location.
  • accountType, on the other hand, is picked from the variables section and has the single value Standard_LRS – local replication.
  • Then we have allowedValues in the parameters section for location – this ensures only these values are accepted when the New-AzureResourceGroupDeployment cmdlet executes using this template file. It takes away the pain of validating the inputs.

Let us execute the earlier command to provision against this new template.

After setting $templateFile properly to point to the new template:

$templateFile = $folderLocation + "\azuredeploy2.json"
New-AzureResourceGroupDeployment -Name $dname -ResourceGroupName $RGName -TemplateFile $templateFile

You will get prompted for the parameter newStorageAccountName – in our case we gave it gskrgstparam and got the following output:


DeploymentName    : GSK-RG-DEP-STORAGE-TEST
ResourceGroupName : GSK-RG-STORAGE-TEST
ProvisioningState : Succeeded
Timestamp         : 5/3/2015 4:36:15 PM
Mode              : Incremental
TemplateLink      :
Parameters        :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    newStorageAccountName  String                     gskrgstparam
                    location         String                     West US

Outputs           :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    result           String                     Hello World Tags & storage

First thing – notice the new Parameters section getting populated, and the location parameter not prompting, instead picking up its default value "West US". If you have access to the new portal, you will see something similar for the resource.
[Image: param-based-prompt-storage-acct]
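Incidentally, you do not have to rely on the prompt. Template parameters are surfaced as dynamic parameters on the cmdlet (at least in the builds I have tried), so something like the following should work without prompting – a hedged sketch using a made-up account name:

New-AzureResourceGroupDeployment -Name $dname -ResourceGroupName $RGName -TemplateFile $templateFile -newStorageAccountName "gskrgstparam2" -location "West US"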

Let us see if we can automate this command-line interaction altogether.

Create a new file azure2deploy.parameters.json with the following content:

{
    "newStorageAccountName": {
        "value": "gskrgparamfiletest"
    },

    "location": {
        "value": "West US"
    }
}

Create a new variable to hold the template parameters file location:

$templateParam = $folderLocation + "\azure2deploy.parameters.json"

Now execute a similar provisioning command, this time with the template parameters file passed to it. You will not get prompted for the name of the storage account because you have already supplied it in the parameters file on the command line.

New-AzureResourceGroupDeployment -Name $dname -ResourceGroupName $RGName -TemplateFile $templateFile -TemplateParameterFile $templateParam

DeploymentName    : GSK-RG-DEP-STORAGE-TEST
ResourceGroupName : GSK-RG-STORAGE-TEST
ProvisioningState : Succeeded
Timestamp         : 5/3/2015 4:59:36 PM
Mode              : Incremental
TemplateLink      :
Parameters        :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    newStorageAccountName  String                     gskrgparamfiletest
                    location         String                     West US

Outputs           :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    result           String                     Hello World Tags & storage

[Image: param-based-fileinput-storage-acct]
To wrap up this piece:

Get-AzureResourceGroup -Name $RGName
ResourceGroupName : GSK-RG-STORAGE-TEST
Location          : eastus
ProvisioningState : Succeeded
Tags              :
                    Name        Value
                    ==========  =====
                    GSK-ARM-RG  TEST

Permissions       :
                    Actions  NotActions
                    =======  ==========
                    *

Resources         :
                    Name                Type                               Location
                    ==================  =================================  ========
                    gskrgparamfiletest  Microsoft.Storage/storageAccounts  westus
                    gskrgstparam        Microsoft.Storage/storageAccounts  westus
                    gskteststoragesc    Microsoft.Storage/storageAccounts  eastus

ResourceId        : /subscriptions/XXXXXXXXXXXXXXX/resourceGroups/GSK-RG-STORAGE-TEST


OK, that is a lot of stuff for one session. We started from a bare-minimum resource group and graduated to using parameters and variables, and finally to feeding a parameters file to the provisioning engine. We found out how to get the log for a resource and how to remove it. The REST API allows you to check status after the POST has been made.
As for the .json file, it has different sections: outputs to dump messages, and resources to hold the resources to provision, which in turn pick up data from the variables and parameters sections.

Other questions could be

    How do you modify, say, a deployment? Add a disk to a VM? That is a different process for now.
    What is the relation between Chef, Puppet, Ansible and ARM? It is symbiotic: Chef and Puppet create infrastructure using ARM and provision/verify software installs that are not possible through simple shell/install files. Look at KundanaP's and John Gossman's sample:
    https://github.com/Azure/azure-quickstart-templates/tree/master/chef-json-parameters-ubuntu-vm
    What has ARM got to do with sharing and governance? You can create a common azuredeploy.json and folks can create different azuredeploy.parameters.json files to create their own environments. So you control the deployment (see the sketch after this list).
    Where are more samples? https://github.com/Azure/azure-quickstart-templates – you can even contribute more by following the guidelines there.
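A minimal sketch of that governance pattern, reusing the storage template from above and assuming a hypothetical team-specific parameters file azuredeploy.parameters.teamA.json:

## Shared template, team-specific parameters (hypothetical file name)
$teamParam = $folderLocation + "\azuredeploy.parameters.teamA.json"
New-AzureResourceGroupDeployment -Name "GSK-RG-DEP-TEAMA" -ResourceGroupName $RGName -TemplateFile $templateFile -TemplateParameterFile $teamParam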

What do I miss? The ability to put in a monitoring hook – but looking at the providers from Operational Insights and Application Insights, I can guess they are coming.

For fun, next let us create a VM which provisions Aerospike on a DS-series Ubuntu machine. We will need to execute shell commands via the custom script extension for Linux and learn about dependencies so that we do not have to create individual assets one by one.

Ref – http://blog.davidebbo.com/, Mahesh T and Tabrez.
REST API – https://msdn.microsoft.com/en-us/library/azure/dn790568.aspx


Azure Linux tip – swappiness

In general, folks disable swap for memory-bound processes on Linux instances (YMMV).

How to detect whether a swapfile is present

1. grep -i --color swap /proc/meminfo

2. swapon -s

3. free -m

You will get confirmation that no swap is set up. If you check swappiness via cat /proc/sys/vm/swappiness, though, you will see the default value of 60 :). The question on your mind will be: where is it doing the swapping?

What should you do? In general, no swapping is a good thing, so setting swappiness to 0 is a good idea with the default installation. In case you require a swapfile (which you will, if you care about the latest kernel changes), add a swap file based off the local disk (sdb1 on /mnt mostly, or SSD) on the guest for the instance (do not put it on Azure storage).

How to modify swappiness (for a web or file server) – echo 5 | sudo tee /proc/sys/vm/swappiness or sudo sysctl vm.swappiness=5. To persist this setting through reboots it is better to edit /etc/sysctl.conf and make sure the swapfile is added to fstab. No swapping is good for Lucene workloads (Solr/Elasticsearch) and databases (Cassandra/Mongo/MySQL/Postgres etc.), but for stability reasons on machines that are constantly peaked it is good to have a local disk/SSD swapfile as help.

How to allocate a swapfile – usually you will do it on the local disk (use df -ah to get the mount name):

– sudo fallocate -l 4G /mnt/swapfile (ensure the size is double the memory size)

– ensure only root has access: sudo chmod 600 /mnt/swapfile

– sudo mkswap /mnt/swapfile

– enable it: sudo swapon /mnt/swapfile

– verify with free -m

– add to fstab: sudo nano /etc/fstab and add the line /mnt/swapfile none swap sw 0 0

To switch off swapping completely on Linux systems, you can disable swap temporarily by running sudo swapoff -a.

To disable it permanently, you will need to edit the /etc/fstab file and comment out any lines that contain the word swap.

To ensure swappiness stays switched off after a reboot:

# Set the value in /etc/sysctl.conf (the redirection must run as root, hence tee -a rather than sudo echo >>)
echo '' | sudo tee -a /etc/sysctl.conf
echo '# Set swappiness to 0 to avoid swapping' | sudo tee -a /etc/sysctl.conf
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf

Why swap at all, if nobody likes swapping and it is not the 90s? For safety. From kernel version 3.5-rc1 and above, a swappiness of 0 will cause the OOM killer to kill the process instead of allowing swapping. (ref – http://java.dzone.com/articles/OOM-relation-to-swappiness ) While you are at all of this, do notice df /dev/shm and see what you can do about it. Do you want to use it?

Ref –

  1. ElasticSearch – strongly recommends bootstrap.mlockall, with the suggestion to set swappiness to zero to switch swapping off and also instruct the OOM killer not to kill it, http://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-configuration.html
    1. When Otis says something – I just follow it. http://elasticsearch-users.115913.n3.nabble.com/mlockall-vs-vm-swappiness-td4028126.html
  2. Solr –  (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_tuning_solr.html )
  3. Cassandra http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html
  4. MySql – https://mariadb.com/kb/en/mariadb/configuring-swappiness/
  5. MongoDB – http://docs.mongodb.org/manual/faq/diagnostics/
  6. Postgres –  it is the same suggestion.
  7. Oracle – http://docs.oracle.com/cd/E24290_01/coh.371/e22838/tune_perftune.htm#COHAG223

Azure throttling errors

Most cloud services provide elasticity, creating the illusion of unlimited resources. But many times hosted services need to push back on requests to provide good governance.

Azure does a good job providing information about this throttling in various ways across services. One of the first services was SQL Azure, which provided error codes to help the client retry. Slowly, all services are now providing information when they throttle. Depending on whether you access the native API or the REST endpoint, you get this information in different ways. I am hoping that comprehensive information from services and their underlying resources – network, CPU and memory – slowly starts percolating, as it does for storage, so that clients and monitoring systems can manage workloads.

Azure DocumentDB provides a throttling error (HTTP 429) and also the time after which to retry. It is definitely ahead of other services in providing this information.
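As a minimal sketch of how a client could react to this – assuming a hypothetical REST endpoint called with plain Invoke-WebRequest rather than a service SDK (real calls also need auth headers), and noting that the exact retry-after header name varies by service:

$uri = "https://example-account.documents.azure.com/dbs/mydb/colls/mycoll/docs"   # hypothetical endpoint
$maxAttempts = 5

for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    try {
        $response = Invoke-WebRequest -Uri $uri -Method Get
        break   # success – stop retrying
    }
    catch {
        $status = $_.Exception.Response.StatusCode.value__
        if ($status -eq 429 -or $status -eq 503) {
            # Honour a retry-after style header if the service sends one, else back off exponentially
            $wait = $_.Exception.Response.Headers["Retry-After"]
            if (-not $wait) { $wait = [math]::Pow(2, $attempt) }
            Start-Sleep -Seconds ([int]$wait)
        }
        else {
            throw   # not a throttling error – surface it
        }
    }
}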

Azure Storage, on the other hand, provides information to the native client so that it can back off and retry. It also pushes this information into metrics. A great paper exists which provides information about Azure Storage transactions and capacity.

SQL Azure throttling – SQL Azure was one of the first services to provide throttling information due to CRUD/memory pressure (error codes 45168, 45169, 40615, 40550, 40549, 40551, 40554, 40552, 40553).

Azure Search throttling provides HTTP error 429/503 so that client can take proper action.

Azure Scheduler provides HTTP Status 503 as it gets busy and expects client to retry.

Azure Queue, Service Bus Queue both send back 503 which REST clients can take advantage of.

BizTalk Services provides "Server is busy. Please try again".

Over the years we have always asked customers to exploit Azure, and one way to do that is to work with the hosted services and plan workloads by catching these kinds of errors. Some customers like SQL Azure's throttling so much that they wish they had some of those soft/hard throttling errors in their on-premise database.

Most Azure services do not charge when quotas are hit or requests are throttled. The idea is for the client to back off and try again. I do hope the "monitoring" becomes better, though – for example, in the case of BizTalk Services a client should be able to query the "busy-ness", since it has to retry after the system becomes less busy. SQL Azure's retry logic has been well codified and understood over the years.

Just in case you wonder whether other public cloud services have throttling too: public cloud services are shared infrastructure and implement throttling for governance. It is implemented and exposed in different ways. DynamoDB, for example, has a 400 series of error codes, specifically LimitExceededException, ProvisionedThroughputExceededException and ThrottlingException. Almost every service has a 400 series of errors with throttling as a specific exception.


10 things I wished my datastore would do (updated: Is DocumentDB my savior?)

We generally use datastores to ingest data and try to make some meaning out of it by means of reports and analytics. Over the years we have had to make decisions about adopting different stores for "different" workloads.

The simplest case is analysis – where we offload to pre-aggregated values with either columnar or distributed engines to scale out over the volume of data. We have also seen the rise of stores which lay data out in a way that is friendly to ranges of data. Then we have some which allow very fast lookups, maturing into doing aggregations on the run. We have also seen the use of data-structure stores – the hash-table-inspired designs versus the ones which don sophisticated avatars (gossip, vector clocks, bloom filters, LSM trees).

That other store which pushed compute to storage is undergoing a massive transformation, adopting streaming and regular OLTP (hopefully) apart from its usual data-reservoir image. Then we have the framework-based plug-and-play systems doing all kinds of sophisticated streaming and other wizardry.

Many of the stores require extensive knowledge about the internals of the store: how data is laid out, techniques for using the right data types, how data should be queried, issues of availability – and taking decisions which are generally "understandable" to the business stakeholders. When things go wrong, the tools range from a mere error log to the actual "path of execution" of the query. At present there is a lot of ceremony around capacity management and around how data changes are logged and pushed to another location. This much detail is a great "permanent job guarantee" but does not add a lot of value for the business in the long term.

2014-22nd Aug Update – DocumentDB seems to take away most of the pain – http://azure.microsoft.com/en-us/documentation/services/documentdb/

  1. Take away my schema design issues as much as it can

What do I mean by it? Whether it is a traditional relational database or a new-generation NoSQL store, one has to think through either the ingestion pattern or the query pattern to design the store representation of entities. This is by nature a productivity killer and creates an impedance mismatch between the storage and the in-application representation of the entities.

Update (2014-22nd Aug) – DocumentDB – need to test for good amount of data and query patterns but looks like – with auto-indexing, ssd we are on our way here.

  2. Take away my index planning issues

This is another of those areas where a lot of heartburn takes place, as a lot of innards are exposed in terms of the implementation of the store. This, if done completely automagically, would be a great, great time-saver: just look at the queries and either create the required indexes or drop them. A lot of performance regression issues are introduced as small changes accumulate in the application and are introduced at the database level.

Update (2014-22nd Aug) – DocumentDB does it automatically , has indexes on everything. It only requires me to drop what I do not need. Thank you.

  3. Make scale out/up easier

Again, this is exposed to the end application designer in terms of what entities should be sharded vertically or horizontally. It ties back to point 1 in terms of ingestion or query. This makes or breaks the application in terms of performance and has an impact on the evolution of the application.

Update (2014-22nd Aug) – DocumentDB makes it a no-brainer again. Scale-out is done in capacity units (CUs). Need to understand how the sharding is done.

  4. Make the "adoption" easier by using existing declarative mechanisms for interaction. Today one has to choose the store's way rather than good old DDL/DML, which is at least 90% the same across systems. This induces fatigue for ISVs and larger enterprises who look at the cost of "migration back and forth". Declarative mechanisms have this sense of a lullaby to calm the mind, and we indulge in scale-up first followed by scale-out (painful for the application).

Make sure the majority of the clients are on par with each other. We may not need something immediately for Rust, but at least ensure PHP, Java, .NET native and derived languages have robust enough interfaces.

Make it easier to “extract” my data in case I need to move out. Yes I know this is the least likely option where resources will be spent. But it is super-essential and provides the trust for long term.

Lay out in simple terms roadmap – where you are moving so that I do not spend time on activities which will be part of the offering.

Lay out in simple terms where you have seen people having issues or wrong choices and share the workarounds. Transparency is the key. If the store is not good place for doing like latest “x/y” work – share that and we will move on.

Update (2014-22nd Aug) – DocumentDB provides SQL interface !

  5. Do not make choosing the hardware a career-limiting move. We all know stores like memory, but persistence is key for trust. SSD/HDD, CPU/core, virtualization impact – way too many moving choices to make. Make the 70-90% scenarios simple to decide. I can understand that some workloads require a lot of memory, or only memory – but do not present a swarm of choices. Do not tie down to specific brands of storage or networking which we cannot live to see after a few years.

In the hosted world, pricing has become crazier – lay out in simple-to-understand terms how costing is done. In a way, licensing by cores/CPU was great because I did not have to think much and pretty much over-provisioned, or did a performance test and moved on.

Update (2014-22nd Aug) – DocumentDB again simplifies the discussion, it is SSD backed and pricing is very straightforward – requests – not reads, not writes or indexed collection.

  6. Resolve HA/DR in a reasonable manner. Provide a simple guide to understand the hosted vs host-your-own worlds. Share in a clear manner how clients should connect and fail over. We understand distributed systems are hard, and if the store supports the distributed world, help us navigate the impact and the choices in simple layman terms, or something we are already aware of.

If there’s an impact in terms of consistency – please let us know. Some of us care more about it than others. Eventual is great but the day I have to say – waiting for logs to get applied so that reports are not “factual” is not something I am still gung-ho about.

Update (2014-22nd Aug) – DocumentDB – looks like in local DC it is highly available. Assuming cross DC DR is on radar. DocumentDB shares available consistency levels clearly.

  7. Share clearly how monitoring is done for the infrastructure in either the hosted or host-your-own case. Share a template for "monitor these always" and "take these actions" – a sort of literal rulebook which again makes adoption easier.

Update (2014-22nd Aug) – DocumentDB provides oob monitoring, need to see the template or the 2 things to monitor – I am guessing latency for operation in one and size is another. I need to think through the scaleout unit. I am sure more people push – we will be in better place.

  8. Share how data at rest and data in transport can be secured and audited in a simple fashion. For the last piece – even if actions are merely tracked – we will have a simple life.

Update (2014-22nd Aug) – DocumentDB – looks like admin/user permissions are separate. Data storage is still end developer responsibility.

  9. Share a simple guide for operations and day-to-day maintenance – this will be a life saver in terms of the x things to look out for, doing backups, doing checks. This is how to do HA, DR checks and performance-issue drilldown – normally part of the datahead's responsibility. Do we look out for unbalanced usage of the environment? Is there some resource which is getting squeezed? What should we do in those cases?

Update (2014-22nd Aug) – DocumentDB – looks like cases when you need older data because user deleted something inadvertently is something user can push for.

Points 1-4 make adoption easier and the latter ones help in continued use.


The other “requirements” of the managed datastores in cloud

We (me and @Vinod, author of extremeexperts) have supported migration to managed SQL Azure stores for quite some time. Customers like the ease of manageability, the availability and the decent performance.

There is another class of customers who keep getting pushed to "consolidate" databases and manage them for SLAs (DR/HA, backups that can go back in time x, performance). These databases are not in TBs but range from a few GBs to 100s of GBs.

1. There is a need for synchronization with on-premise databases and, gasp, sometimes it needs to be bidirectional.

2. There is a need to meet security SLAs by providing auditing views and encryption.

The promise of cloud, where it enables ease of management and availability, also needs to cover these scenarios. Hopefully we will get these in the future.


Nginx on Azure

Nginx works on Azure, absolutely no issues. It has very vast capabilities; I came to know of a few of them only when a customer requested that discussion.

1. Ability to control request processing – the customer wanted to throttle the number of requests coming from a particular IP address. This was easily done with the limit_req module directives. They allow an easy definition of the throttling behaviour and of what to do when limits are reached or crossed. Logging is done for these kinds of requests, and it is possible to send a specific HTTP error message (503 is enough). It also enables storing the state of the current excess requests. Another learning was to use $binary_remote_addr to help pack a little bit more into the zone – though it does make it harder to decipher in a simple way. So in the http block:

limit_req_zone $binary_remote_addr zone=searchz:10m rate=5r/s;

followed by location (end points which need this – login/search)

location = /search.html { limit_req zone=searchz nodelay; }

This protects very nicely against HTTP-level issues but does not protect against ping floods and other ways people can DDoS your application. That is best prevented/controlled in some kind of appliance (hardware) or at least iptables, though that is a different subject altogether. There is another directive worth mentioning.

2. Splitting clients for testing – this too is very easily done in the configuration with the split_clients directive. It can also be used to set specific querystring parameters very easily.

Yes, there are dedicated services/apps to achieve the same functionality – but it is wonderful to learn every day. Customers/partners are king and, honestly, great teachers.


Data Ingestion and Store stories

In about the last 6 months we have had the good fortune to understand/implement 6 solutions for customers who need fast ingestion and then some kind of analytics on top. This is a gist of those interactions: what worked, what did not fly, and the workarounds.

These solutions pushed us to explore things not available out of the box on the platform. We were exposed to:
– 3 customers designing monitor/analyze/predict solutions. They had existing in-house platforms, but those require local storage, and changes involve changing the software/hardware. 2 of them did not even have "automated monitoring" – a person would go and "note down" the reading on paper or a smartphone web app, and this would then be aggregated and stored in a central location.
– All of them wanted to move to public cloud, except one who wanted something they could deploy on-premise too.

Domains
– Electricity
– Pharma manufacturing
– Healthcare
– Chemical/heavy metal manufacturing

Data sources
– Sensors
– linux/windows embedded devices collecting aggregating floor/section/machine wise data
– Humans entering data

Latency
– almost everything except healthcare varied from 10s of minutes to hours.

Data size
– Since data could get buffered/batched/massaged depending on situation. Never more than an MB.
– Few Hundred Kbs

In-order/one-time delivery guarantees?
– Very pragmatic customers – they were OK to define an error rate rather than insisting on specifics.

Not even one wanted direct "sending" of data to the "store". They wanted local/web intermediate processing. This is why the internet-of-things model, where protocols are rigid and stores fixed, was a surprise for us all.
How to ingest fast
– what could be the front end
– does it make sense to have intermediate queue

How to scale the store 
– always capture- key condition

How to query with low latency
– search/lookup specific items – for logs/keywords/facet around them
– aggregates/trends
– detailed raw reports
– help in “outage”/”demand” – constant across electricity/manufacturing
– definition of outage/demand change

What works as a store
Cassandra
– if you think of the read queries beforehand, as they dominate the design (CQL or otherwise)
*** all the facet kind of stuff – which is sort of a group-by with no relevancy – is dependent on how the data is stored.
– scales – scales and scales
– Reads are pretty good, and many of the "aggregates" which do not need up-to-the-last-millisecond/second resolution can be done by simply running jobs which store these "more latency" items in another store – k/v or relational – generally cached aggressively (and mostly flushed out after x entries to another store).

– Push out data for other kinds of analysis to your favourite store – HDFS – and absorb it into other places.
The challenge is monitoring (the infrastructure vs the running of Cassandra and the impact of its parameters) and skillset upgradation.

– At times customers have split the store – numeric/other data in Cassandra – and pushed unstructured data (stack traces/messages/logs) out to a Lucene derivative – Solr/ElasticSearch. The challenge has been "consistency" at a given time, but it generally works.

How to ingest/broker
– WebAPI front ends pushing data into a broker (RabbitMQ/MSMQ/Kafka) – mostly based on experience and comfort factor
** To try – Akka/Orleans + Storm for near-real-time analytics
** Only one brave soul is still doing Kafka + Storm – painful to manage/monitor

We need better "monitoring across the stack" tools.

Multi-tenancy is another issue which blows up due to SKU differentiation, where sometimes data can be shared but updates/patching become an issue.
Data movement in Azure becomes a bigger issue, and we have implemented it as mentioned here.

 

 


ElasticSearch on Azure – sure

Why

1. Get the event logs, error traces and exceptions in one location and enable powerful search which can scale out seamlessly. Ideally one could/should use logstash (the poor man's *plunk alternative).
2. Create a search front end for your application – for frequently looked-up items, cached items, or just a regular search-based system like what you would do for a catalog of items or issues (customer pain points), or, gulp, even as the primary data store for certain kinds of applications.

We have used and proposed Solr earlier – lately ElasticSearch's monitoring and the simplicity of its scale-out/availability are what have made us push this Lucene-based alternative more for customers.

When you would not use this kind of search service
If there is a hosted native search service which offers cheaper storage and better query times (based on a faster backend), or you are ready to pay the $ for a given throughput and storage.

How

sudo apt-get install openjdk-7-jdk
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.7.tar.gz
tar -xzf elasticsearch-0.90.7.tar.gz

./bin/elasticsearch -f (from the extracted elasticsearch-0.90.7 directory – and you can start dumping/querying the data) or put it in init.d

Monitor
ElasticHq – http://www.elastichq.org/support_plugin.html (available as hosted version too)
Kopf – https://github.com/lmenezes/elasticsearch-kopf
BigDesk – https://github.com/lukas-vlcek/bigdesk/ (more comprehensive imho)
OOB stats – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
Paid – http://sematext.com/spm/elasticsearch-performance-monitoring/

Learn – http://www.elasticsearch.org/videos/bbuzz2013-getting-down-and-dirty-with-elasticsearch/
Tips
Use Oracle JDK
Use the G1 GC (http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All)

Kibana/logstash too work without any issues.

Caveat – Azure does not support multicast, so discovery has to be unicast-based and pretty much hard-coded into the configuration file.


Spark on Azure using Docker – works

For the past few weeks I have been trying out Docker and found it useful for conveying the need for lightweight containers for dev/test.

It works like git and presents nice extensions on/around LXC. LXC has an extremely simple CLI to use and run with (as a user I remember being excited by Solaris containers a long time ago). Docker makes it much more powerful by adding versioning and reusability, IMHO.

I used it on Azure without issues. Spark's Docker-friendly release, which Andre mentioned, had been on my to-do list for a long time. The intent was to run the perf benchmark using the memetracker dataset – I will get it onto a full-fledged cluster one of these days.

Update – 2014-10th-June – MSOpen technologies announces support for docker natively on Azure – http://msopentech.com/blog/2014/06/09/docker-on-microsoft-azure/

Everything mentioned at the repo worked without issues – I just cloned the Docker scripts directly. The only change was for the cloning; I used the following statement:

git clone http://github.com/amplab/docker-scripts.git

The challenge with any new data system is to learn import/export of data, easy querying, monitoring, and finding out root causes. That will require some work in a real project – somewhere down the road. Got distracted by the use of Go in Docker in between.
 

5 years and going – Adopting Cloud Azure

It has been nearly 5 years since we started working on the Azure cloud platform. We are a small team, and finally I thought we have worked enough to call ourselves a little knowledgeable about the various platform parts – ours and others', what works and what does not, and how to make the move. Microsoft Azure has evolved over the years to support these requirements. There is still a long way to go… but the path is right. We help customers prioritize the workloads they can move as part of a comprehensive briefing on adopting the cloud platform as private, public or hybrid.

I have worked with customers who want to move everything to the cloud hoping it will mask challenges on-premise (scale, performance, monitoring), and with pragmatic folks who pick and choose workloads, like email first (established enterprise challenges are myriad – no email sending/only receiving). There are many folks who want to take advantage of local infrastructure and move forward. Some folks just pick the simplest and easiest – backup to the cloud – to get their feet wet. Others push dev/test to the cloud to minimize local requirements. The path varies for enterprises and ISVs, and we have a lot to offer.

When migrating applications to cloud platform, simple evaluation measures are

  1. Legal issues (any issue with putting data on cloud, encryption required- implies key management)
  2. Performance requirements and verified by tests
    1. IO/Network/CPU – end user workload – have a plan for these tests to ensure end user and perf testing is done.
  3. Is the application end of life or the tools used not supported anymore – like VB6/old client server power builder applications or dos/QT based application, they will provide lot more roadblocks than progress.
  4. Advantages one wants to exploit of cloud – scale out, elasticity – Is that possible with existing applications. Should they be modified?
  5. Availability requirements  (What are the availability requirements and what you can live with) – one data center vs DR to others – data movement/deployment – warm/cold. 
  6. Is system in this form able to meet SLA. Otherwise modify/decouple to achieve the availability. Simplest requirement of handling throttling, failures of underlying infra requires change
  7. Requirement with Integration with On premise applications (authentication/Antivirus), pushing/pulling of data.
  8. Operational stuff – how will you do ALM: deploy a set of software (OS + dependencies + app + topology), push patches/updates, backup, and DR for these applications.
  9. Monitoring requires a culture change, and we have seen developers jolted out of lethargy to adopt/learn new tools and work with admin folks to provide SLAs for performance and availability using canaries, graceful degradation and failover frameworks. Existing on-premise tools are becoming better, but cloud-based ones like New Relic/Boundary for backend applications and Gomez/Keynote for reachability/availability, correlated with frontend tools like Errorception, are easier to adopt.

Non Functional stuff while doing migration

  1. Chalk out responsibilities of the involved people (application owner, Services provider, vendor)
    1. Steps/Goals of each milestone (performance/availability/workarounds)
    2. Chalk out support steps for each application (escalation steps )
      1. Placement of cloud expert locally to handhold/support + dedicated support (concepts like storage/availability group or zones from other cloud platform)
    3. Support of ISV apps by guiding them over time to the platform
    4. Chalk out how migration monitoring is done  (easiest – daily/weekly vs fire meetings and owners from stakeholders)
    5. Production monitoring for capacity/outage/testing of failover (tool based testing – simianarmy of your own)

I have seen customers surprised when they see the amount of work they sometimes need to do to adopt cloud (availability/monitoring/performance – shared infrastructure). Since we get customers who have tried/used other cloud platforms, it is always fun and encouraging to see something like what Google Compute announced (transparent VM migration) or what AWS added (read replicas) – life in general is becoming easier for a developer. Although it means adapting/changing to a new platform, it is a sweet journey. With frameworks from Netflix/Twitter/LinkedIn and other folks, one can literally hit the ground running. This really is the best time to be a developer.

Normally our help is taken by field-facing teams serving our customers – but do feel free to reach out to mtcbang at microsoft dot com for help in adopting the cloud platform (public/private/hybrid) or specific workloads like database, integration, SharePoint, security (device/application/infra), adopting BYOD/managing devices, or Windows 8/Phone.

There is another small announcement – we are looking for a person, preferably a woman, stationed out of Mumbai as part of our team (http://www.microsoft.com/en-us/mtc/default.aspx). The requirements are very simple – listen to the customer, have empathy, understand the pain and resolve it using the right tools and technologies.

  • Could mean using/suggesting architecture change – decouple/monitor/canary/shard the db at app or db/use reactive pattern  or help create greenfield solution from ground up (20-25% of job)
  • looking at the code at javascript, c#, java – choose your poison and suggesting changes (all associated tools/issues right from ide – webstorm vs x to express vs y,idiomatic way – know your monitoring of running modern apps – resources like memory/io/cpu/nw – identifying patterns of problems)
  • Look at performance/maintenance/availability issues around sql server/mysql/oracle (yeah we are ok with person having postgres/sybase/oracle experience as techniques remain same – tool name and methodology changes a little)
  • Ability to pick up redis/cassandra/mongodb/hbase – yeah, we get many of those too.
  • At times be generalist and use capabilities of sharepoint and showcase how it helps enhance productivity using its arsenal. Yes generic sharepoint knowledge and associated tools information is good.
  • Have an Idea what BI means – facts,dimensions, – what are the tools which one can use and newer ones (hadoop lake + aggregations via jobs ) + near real time stuff (streaminsight ,storm and friends).
  • Have generic idea about messaging platforms (connect to source/destinations via transforms, add routing/orchestration ) – it is okay not to know this particular piece – but ideally exposed once or twice in the field
  • Have basics in place – which data structure I can use on mobile device constrained by memory/storage for storage/query or what makes pragmatic sense – connecting two systems – this is very much required than a “particular way of product-feature” bent as we need to think of different ways and brainstorm the ideas with customers
  • Open mind to pick up things right from phantomjs, d3.js or sci-kit to optiq and adopt/use them

Again, if you know somebody or you are interested – please reach out @ [myid] at microsoft dot com – [govindk]. We do a whole lot of other things like http://dreamadream.org/2013/02/funday-at-the-microsoft-technology-center-mtc-bangalore/ and generally do not travel – we are very flat and fun (shhh – we are known to do Monday movies and beer/food at new joints). Basically a no-bs, do-your-thing workplace. You will have some of the most fun people, from Vinod – http://blogs.extremeexperts.com/about/ (book writer, well-known speaker and great friend) to Anand – https://twitter.com/tweetmsanand (our private cloud, infrastructure and security herder).
