Azure Resource Manager – a journey to understand basics

As of BUILD 2015 we have declarative and procedural deployment control over individual assets or groups of assets for Network and Compute resources. Back in October/November 2014 the excitement centered on the ability to download and apply gallery templates or work with website assets, but this release of the ARM API (2015) has brought in a lot of new capabilities.

This enables sophisticated scenarios. For example, Corey Sanders has shown how to deploy containers on a virtual machine with the famous three together (nginx, redis and mongo). Or look at the Chef integration, Network Security Groups, or RBAC (role-based access control).

You can interact with assets via the Portal UX, but it is easier to understand what is happening by running the PowerShell commands in Debug/Verbose mode.

[image: ARM-Build2015]

What are Resource Groups
Usually applications deployed in Microsoft Azure are composed of a combination of different cloud assets (e.g., VMs, storage accounts, a SQL database, a virtual network, etc.). A resource group lets you provision and manage such a combination as one unit in a location, with the individual resources supplied by resource providers.

Ref – https://msdn.microsoft.com/en-us/library/azure/dn948464.aspx

Let us get started. Verify in your Azure PowerShell that

$PSVersionTable 

reports PowerShell 3.0 or 4.0+, and that

 Get-Module -ListAvailable AzureResourceManager 

reports at least version 0.9. Otherwise you need to get the latest Azure PowerShell.

Let us start with a simple command to deploy a resource group

Switch-AzureMode AzureResourceManager

# do not execute the cmdlets themselves yet; run get-help on them to get an idea of what they require
get-help New-AzureResourceGroup -detailed
get-help New-AzureResourceGroupDeployment -detailed

The command New-AzureResourceGroup implies we need a JSON template file. How simple can we make one? If you look at the GitHub profile hosting all the examples, they are pretty daunting if you are starting out for the first time. So let us start from the basics.

What are Azure Resource Manager Templates?
Azure Resource Manager templates allow us to deploy and manage these different resources together by using a JSON description of the resources and their associated configuration and deployment parameters. After creating the JSON-based resource template, you can pass it to the PowerShell command, which executes its directives and ensures that the resources defined within it are deployed in Azure.

Create a simple file helloarm.json; this is the template file describing resources, their properties, and ways to populate them via parameters. We will go on a simple journey to accomplish that. By the end of the session we will have a single parameter file which pushes data into the template file, which in turn is used by ARM to create resources.

helloarm.json – we will explore the possible content of each collection or array as required later.

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {},
    "variables": {},
    "resources": [],
    "outputs": {
        "result": {
            "value": "Hello World",
            "type": "string"
        }
    }
}

Let us see how we can test it out. Execute the following:

New-AzureResourceGroup -Name GSK-ARM-HelloARM -Location "South Central US" -Tag @{Name="GSK-ARM-RG";Value="TEST"} 

this will result in an empty resource group, which can later be populated using the New-AzureResource or New-AzureResourceGroupDeployment cmdlets to add resources and deployments to it.
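
As an aside, New-AzureResource is the imperative, one-resource-at-a-time route. A minimal sketch, assuming the 0.9-era parameter set (-PropertyObject in particular) and the preview storage API version – verify against get-help before running:

# hypothetical one-off resource creation without a template
New-AzureResource -Name "gskimperative01" -ResourceGroupName "GSK-ARM-HelloARM" -ResourceType "Microsoft.Storage/storageAccounts" -Location "South Central US" -ApiVersion "2015-05-01-preview" -PropertyObject @{accountType="Standard_LRS"}

We will stick to the template-driven deployment path in this post.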

What about the parameters of the PowerShell command?

  • Name is the name of the resource group
  • Location is the data center we want to deploy to.
  • Tag is a new feature which allows us to tag our resources with company-approved labels to classify assets.

[image: bare-min-rg]

    For now let us remove the newly created RG by

     Remove-AzureResourceGroup -Name "GSK-ARM-HelloARM" 

    – this command does not return any output by default. Use the -PassThru parameter to get that information, and to suppress the confirmation questions use the -Force parameter. At present you can't immediately see the Operational log populated with these actions in your portal. We will come to that in a minute.
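
    For instance, a non-interactive cleanup using the two parameters just mentioned:

     Remove-AzureResourceGroup -Name "GSK-ARM-HelloARM" -Force -PassThru 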

    Now let us provision using our empty template

     
    New-AzureResourceGroup –Name "GSK-ARM-HelloARM" –Location "South Central US" -Tag @{Name="GSK-ARM-RG";Value="TEST"}
    ## This creates a RG for us to add resources to
    New-AzureResourceGroupDeployment -Name "GSK-ARM-DEP-HelloARM" -ResourceGroupName "GSK-ARM-HelloARM" -TemplateFile .\helloarm.json
    
    DeploymentName    : GSK-ARM-DEP-HelloARM
    ResourceGroupName : GSK-ARM-HelloARM
    ProvisioningState : Succeeded
    Timestamp         : 5/3/2015 7:58:52 PM
    Mode              : Incremental
    TemplateLink      :
    Parameters        :
    Outputs           :
                        Name             Type                       Value
                        ===============  =========================  ==========
                        result           String                     Hello World
    

    Can you execute the 1st command without -Location or -Name? Or do you wish Test-AzureResourceGroup existed?
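
    As an aside, there is no Test-AzureResourceGroup, but 0.9-era builds of the module should ship a template validation cmdlet – an assumption to verify against your version with get-help:

     Test-AzureResourceGroupTemplate -ResourceGroupName "GSK-ARM-HelloARM" -TemplateFile .\helloarm.json 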

    You can see the resource group and associated tags

     
    Get-AzureResourceGroup -Name GSK-ARM-HelloARM
    
    ResourceGroupName : GSK-ARM-HelloARM
    Location          : southcentralus
    ProvisioningState : Succeeded
    Tags              :
                        Name        Value
                        ==========  =====
                        GSK-ARM-RG  TEST
    
    Permissions       :
                        Actions  NotActions
                        =======  ==========
                        *
    
    ResourceId        : /subscriptions/XXXXXXXXXXXXXXXXX/resourceGroups/GSK-ARM-HelloARM
    
    

    Let us get the tags

     
    (Get-AzureResourceGroup -Name GSK-ARM-HelloARM).Tags 
    
    Name                           Value
    ----                           -----
    Value                          TEST
    Name                           GSK-ARM-RG
    

    What happened to the deployment?

     
    Get-AzureResourceGroupLog -ResourceGroup GSK-ARM-HelloARM # the docs say to provide -Name – but that is wrong.
    
     
    Get-AzureResourceGroupDeployment -ResourceGroupName GSK-ARM-HelloARM
    
    DeploymentName    : GSK-ARM-DEP-HelloARM
    ResourceGroupName : GSK-ARM-HelloARM
    ProvisioningState : Succeeded
    Timestamp         : 5/3/2015 4:37:14 AM
    Mode              : Incremental
    TemplateLink      :
    Parameters        :
    Outputs           :
                        Name             Type                       Value
                        ===============  =========================  ==========
                        result           String                     Hello World
    

    Aha! There you see the output information 🙂

     
    Get-AzureResourceLog -ResourceId /subscriptions/XXXXXXXXXXX/resourceGroups/GSK-ARM-HelloARM 
    

    does not return anything if you run this command after an hour of inactivity on that resource. You can add a -StartTime 2015-05-01T00:30 parameter, varying the date, to get an idea of everything that has happened on that particular resource. In this case you should see a write operation like

     
    EventSource : Microsoft.Resources
    OperationName : Microsoft.Resources/subscriptions/resourcegroups/write
    

    Ok, can we just enable tags for a subscription via the template file, as is possible through this REST API https://msdn.microsoft.com/en-us/library/azure/dn848364.aspx ? Nope, not that I am able to find right now.

    So let us create a storage account with a tag in a new resource group.

    Ok, so what are these resources we keep talking about?

     
    Get-AzureResource | Group-Object ResourceType | Sort-Object Count -Descending | Select-Object Name 
    

    The above command will mostly list the following resource types in its top 10 if you have used Azure in the classic way for some time (it will not list all possible resource types, only the ones in use):
    Microsoft.ClassicStorage/storageAccounts
    Microsoft.ClassicCompute/domainNames
    Microsoft.ClassicCompute/virtualMachines
    Microsoft.Web/sites
    Microsoft.ClassicNetwork/virtualNetworks
    Microsoft.Sql/servers/databases

    Let us explore creating a storage account resource in a resource group

     
    $RGName= "GSK-RG-STORAGE-TEST"
    $dName = "GSK-RG-DEP-STORAGE-TEST"
    $lName ="East US"
    $folderLocation = "YourDirectory"
    $templateFile= $folderLocation + "\azuredeploy.json"
    

    Create the following .json file and save it in the folderLocation mentioned above.

    azuredeploy.json

     
    {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {},
        "variables": {},
        "resources": [{
            "type": "Microsoft.Storage/storageAccounts",
            "name": "gskteststoragesc",
            "location": "East US",
            "apiVersion": "2015-05-01-preview",
            "tags": {
                "dept": "test"
            },
            "properties": {
                "accountType": "Standard_LRS"
            }
        }],
        "outputs": {
            "result": {
                "value": "Hello World Tags & storage",
                "type": "string"
            }
        }
    }
    

    It is worth looking closely at the azuredeploy.json file. The outputs directive is almost the same; its value can be retrieved via the Get-AzureResourceGroupDeployment command.

  • Type represents the resource provider and type of the storage account (v2) – the new resource, compared to Microsoft.ClassicStorage/storageAccounts.
  • Properties belong to the asset/resource – in this case, what kind of storage we want: local replica or geo-DR, etc. We have hardcoded it for now.
  • location represents the datacenter location; in this case we have hardcoded it to “East US”.
  • Tags represent the tags we want to apply.
  • apiVersion is used by ARM to ensure the latest compliant bits are used to provision the resource at the right location.

Let us deploy this resource

 
New-AzureResourceGroup –Name $RGName –Location $lName -Tag @{Name="GSK-ARM-RG";Value="TEST"}
## This creates a RG for us to add resources to
 
New-AzureResourceGroupDeployment -Name $dname -ResourceGroupName $RGName -TemplateFile $templateFile
## $dName provides the deployment label for the resources specified in the template file to be deployed to $RGName. 

Your new portal might show a visualization like the one below

[image: success-storage-test]

 
Get-AzureResourceLog -ResourceId /subscriptions/#####################/resourceGroups/GSK-RG-STORAGE-TEST -StartTime 2015-05-02T00:30 #modify this for your time and subscription

Now let us modify this template to accept parameters from another file. Make the following changes to azuredeploy.json and save it as azuredeploy2.json. Notice everything else remains the same.

 
{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {

        "newStorageAccountName": {
            "type": "string",
            "metadata": {
                "description": "Unique DNS Name for the Storage Account where the Virtual Machine's disks will be placed."
            }
        },
        "location": {
            "type": "string",
            "defaultValue": "West US",
            "allowedValues": [
                "West US",
                "East US",
                "West Europe",
                "East Asia",
                "Southeast Asia"
            ],
            "metadata": {
                "description": "Location of resources"
            }

        }
    },
    "variables": {
        "storageAccountType": "Standard_LRS"
    },
    "resources": [{
        "type": "Microsoft.Storage/storageAccounts",
        "name": "[parameters('newStorageAccountName')]",
        "location": "[parameters('location')]",
        "apiVersion": "2015-05-01-preview",
        "tags": {
            "dept": "test"
        },
        "properties": {
            "accountType": "[variables('storageAccountType')]"
        }
    }],
    "outputs": {
        "result": {
            "value": "Hello World Tags & storage",
            "type": "string"
        }
    }
}

What has changed? Focus on the resources section first.

  • Look at the name – it will be picked from parameters and is of type string.
  • Location is interesting in the sense that it has a default value of “West US”; you will see the impact of this later on in case you do not provide a location.
  • accountType on the other hand is picked from the variables section and has the single value Standard_LRS – local replication.
  • Then we have allowedValues in the parameters section for location – this ensures only these values are accepted when the New-AzureResourceGroupDeployment cmdlet executes using this template file. It takes away the pain of validating the inputs.

Let us execute the earlier command to provision this deployment, after setting $templateFile to point to the new template

$templateFile = $folderLocation + "\azuredeploy2.json"
New-AzureResourceGroupDeployment -Name $dname -ResourceGroupName $RGName -TemplateFile $templateFile

You will get prompted for the parameter newStorageAccountName – in our case we gave it gskrgstparam and got the following output


DeploymentName    : GSK-RG-DEP-STORAGE-TEST
ResourceGroupName : GSK-RG-STORAGE-TEST
ProvisioningState : Succeeded
Timestamp         : 5/3/2015 4:36:15 PM
Mode              : Incremental
TemplateLink      :
Parameters        :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    newStorageAccountName  String                     gskrgstparam
                    location         String                     West US

Outputs           :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    result           String                     Hello World Tags & storage

The first thing to notice: the new Parameters section gets populated, and the location parameter does not prompt – it picks up the default value “West US”. If you have access to the new portal, you will see something similar for the resource.

[image: param-based-prompt-storage-acct]
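
As an aside, you can also supply template parameters inline on the command line – a sketch assuming the cmdlet surfaces template parameters as dynamic PowerShell parameters (verify on your module version):

New-AzureResourceGroupDeployment -Name $dName -ResourceGroupName $RGName -TemplateFile $templateFile -newStorageAccountName "gskrgstparam" -location "East US"

This also suppresses the interactive prompt.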

Let us see if we can automate this command-line interaction altogether.

Create a new file azure2deploy.parameters.json with the following content

{
    "newStorageAccountName": {
        "value": "gskrgparamfiletest"
    },

    "location": {
        "value": "West US"
    }
}

Create a new variable to hold the template parameters file location

$templateParam = $folderLocation + "\azure2deploy.parameters.json"

Now execute a similar provisioning command, this time with the template parameters file passed to it. You will not get prompted for the name of the storage account, as you have already passed it via the parameters file on the command line.

New-AzureResourceGroupDeployment -Name $dname -ResourceGroupName $RGName -TemplateFile $templateFile -TemplateParameterFile $templateParam

DeploymentName    : GSK-RG-DEP-STORAGE-TEST
ResourceGroupName : GSK-RG-STORAGE-TEST
ProvisioningState : Succeeded
Timestamp         : 5/3/2015 4:59:36 PM
Mode              : Incremental
TemplateLink      :
Parameters        :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    newStorageAccountName  String                     gskrgparamfiletest
                    location         String                     West US

Outputs           :
                    Name             Type                       Value
                    ===============  =========================  ==========
                    result           String                     Hello World Tags & storage

[image: param-based-fileinput-storage-acct]
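
One more option before wrapping up: parameters can also be passed as a hashtable, a sketch assuming the cmdlet's -TemplateParameterObject parameter (again, confirm with get-help):

New-AzureResourceGroupDeployment -Name $dName -ResourceGroupName $RGName -TemplateFile $templateFile -TemplateParameterObject @{newStorageAccountName="gskrgparamfiletest"; location="West US"}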
To wrap up this piece:

Get-AzureResourceGroup -Name $RGName
ResourceGroupName : GSK-RG-STORAGE-TEST
Location          : eastus
ProvisioningState : Succeeded
Tags              :
                    Name        Value
                    ==========  =====
                    GSK-ARM-RG  TEST

Permissions       :
                    Actions  NotActions
                    =======  ==========
                    *

Resources         :
                    Name                Type                               Location
                    ==================  =================================  ========
                    gskrgparamfiletest  Microsoft.Storage/storageAccounts  westus
                    gskrgstparam        Microsoft.Storage/storageAccounts  westus
                    gskteststoragesc    Microsoft.Storage/storageAccounts  eastus

ResourceId        : /subscriptions/XXXXXXXXXXXXXXX/resourceGroups/GSK-RG-STORAGE-TEST


Ok, that is a lot of stuff for one session. We started from a bare-minimum resource group and graduated to using parameters, variables, and finally a parameters file fed to the provisioning engine. We found out how to get the log for a resource or remove it. The REST API allows you to check status after the POST has been done.
To recap the .json file and its different sections: outputs to dump messages, and resources to hold the resources to provision, which in turn pick data from the variables and parameters sections.

Other questions could be

    How do you modify, say, a deployment – add a disk to a VM? That is a different process for now.
    What is the relation between Chef, Puppet, Ansible and ARM? It is symbiotic, as Chef and Puppet create infrastructure using ARM and provision software installs/checks not possible through simple shell/install files. Look at

KundanaP's and John Gossman's sample

    https://github.com/Azure/azure-quickstart-templates/tree/master/chef-json-parameters-ubuntu-vm
    What has ARM got to do with sharing and governance? You can create a common azuredeploy.json and folks can create different azuredeploy.parameters.json files to create their own environments. So you control the deployment.
    Where are more samples? https://github.com/Azure/azure-quickstart-templates – you can even contribute more by following the guidelines there.

What do I miss? The ability to put in a monitoring hook – but looking at the providers from Operational Insights and Application Insights, I can guess they are coming.

For fun, next let us create a VM which provisions Aerospike on a DS-series Ubuntu machine. We will need to execute shell commands via the CustomScript extension for Linux and learn about dependencies so that we do not create individual assets.

Ref – http://blog.davidebbo.com/, Mahesh T and Tabrez.
REST API – https://msdn.microsoft.com/en-us/library/azure/dn790568.aspx


Schedulers – some notes

Networking: CSFQ, WFQ, DRR, WF2Q, SFQ, Proportional Fair, I-CSDPS

OS: Linux CFS, proportional sharing, lottery

Datacenters/GP clusters: Hadoop ecosystem (Presto-Cloudera-YARN-Spark) fair scheduler, capacity scheduler, QuincyLSF, Condor (ignore MPI folks)
Criteria
Scalability (response time, number of machines)
Flexibility (heterogeneous mix of jobs)
Usability/grokability
Isolation – fault isolation, resource isolation
Utilization (achieve high cluster resource utilization, e.g., CPU utilization, memory utilization) – balance the hosts – meet the constraints of hosts

Service or Batch Jobs?

** who|process dominates, which resource to give priority to
** how to catch cheaters?
** How do you pre-empt
** In case of multiple schedulers – make them aware of each other – shared state to avoid one scheduler/workload dominating

Types

Global Scheduler needs to have state
Policies + resource availability
How much do we know about job/tasks
Job requirements (throughput, response time? ,availability)
Job Plan (Dag of tasks or what?,  I/O needs, User affinity )
Estimates of duration? Input size? TX?
Single vs Multiple scheduler agents + cluster state replicated into the nodes
Monolithic: Platform LSF, Maui, Moab (HPC community)
Multi-step – Partition resources or dynamically allocate them (Mesos NSDI 2011) – can reject the offer
*** How long job has to wait
*** Fair sharing ()
Partition and resolve dependencies (Omega EuroSys 2013)
Issue – Upgrade/patching of scheduler or substrate

References

BarrelFish http://www.barrelfish.org/peter-phd-multicore-resource-mgmt.pdf

Tetris – http://research.microsoft.com/en-us/UM/redmond/projects/tetris/index.html (paper: tetris_sigcomm14.pdf)

Corona – https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
(fair-share scheduling)
“This scheduler is able to provide better fairness guarantees because it has access to the full snapshot of the cluster and jobs when making scheduling decisions. It also provides better support for multi-tenant usage by providing the ability to group the scheduler pools into pool groups. A pool group can be assigned to a team that can then in turn manage the pools within its pool group. The pool group concept gives every team fine-grained control over their assigned resource allocation.”

Docker (Fleet/Citadel/Sampi/Mesos)
sampi – https://github.com/mshamber/sampi/blob/master/example/scheduler.go
citadel – https://github.com/citadel/citadel/blob/master/scheduler/resource_manager.go

Docker Swarm
https://github.com/docker/swarm/tree/master/scheduler/strategy (CPU|RAM vs random !)
https://github.com/docker/swarm/tree/master/scheduler/filter (what are the constraints – same storage, same area, some tags?)

Mesos (http://mesos.berkeley.edu/mesos_tech_report.pdf)
https://github.com/apache/mesos/blob/master/include/mesos/scheduler.hpp
containerization – https://github.com/apache/mesos/blob/master/src/slave/containerizer/linux_launcher.cpp
Filtering (nw |fs – ) https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp
**** MesosContainerizerProcess::isolate
(strings::contains(isolation, "cgroups") ||
strings::contains(isolation, "network/port_mapping") ||
strings::contains(isolation, "filesystem/shared") ||
strings::contains(isolation, "namespaces"))
**** process::reap
**** executorEnvironment (
MesosContainerizerProcess::_launch
MesosContainerizer::recover
Isolation –  https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp
(Fair scheduler dependent on the co-ordination)
* Mesos/YARN resource managers have a master-slave architecture. Is it me, or have they both adopted an SPMD MPI rank-x style of job control? Sort of push (Mesos) and pull (YARN) models.

Others
Slurm – http://slurm.schedmd.com/ seems to be deployed for very large clusters. http://slurm.schedmd.com/gang_scheduling.html – the interactivity is pretty cool
Torque – http://www.adaptivecomputing.com/products/open-source/torque/ (http://www.pbsworks.com/)

Profiling – http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36575.pdf

Yarn – Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of SoCC, 2013.
Condor – http://research.cs.wisc.edu/htcondor/
Quincy – http://research.microsoft.com/apps/pubs/default.aspx?id=81516
Vector Bin Packing – http://research.microsoft.com/apps/pubs/default.aspx?id=147927
Proactive-Inria – http://proactive.activeeon.com/
Omega – http://research.google.com/pubs/pub41684.html
Clustera – http://research.cs.wisc.edu/clustera/
Dryad – http://research.microsoft.com/en-us/projects/dryad/ (paper: eurosys07.pdf)

Reservation-based scheduling – http://research.microsoft.com/pubs/204122/rayon.pdf
X-Flex – Alternative to DRF – http://viswa.engin.umich.edu/wp-content/uploads/sites/169/2014/08/X-FlexMW14.pdf
Stoica – http://www.cs.cmu.edu/~istoica/csfq/
WF2Q – http://spazioinwind.libero.it/andreozzi/thesis/resources/schedulers/WF2Q.pdf
OS
VTRR – https://www.usenix.org/legacy/event/usenix01/full_papers/nieh/nieh.pdf (O(1) in less than 100 lines of code? )
– Order the clients in the run queue from largest to smallest share
Lottery – https://www.usenix.org/legacy/publications/library/proceedings/osdi/full_papers/waldspurger.pdf (randomized – based on a draw of tickets for the client)
Pegasus – http://www.utdallas.edu/~cxl137330/courses/spring14/AdvRTS/protected/slides/17.pdf
WFQ – Clients are ordered in a queue sorted from smallest to largest Virtual finish time
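For reference, the standard WFQ bookkeeping (the usual textbook formulation, not taken from any one paper linked above): packet k of client i, arriving at virtual time V(a_i^k) with length L_i^k and client weight w_i, gets virtual finish time F_i^k = max(F_i^{k-1}, V(a_i^k)) + L_i^k / w_i, and the scheduler always serves the packet with the smallest F.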

Encyclopedia – http://www.cs.huji.ac.il/~feit/papers/SchedSurvey97TR.pdf
Time-stamped scheduler – comparison – http://www.cse.iitk.ac.in/users/bpssahoo/paper/19_A_Comparative_Analysis_of_Time_Stamped__.pdf

Virtualization
VMware – http://www.vmware.com/resources/techresources/10345 (gang or co-scheduling)
– related – http://www.cs.huji.ac.il/~feit/papers/FlexCosched03IPDPS.pdf
HyperV – http://blogs.msdn.com/b/virtual_pc_guy/archive/2011/02/14/hyper-v-cpu-scheduling-part-1.aspx
Xen – http://www.hpl.hp.com/personal/Lucy_Cherkasova/papers/per-3sched-xen.pdf

 


Scraper Breakers

The name of an asset does not have any sanctity. No, seriously.

Following are the file names of the missing-persons reports for the last 2 years in Karnataka.

Feb_2015_Missing_report.pdf
Jan_2015_Missing_report.pdf
Dec_2014_Missing_report.pdf
Nov_2014_Missing_report.pdf
Oct_2014_Missing_report.pdf
SEP_2014_Missing_report.pdf
Missing_Report_Aug_2014.pdf
July_Missing_Report_2014.pdf
http://www.ksp.gov.in/download/June_Missing_Report_2014.rar
May_Missing_Report_2014.pdf
April_Missing_Report_2014.pdf
Missing_Report_March_2014.pdf
Missing_Report_Feb_2014.pdf
Missing_Report_Jan_2014.pdf
Missing_Report_December_2013.pdf
http://www.ksp.gov.in/download/Old%20Missing%20Data%20of%20UDR%20&%20Missing%20GZT.rar
Missing_Report_November_2013.pdf
Missing_Report_October_2013.pdf
Missing_Report_September_2013.pdf
Missing_Report_August_2013.pdf
Missing_Report_July_2013.pdf
Missing_Report_June_2013.pdf
May_missing_13.pdf
April_Missing_2013.pdf
March_Missing_2013.pdf
They all hang off the URL http://www.ksp.gov.in/home/crime/udr.php with the structure in the pic. Note the names and the “links” to “missing”. Sadly, the actual file names mismatch. Big deal?

Absolutely not for a person who loves cleaning data – a sort of OCD. This is godsent. “One thing you had to do right”.

So what is broken – process or tool? There is a definite issue with the simplicity of the “naming convention” and with following it. Why do people forget it? Because they are evil? No – because our tools make it difficult for them to contextualize the work at hand and follow all the “implied” rules.

[image: Missing-report]

There is the PDF/DOC/XLS world which hides the data, and then there is data in HTML files. Absolutely priceless. Thanks to import.io, I can at least do these things in a jiffy to identify what I am getting out and the pattern.

Update – PDF extractors I use/try.

Excel and Word first – Excel for structured data and Word otherwise.

Apache PDFBox – Download page: http://pdfbox.apache.org/downloads.html

Tabula – Download page: http://tabula.nerdpower.org

PDF Extraction Toolkit – Download page: http://tamirhassan.com/pdfxtk.html

Poppler – Download page: http://poppler.freedesktop.org/

PDF2XML – Download page: http://sourceforge.net/projects/pdf2xml/

Xpdf – Download page: http://www.foolabs.com/xpdf/


Debt of evolving RPC mechanisms

The idea of pushing some data (state, or worse, the object itself) across the wire with interop across languages is one concept which has seen umpteen rebirths. I hope we do not have to invent or adopt any new RPC for some time.

Because I am done.

DCOM to Remoting
– MTS was the last decent app server which did not evolve into the app servers on the other side (of course, EJB and friends), where firms made a whole lot of money by providing layers to intercept, modify, and throttle the object/message passing. ChannelFactory/sinks and friends were at best a whole lot of method-call-to-message magic. So many sinks were written before good souls realized that it is a sin to write so many sinks…

Then we had the realization that we can all live together, and the madness around interop. Does not happen. Try looking at WS-Security and friends; it is still a nightmare to think about.

.NET had two separate paths in the scheme of things. Slowly they got integrated with IIS and its ’isms (provisioning, invocation, pipeline).

1. WCF (SOAP , WS * Services – TX, security using envelopes/headers)
– classic
– ria services (silverlight )
– data services (astoria) , OData precursor
– web HTTP

2. The IIS-hosted world evolved and was adopted much faster, indicating where the world was going and proving that the web server is the app server.
ASP.NET (HTTP verbs and resource representation over time)
– asmx
– mvc (yes folks used this to provide an api endpoint)
– web api (no tcp, no mq, no soap, hopefully savior for some time)
– web api data

Lately (over the last 5 years) we have seen a resurgence of which serialization is better, which RPC method can work across languages. Fortunately, this time folks are more pragmatic.

New serialization-cum-RPC-friendly layers
– Bond (yes, Microsoft's) – https://github.com/Microsoft/bond
– Avro (schema as JSON inside the header)
– Thrift (Facebook – just RPC)
– Protocol Buffers (Google origin – C++ layer over RPC)
– MessagePack (JSON in binary encoding)

The other side has had CORBA to Remoting, web services – JAX-RS, JAX-WS – and myriad REST frameworks. You just live with the poison of your choice, like Shiva… just be ready to replace it.

Nothing is right or wrong, but the amount of technical debt you build up is amazing. Having worked with customers and applications over time, I sometimes mull over the best solution which can evolve. Experience has taught me that, unfortunately, some of the choices stick around longer, and evolution is challenging to say the least.

I also hope IDEs do not obscure the workings behind a single click. It is the biggest disservice in the name of productivity, as a generation of developers put something together but have no idea of how these things work. The idea of doing F5-based projects to “quickly show” something without explaining what is happening underneath has created a heavy burden of debt. The sad part is that removing these “what is happening” issues is not a great use of time and energy. It should be simple, clear and not obfuscated to protect people from complexity. Definitely not where you have gladly sent 1000s of objects with arrays of data…

https://servicestack.net/text and http://www.newtonsoft.com/json are the best choices for really efficient JSON encoding if you can't suddenly move to MessagePack or others.
It is worth paying ServiceStack for the efficiency they bring compared to the default .NET XML/JSON serializers.

On a personal level, a pragmatic Web API with the right amount of marriage to “actions on resources” is what I push for when customers request design reviews. It may not pass all the “REST” tests, but it is much easier to evolve.


What do ISVs trying to bring their solutions to Cloud want?

Easy-to-understand billing model

Make it easy to reason about the billing model – simpler than what is exposed as “pay per use”. I need to use it every day. It should just work without surprises. Do not expose the “you looked at me – y$; you asked for that – z$”. Please provide a reliable API that I can utilize for creating SaaS applications.

Tell me about your maintenance cycles (please)

For end customers using a solution, downtime communication is essential. Ideally 24×7 operation is required, but we can craft a solution which delivers a minimum viable option at lower cost.

MultiTenancy

It means a DocumentDB/Aurora or Search should have the ability to create “tiers” for free/shared instances where I can club folks into my “freemium tier” without paying production amounts. As it is a very low-margin business, let us find ways to make it simpler. This is a little bit different from me creating a shard instance.

Support for MultiCloud Libraries/stacks

We need support for jclouds, fog, and libcloud across provisioning, monitoring, and billing of all possible assets. We understand it will not be an ODBC-style standard, but something more workable. Provide deeper integration into Chef/Puppet/Ansible/Salt with better templating rather than promoting custom “provider models”. Thanks for integrating with GitHub… push it as the alternative for storing assets, so that config (testing/deployment) and everything else is coded up and stored in GitHub or something similar. Thanks for the support for Docker and CoreOS.

What, Azure is supported only for blobs in one of them (libcloud)? No, PowerShell is awesome, but it is not everyone's favourite piping tool.

Win-Win

I bring you x$, you provide me 0.20%·x. No, really – make the partnership work with real people rather than English. Let us find a way to make adoption faster. Help us unseat the existing partner brokers who are deadweight – whose deployment/AMC (people/cost) models are a challenge in a pure cloud model. That air cover we talked about needs to be about partners, partners, partners. Help unlock the CIO-tech-team ice. It is not about x% discounts on the platform. Focusing on annual sign-ups for certain software licenses will not open the door to a growing pie.

Here is a shout-out to Vijay, who joined MongoDB and correctly points out the “lack of lever” with both customer and seller – there is no complexity. http://andvijaysays.com/2014/03/25/are-we-there-yet-cant-wait-to-start-my-new-adventure/

In a cloud-based setup, simplicity is much more stark.

Support

Real support in terms of what does not work, rather than “green my scorecard – so just use it” (shoved down my throat). Own up to support issues and help bring down my costs and increase your spread. Get folks who understand both business and technology (people outside use different things from what you sell). Let us know what is coming down the line which can potentially make us a commodity. Be honest about it.

No – unless I explicitly ask, don't push a service. I will pick unique services based on their strengths; honestly, I will. I love the completely hands-off 99.99% SQL Azure, where I get backup and HA all at a great price. I wish that infra were available for others hosting stuff like a DB.

Make it supportable

The other OS is as useful and widely deployed, so tools for picking up monitoring information should become better. IIS is a great tool, but so are nginx, Apache and their friends HAProxy, Squid, Varnish. Make “separation/divorce” easier: easier to withdraw data, easier to withdraw configuration settings. The UX should reflect what is possible through PowerShell, the CLI, and at worst language-specific REST bindings – preferably a language which runs on all platforms.


New age media challenges – muzzling the opposing views

TL;DR:

Organizations like Twitter and FB (social media), and search organizations, need to share how they decide what is right/wrong on their sites beyond legal words, and how they decide which view of which participant gets muzzled.

In https://www.youtube.com/watch?v=gYN6uybDKzY Twitter CEO Dick Costolo says people have to assume information will be available to all. Emphasis on the word assume.

In https://www.youtube.com/watch?v=J-y8TcHT8Lg Twitter co-founder Jack Dorsey recounts his becoming an entrepreneur. He exhorts people to join the movement and question everything.

Earlier, the populace had to depend on media – printed media – to take the views of the people to the leaders and vice versa. Unfortunately, like the incestuous relationship of auditors and companies in the private world, a lot of give and take was done and the watchers became the mouthpieces. Over time, interest groups realized they needed to control the media to shape viewpoints and push their agendas. Now we have overtly politically biased media houses catering to their captive audiences.

The birth and evolution of social media helped cement it as one tool for people to exchange ideas and information and possibly form opinions. Sadly, it also came with tools to analyze what is being said and ways to block the “opposing” view with a simple “block/report”.

Corporations and ruling entities could easily circumvent or block an unpleasant question.

The challenge is that a tool like Twitter has not made a lot of things transparent. It is like the Chinese firewall, but controlled by a few people sitting somewhere in CA. Just as with Uber and Airbnb, we have little commitment to or understanding of the issues, and a claim to disruption without an iota of responsibility.

There was a move to bring old-media folks in as editors or advisors in some of the social media organizations. Ideas like protecting the source of information, or allowing questioning but not hate-filled agendas – who decides what gets on the timeline? Who makes these decisions? An algorithm? People – who are they? What are their political, religious, institutional biases? A good way to see these biases is to compare Al Jazeera, The Guardian, BBC, ABC, Fox News, MSNBC, Xinhua, and Google News for an event in Gaza, Europe-Russia events, China or India.

For events which are called terrorist events, a certain section will paint them as “suspected gunmen”. Some organizations will put a religious tone on them by including larger context and attaching religious imagery to words, groups, and faith adherence. Or sometimes there is a complete blackout of news, as in some “controlled” countries.

Tools like Google News, Twitter, Facebook and others need to come clean on:
– what the ranking for the feed is – really, what is it that you decide our world is, whether a search engine, a timeline or the wall? Are you providing governments and organizations a way to control what we see/hear even before it comes online, or to muzzle it?
– what is ignored, and what is given more weight
– what is blocked – at least a notification that you have been blocked; and in the case of search results, just how do you decide what to show on those pages, and what got ignored/blacklisted
– how does the unfolding of “non-popular” but obscure, important stories, events, and views happen? Is there a metric here for people to follow?

This is to avoid biased coverage like the printed media produces because of its affiliations (owner – Fox/Al Jazeera, or Network18 here locally).

What does this mean?
As originally said, we will need to be ready to withstand opposing and unpleasant viewpoints. And let laws which are less stringent than French laws for questioning others be more prevalent. This has a geopolitical connotation – earlier, media could be controlled easily by not allowing airwaves, print media, or the import of books. Sadly, the digital world is much more easily controllable, and its disappearance is much more silent. Your search results can disappear; your tweet or Facebook post could be muzzled.

This also means the role of PR/media advisors, and of tools which do topic and sentiment analysis (however broken), needs to become “auditable” across organizations, with laws backing that up.

The tough challenge is that digital media allows photographs, videos and other assets to be put online which have a much more shocking impact on the people watching them. They are also considered powerful propaganda material which organizations and governments want to control.

For example, a sadist organization like IS uses them to recruit and influence a section of people. These organizations balance out “negatives” with posts of “positive” actions – “helping the neighbourhood”, etc.

The reason governments carry out muzzling is either to curry favor for the rulers or to preserve the perception of being right. This can have deep, festering origins – China still seething from the opium trade or the indignities of Nanking; India not liking the questions around favors to the near and dear ones of the ruling section, or certain actions of the police or an investigative agency somewhere. Or, worst of all, to control opinion or the questioning itself.

The other, stronger reason is that throughout our history we have had specialists who claim to know economics and foreign policy, and certain people who control political agendas. Only certain agencies and people are considered competent to know and act on certain things.
For instance, I personally think it was brave of the American folks to question the methods of their intelligence agency after the Snowden and other revelations. Not every country has either the guts or the desire to explore those depths, because of perceived guilt or the affront to the pedestal status of being right. Sadly, other countries and people who are saying “we said so” have much more corrupt and unaccounted-for actions. See Turkey or Saudi Arabia, or for that matter a developed country's surveillance and treatment of prisoners (political/ideological/war), or any other UN country. War and intelligence are intertwined, and the latter is important for a lot of things. Some police organizations in other countries are far tougher and have more unspeakable tactics than the agency which was admonished, but that fact was never brought out by the mainstream media or the digital folks.

Folks adept at misuse will do so, and have the potential to misguide the populace over religion, language, and the perceived impact of abortion/gay marriage on local customs. At the end of the day, Us-vs-Them is the end tool in politics, corporations and local communities, and these tools should not become pawns for those purposes. There are countries which choose to import a few things and viewpoints and want to control others. It is amazing to see Obama – who is leftist, has unleashed more drone-based attacks, and ended a few wars – painted as more unpatriotic in that country. Almost every right wing everywhere considers itself more patriotic, and leftists/liberals apologists.

New tools focus on the dissemination of information, but this control currently rests neither with governments (at least not explicitly) nor with the people who are fed it. We do not have the oversight of good editors who decide on serendipity, local context, and issue weight/counter-opinion. Everything is instant – trends for today, popular now – and just like that, incidents are pushed off the main screen, while language constructs prevent semantic or topic-based search. The dominance of a few firms in each country and region prevents healthy conversation and the next steps to open them up for everybody.
I heard locally that city officials do not want agencies or themselves on social media, as then they need not be answerable or keep countering viewpoints. It is easier to control physical news – A/V, print media – by buying it out or dumping a few ads. It is easier to overcome digital media by not being on it.

Sadly, we need not choose this future, as we see some good possibilities, such as traffic police on social media.

But beyond this, a common man needs a way to know why his voice was muzzled. Context – I asked:
– @PMOIndia that it is time to have swachh people first and to speak less with more action.
– @PMOIndia about the use of very colorful and respectful language – threatening mothers, sisters and death – by a ruling party MLA against a medical officer for reinstating his “corrupt” relative.

– I asked @economist about their language in response to Boko Haram. Verbatim text: “the group has been boosted by the impotent reaction of regional governments”.

In the former case I duly expect cells of the ruling party – PR teams finding “offensive viewpoints, questions” – to report and block. This is very similar to the PR teams of corporates who have to contain the “perceived damage” and “move on”. Which they duly did, by saying “suspension is enough”, with no police action required for a person who has done this before.
In the latter case, a respected news organization should ideally have just expressed regret over the language and moved on, as “macho” responses are more acceptable across cultures and carry different connotations for perpetrators and victims. Of course, in the new scheme of things, a person with the help of software decided my question was neither worth answering nor thinking about, but effectively required banning. Sadly, Twitter abetted it.

Sadly, Twitter failed me in both places: it did not bother telling me my tweet was “blocked/banned” or just wiped off. It lost its credibility in objective evaluation and instead let a machine algorithm take precedence. I am sure a celebrity like Appelbaum's views will not be muzzled (just a guess), but some obscure person somewhere is an OK target.

We are back to owners and listeners and the incestuous relationships of auditors and their clients. Our owners across media, corporations and governments have found new ways of aligning their mutual interests. This, unfortunately, technology can't overcome with “RideWithX” or “IndiaWithXX” tags and self-congratulation. We have a darker future where information can be taken off without a trace and viewpoints created with a bunch of hired hands.


Azure Linux THP

You should read up on the compatibility of your application with THP (transparent huge pages).

How do you find its status?

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled  
grep -i --color huge /proc/meminfo
sudo sysctl -a | grep hugepage

At present you will see cat /sys/kernel/mm/transparent_hugepage/enabled reporting that it is enabled: [always]. The other commands are additional ways to see the usage.

How do you modify it? 

1. Edit /etc/rc.local (or better yet /etc/sysctl.conf, where your distro supports it). For rc.local, add:

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi

2. Add “transparent_hugepage=never” to the kernel boot line in the “/etc/grub.conf” file.
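
To apply the change to the running kernel and verify it without a reboot, something like the following should work (the sysfs path may differ on older RHEL kernels, which use redhat_transparent_hugepage):

# disable THP for the running kernel and confirm
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
cat /sys/kernel/mm/transparent_hugepage/enabled   # expect: always madvise [never]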

 

Oracle – does not like THP.

Mongo – does not like THP and prefers 4k pages.

Cassandra – there was a thread on Twitter and the Google group wrt THP. The suggestion looks to be to disable it.

Hadoop does not like THP.

Splunk does not like THP.

MySql does not like THP.

Postgres does not like THP.

What does it do – here are the details. 


Azure Linux tip – swappiness

In general, folks disable swap for memory-bound processes on Linux instances (YMMV).

How to detect whether a swapfile is present

1. grep -i --color swap /proc/meminfo

2. swapon -s

3. free -m

You will get confirmation that no swap is set up. If you check swappiness via cat /proc/sys/vm/swappiness, though, you will see the default of 60 :). The question on your mind will be: where is it doing the swapping?

What should you do? In general, no swapping is a good thing, so setting swappiness to 0 is a good default with the stock installation. In case you require a swapfile (which you will, if you care about the latest kernel changes), add a swap file backed by the local disk (sdb1 on /mnt mostly, or SSD) on the guest (do not put it on Azure Storage).

How to modify swappiness (for a web or file server): echo 5 | sudo tee /proc/sys/vm/swappiness, or sudo sysctl vm.swappiness=5. To persist this setting through reboots, edit /etc/sysctl.conf, and make sure to add the swapfile to fstab. No swapping is good for Lucene workloads (Solr/Elasticsearch) and databases (Cassandra/Mongo/MySQL/Postgres etc.), but for stability on machines constantly running at peak it is good to have local disk/SSD as a backstop.

How to allocate a swapfile (usually on the local disk – use df -ah to get the mount name):

– sudo fallocate -l 4G /mnt/swapfile (ensure the size is double the memory size)

– ensure root has access: sudo chmod 600 /mnt/swapfile

– sudo mkswap /mnt/swapfile

– sudo swapon /mnt/swapfile (without this, the next step shows nothing)

– verify with free -m

– add it to fstab: sudo nano /etc/fstab and add the line /mnt/swapfile none swap sw 0 0

To switch off swapping completely on Linux systems, you can disable swap temporarily by running sudo swapoff -a.

To disable it permanently, you will need to edit the /etc/fstab file and comment out any lines that contain the word swap.

To ensure swappiness stays switched off after a reboot, set the value in /etc/sysctl.conf. Note that sudo echo … >> file does not work as intended, because the redirection happens in your own (unprivileged) shell; use tee instead:

echo '' | sudo tee -a /etc/sysctl.conf
echo '# Set swappiness to 0 to avoid swapping' | sudo tee -a /etc/sysctl.conf
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf

Why have swap at all if nobody likes swapping and it is not the 90s? For safety. From kernel version 3.5-rc1 and above, a swappiness of 0 will cause the OOM killer to kill the process instead of allowing swapping (ref – http://java.dzone.com/articles/OOM-relation-to-swappiness). While you are at all of this, do notice df /dev/shm and see what you can do about it. Do you want to use it?

Ref –

  1. ElasticSearch – from a strong bootstrap.mlockall, with the suggestion to set swappiness to zero to switch it off and to also instruct the OOM killer not to kill it: http://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-configuration.html
    1. When Otis says something – I just follow it. http://elasticsearch-users.115913.n3.nabble.com/mlockall-vs-vm-swappiness-td4028126.html
  2. Solr –  (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_tuning_solr.html )
  3. Cassandra http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html
  4. MySql – https://mariadb.com/kb/en/mariadb/configuring-swappiness/
  5. MongoDB – http://docs.mongodb.org/manual/faq/diagnostics/
  6. Postgres –  it is the same suggestion.
  7. Oracle – http://docs.oracle.com/cd/E24290_01/coh.371/e22838/tune_perftune.htm#COHAG223

Azure throttling errors

Most cloud services provide elasticity, creating the illusion of unlimited resources. But many times hosted services need to push back on requests to provide good governance.

Azure does a good job providing information about this throttling in various ways across services. One of the first services was SQL Azure, which provided errors to help the client retry. Slowly, all services are now providing information when they throttle. Depending on whether you access the native API or the REST endpoint, you get this information in different ways. I am hoping that comprehensive information from services and underlying resources like network, CPU and memory starts percolating out, as it does for storage, so that clients and monitoring systems can manage workloads.

Azure DocumentDB provides throttling error and also the time after which to retry.
(HTTP error 429 ) . It definitely is ahead of other services for providing this exclusive information.

Azure Storage, on the other hand, provides information to the native client so that it can back off and retry. It also pushes this information into metrics. A great paper exists which provides information about Azure Storage transactions and capacity.

SQL Azure throttling – it was one of the first services to provide throttling information due to CRUD/memory operations (error codes 45168, 45169, 40615, 40550, 40549, 40551, 40554, 40552, 40553).

Azure Search throttling provides HTTP error 429/503 so that client can take proper action.

Azure Scheduler provides HTTP Status 503 as it gets busy and expects client to retry.

Azure Queue, Service Bus Queue both send back 503 which REST clients can take advantage of.

BizTalk Services provides “Server is busy. Please try again”.

Over the years we have always asked customers to exploit Azure, and one of the ways is to work with the hosted services and plan workloads by catching these kinds of errors. Some customers like SQL Azure's throttling so much that they wished for some of those soft/hard throttling errors in their on-premise databases.

Most Azure services do not charge when quotas are hit or throttling kicks in. The idea is for the client to back off and try again. I hope, though, that the “monitoring” becomes better – for example, in the case of BizTalk Services a client should be able to query the “busy-ness”, since it has to retry after the system becomes less busy. SQL Azure's retry logic has been well codified and understood over the years.

Just in case you wonder whether other public cloud services have throttling too: public cloud services are shared infrastructure and implement throttling for governance. It is exposed in different ways. DynamoDB, for example, has a 400 series of error codes – specifically LimitExceededException, ProvisionedThroughputExceededException and ThrottlingException. Almost every service has a 400 series of errors with throttling as a specific exception.
