Respecting a schemaless (dynamic schema) store’s strength

Recently several folks using a NoSQL/data-structure store contacted me for guidance about daily operational issues, bad data and a lot of finger-pointing.

In one case, the main product manages click impressions and tries to give “perspective” on views of an ad. Unfortunately it is a “mixture of solutions”, each with its own application life cycle and ownership.

Issues

1. Mismatched data types
Incident: Application 2 started storing arrays where a string was expected, most likely because another application changed when the requirement moved from a “single value” to a multi-valued item. The error surfaced only intermittently, in a few filter queries and reports, and was hard to catch because it happened only for certain filters. Another impact is increased storage: if the information is small, a simpler type could have represented the data.

2. Subtle change in the name of a document element, resulting in missing data
Incident: An application started reporting incomplete data. The cause was an incorrect rename, which silently created a new field and led to a bigger bug.

3. Change in data type precision
Incident: Values suddenly stopped matching in reconciliation because a float had been changed to an integer (gasp).

In a document store, by its very nature, it is very easy to make simple mistakes.

e.g.,

> db.test.save( { x: 1 } )

> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }

// Now store a string for the same key, without any complaint
> db.test.save( { x: "str" } )

> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb598"), "x" : "str" }

> db.campaigndetails.save( { _id: "xkcd", "details" : { campaign_id: 1, inventory_id: 2, audience_attr: 3 }, rev: 10 } )

> db.campaigndetails.find( { "_id" : "xkcd" } )
{ "_id" : "xkcd", "details" : { "campaign_id" : 1, "inventory_id" : 2, "audience_attr" : 3 }, "rev" : 10 }

// A mistyped key ("id" instead of "_id") matches nothing and raises no error
> db.campaigndetails.find( { "id" : "xkcd" } )

Is that the store’s issue? Nope. It is the user’s responsibility, in this case the developer’s, to take the right measures so that these situations are avoided. What are the other repercussions? Say there is a “flow of data” to another place: all the ETL/messaging kind of processing has a greater chance of failing if a type changes or an extra field gets added.

How can one overcome it?

1. Use common data repository access code
2. Use common data repository validation code
3. Use common messaging transformation code
4. Use basic layout verification code
5. Since many document databases do not support joins, folks create either an embedded/hierarchical or a linked structure. An embedded schema has implications when the “value” of a related item changes, because every embedded copy has to change with it, and many issues show up here.

6. Reconciliation of the catalog (schema) with data values: run this once in a while to look for any deviation from the agreed-upon schema.
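
The validation and reconciliation points in the list above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the CATALOG dictionary and the find_deviations helper are hypothetical names, and the catalog simply mirrors the campaigndetails document shown earlier. A real check would read documents from the store through its driver instead of using literals.

```python
# Hypothetical agreed-upon "catalog": field name -> expected Python type.
CATALOG = {
    "_id": str,
    "details": dict,
    "rev": int,
}


def find_deviations(doc, catalog=CATALOG):
    """Return a list of human-readable deviations of doc from the catalog."""
    problems = []
    # Every catalog field must be present and have the agreed type.
    for field, expected in catalog.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(doc[field]).__name__}"
            )
    # Anything not in the catalog is reported too (e.g. a renamed key).
    for field in doc:
        if field not in catalog:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"_id": "xkcd", "details": {"campaign_id": 1}, "rev": 10}
bad = {"id": "xkcd", "details": [1, 2], "rev": 10.0}

for name, doc in (("good", good), ("bad", bad)):
    print(name, find_deviations(doc))
```

The "bad" document reproduces all three incidents from the top of the post: a renamed key, an array where a sub-document was agreed, and a float where an integer was agreed. Run as a periodic job over a sample of documents, a check like this catches the drift before the reports do.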

Another data-structure store example showing dynamic schema

redis 127.0.0.1:6379> set car.make "Ford"
OK
redis 127.0.0.1:6379> get car.make
"Ford"
redis 127.0.0.1:6379> get car.test
(nil)
redis 127.0.0.1:6379> get car
(nil)
redis 127.0.0.1:6379> get car.Make
(nil)
redis 127.0.0.1:6379> set u '{ "n":"f_name"}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\":\"f_name\"}"
redis 127.0.0.1:6379> set u '{ "n": 1}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\": 1}"

Sadly, in their hurry people are throwing the baby out with the bathwater and forgetting that somebody will need to maintain the data and derive value from it. Data is sacred and its sanctity should always be maintained.

Schema design also has an influence on some of the above issues. Rather than blindly “embedding” documents, a simple check should be done for “immutability”. If the embedded item can change, a lot of data gets duplicated as the store grows, and it is better to keep it separate. Another suggestion would be to keep the field names extra-small.

 

Update – Article on Cloud and Hive

Suprotim and Sumit pushed me to publish an article on “Decision Making Pivots for adoption of Cloud”. It is basically the gist of the guiding principles we use to help customers migrate to the cloud. We have a few variations for enterprise strategy where workloads like Exchange (email), SharePoint (collaboration) or CRM need to move to the cloud. A colleague, MS Anand, helps customers on the private cloud adoption front to create efficiencies out of existing infrastructure. Here is the document, which focuses on Azure and was part of the magazine: Azure Adoption – pivots to help make right decision

I just completed something else I promised Suprotim: an article on Hive for people who are used to SQL as the dialect for interacting with a database. Comparing a database with Hive is not strictly an apples-to-apples comparison, but I wanted an approach where understanding BigData does not become a burden of learning MapReduce/HDFS and the overall Hadoop ecosystem. It is much easier to start with something very simple that we do with a regular data store, try to do it with Hive, and then look at the differences. That also helps in understanding why HDFS and MapReduce help address scale and availability for very large amounts of data. Although there are tools like Pig/Cascalog/Scalding/Cascading, I decided to focus on HiveQL as it is the closest to the SQL dialect, with the simple intention of not introducing many new things simultaneously. Once the article has been out in the magazine for a month, I plan to share it here, or you can pick it up from http://www.dotnetcurry.com/magazine/dnc-magazine-issue2.aspx (updated – 1st Sep 2012) once it comes online.

And if everything goes all right, with help and a push from Vinod & Pinal, I will devote my energies toward something more useful.

Update – 1st Sep 2012 – Things I have not covered in Hive Article-

- There is a tab for Disaster Recovery on hosted Azure. When it is switched on, it is not clear:
    - WRT NameNode
        - Whether FsImage and EditLogs are backed up every x minutes to another “managed” location.
    - Whether there is a secondary NameNode to which the log/checkpoint gets shared
        - There is a secondary NameNode, but executing a command against it over RDP simply hangs.
    - WRT HDFS data
        - Whether it is snapshotted/backed up to Azure storage, taking advantage of the inherent replication there.
    - WRT Hive metadata
        - Whether it is backed up to a “managed” location every x hours/minutes.

- If the NameNode crashes, it is not clear for now (WRT HadoopOnAzure):
    - Whether AppFabric services inherent in Azure are used to identify the failure, bring the NameNode up and use the earlier “managed” location (using the -importCheckpoint option)

- Upgrade & rollback of the underlying version will be part of HOA’s lifecycle management. The assumption here is that at present one version will be prevalent across tenants. Upgrading individual clusters to different versions is not supported.

- Addition/deletion of nodes into existing cluster

- Adding incremental data

- Adding a Fair scheduler

- Monitoring the job progress/cancellation

- Identifying bottlenecks  in JVM/hdfs settings

- Dealing with hadoop fsck identifying bad/missing blocks and related issues

- Rebalancing

 

Updated – 20th Sep 2012

Cloudera posted a wonderful article on using Flume, Oozie and Hive to analyze tweets: http://www.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

Hive article & Azure adoption article

I recently wrote an article for DotNetCurry, putting down my experiences with Azure migration pivots.

I will be jotting down my experience with, and the basics of, HadoopOnAzure in the same way.

I intend to cover how to look at Hadoop from a database user’s point of view. It will cover storage, querying and loading of data using Apache projects such as Pig/Hive. We will try not to start from MapReduce jobs, as they tend to cloud the adoption judgement of administrators/developers who are familiar with the SQL dialect and depend on it to define schema and to query. We will cover availability (the strongest point of HDFS), scalability (HDFS makes adding nodes easy) and querying (Pig/Hive). The article will not cover machine learning, performance tuning of HDFS/MR jobs, installation, or management/monitoring.

I will try to publish a link to the Word document with errata for the Azure article here one of these days.

Data Visualization – Keeping it Simple

This post is dedicated to the Economist magazine’s data crunchers and visualizers, who toil week after week to produce high-quality material. I have taken the liberty of using some of their content to show how dedicated they are to their craft. The other organization is the Guardian newspaper, which has an excellent data team.

Context
In the week of July 23rd we will have a BigData conference (5th Elephant) right in our backyard, courtesy of the HasGeek team. Hopefully attendees will take advantage of the packed 2-day agenda. Right now the early-bird discount is on. The focus is on tools, processes and best practices, with a special section on visualization: D3, Excel, process and Python will be discussed. There is of course the much-awaited session from Anand.

For me it is not the infrastructure but the questions we ask of the data that matter. Once on that path, visualization is a great help in unfolding meanings, correlations and insights. Some tools are better suited than others to convey the “gist”.

Data visualization is intricately linked to the interpretation of the issue. The following items are from the Economist, where data shines in bringing the issue out into the open. One of them is an aberration, as it takes longer to comprehend its gist; most of the others get the point across.

The easy ones

European Bank capital requirement under severe recession

a. Hell Holes is as simple as they come: a clear title and description at the top and the source at the bottom. A simple bar chart plots countries against the requirements. White vertical lines accentuate the boundaries. All this helps the viewer grasp the information very, very quickly. We are made aware that banks will need a lot of money if worst comes to worst (Spain was just bailed out).

Cost as % gdp for environmental degradation

b. Eat your greens plots the cost of environmental degradation as a % of GDP for various developing countries. The latter part is just an educated guess, and the G-20 are not part of the display.

Different strokes

Loan burden

c. Different strokes is a classic Economist staple chart showing 3 different “metrics” for 4 different countries over the same number of years. A simple inference one can draw is that Spain is the best of the lot except in employment. This is directly correlated with the construction boom in Spain. Apparently one more graphic also shows how banks lent more for property than for anything else.

Happy with monarchy

d. Reign rather than rule shows that people in Britain are still more satisfied with the monarchy than with their elected government. The country name is missing, but since this graphic is part of an article on the monarchy in Britain, I have taken the liberty of providing that information. There is an interesting delineation of the century: 1997, 2000 and 2012 are written as full 4-character years, while the years between 2000 and 2012 are written with 2 characters, yet we are always sure where a point lies on the timeline. There is a small vertical line on the horizontal axis on either side of the timeline; I assume it just indicates “continuity”. Horizontal lines again make the slope of the line clear.

Ethnic Hotchpotch

e. The Ethnic Hotchpotch in Kenya graphic shows how fragmented the country is, with the various tribes shown as a % of the total population. The interesting part is the first horizontal row, depicting the “Kikuyu/Meru/Embu” tribes. It shows a break in the horizontal bar, as the graph only accommodates values up to 15%. Wisely, the rest of the proportion is shown outside the 15% and labelled with the quantity, in this case 27. The graphic also clarifies a lot of detail about the “covered data”.

Houston we have a problem

f. The Houston graphic plots the population density per hectare of various cities against CO2 emissions (in tonnes) per person. Same-sized circles are scattered all over, with white ones used to set certain cities apart from the others. It does bring out the fact that both Texan cities are in bad shape even though their population density is very low.

Foreign or local

g. A stacked bar chart shows that, out of total vehicle sales, folks in China still prefer foreign brands. Here again the white background bars accentuate the data. That slight extra overhang does a beautiful job of setting off the numerical indicator.

Anti social behavior

h. The Anti social behavior graphic has two parts. The line chart plots two quantities, number of trips and distance, and both are going down. The pie chart shows the “reason” for the trip and the % change from 1995-97. There are two small issues here. The title does not indicate the country, so the graphic does not stand by itself. And the pie chart indicates a change between two years, but the old reference year is given as the range 1995-97. Shopping trips could have gone down as people multitask: they might pick up groceries on the way back home, which is possible if a convenience store is near the transportation link. Why visiting friends has gone down is strange, but could it be explained by social media presence :) ?

Merkel the popular one

i. This graphic plots the German chancellor’s popularity from Mar-2010 to May-2012. Events, bailouts and dates are superimposed to provide deeper context. Angela Merkel’s popularity has held up overall and has gone a few notches higher, which corresponds to her handling of Spain and Greece. It did waver during the Ireland and Portugal crises. Although the German elections are mentioned, it would have helped if the “% increase/decrease in seats/representation” had been shown.

Medium ones

Women at work

a. Women at work is a simpler graphic, with the numerical values added in the right overhang and Germany picked out in a different color. The data is qualified as being for the year 2010. The graphic stands on its own in sharing important information about an important bill to help women stay at home and take care of the family.

Print is not dead

b. This plot shows the revenue of magazines and newspapers across 2006-2010, plus a forecast till 2015. It also indicates that the 2011 figure is an estimate. The 2006 revenue is taken as the base of 100, which brings out the scale of the increase in magazine revenue. The base is set apart by coloring the horizontal line at 100. An interesting addition is the numerical figure indicating 2010 sales; its background color also ties it to the matching area.
This has to be the most beautiful representation of the data, conveying that magazines and newspapers are surviving, with the former doing far better than the latter. There is a slight change of background color for the period 2012-2015 (just that touch again shows the dedication).

Stormy seas

c. The Stormy seas graphic shows that the container shipping industry might be making a comeback after a really bad last year. Prices might be going up due to a lack of capacity, or the exhaustion of excess capacity. Reading the article, it was clear that the industry finally stopped building so much capacity. Last year freight prices were down while fuel prices were going up. One phrase which stands by itself and is a bit confusing is “Asian-northern europe centistoke bunker fuel”; this is way too much “clarification data”. The use of color to tie each of the 2 y-axes to its plot helps in reaching the above conclusion.

tnk-bp

d. TNK-BP shows one important thing: BP is sucking the dividends out of its subsidiary, and it extracted an especially large share last year (close to 90%). Showing the % alongside the stacked bars is a wonderful idea, as it conveys how sharply the payout has increased. It looks like this subsidiary is going to have to find its own money next year.

Easing again - another initiative to push

e. Now a graphic which juxtaposes employment, as a bar chart by month from Jan 2010 to June 2012, with the S&P. Employment seems to be falling from its peak of January 2012, and the S&P too is going down at this juncture. Two specific rounds of “push” are marked vertically.

Both of the above graphics use one horizontal axis to show two different metrics. This condensation is important in helping comprehension.

Tough ones

CO2 abatement cost

a. Now comes the most interesting graphic, which compresses a lot of information and requires concentration from the reader. Right from the word “abatement” down to the heights/widths of the bars, everything has to be explained and understood in its entirety. As far as I could see, it is trying to say: the cost of abating CO2 by a certain year through a certain activity is so much, and its potential is so much. I would guess it is a form of variwide chart?

Mcwages and BMPH (Wages and buying power – is the economy slowing down)

b. Now to a much better analysis and its representation. This confirms the Economist’s estimates of a slowdown and also indicates a lowering of the standard of living in the US. It is also a good attempt to compare real wages vs buying power across countries. The author of the study has data about wages and burger prices. Economists divide the McWage by the cost of a Big Mac to get “Big Macs per Hour”, or BMPH, for comparing countries. The two graphics together tell us that the US, Canada, Western Europe and Japan have similar wages. The developing countries, specifically India, earned about 10% of the developed world’s wage. In the U.S., BMPH was 2.4 in 2007. The McWage is up 26% in four years, but the cost of the Big Mac is up 38%, possibly due to increases in food prices. The net 9% drop in US BMPH is one sign of a reduced overall standard of living. In Russia, BMPH increased 152% from 2000 to 2007 and another 42% from 2007 to 2011. China had increases of 60% from 2000 to 2007 and another 22% from 2007 to 2011. India saw a large increase of 53% from 2000 to 2007, but its BMPH declined by 10% from 2007 to 2011.
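
The “net 9% drop” is easy to check by hand: BMPH is wage divided by price, so a 26% rise in the wage and a 38% rise in the price multiply BMPH by 1.26/1.38. A small Python sketch of that arithmetic follows; the function name is mine, and the only inputs are the percentages and the 2007 US figure quoted above.

```python
def bmph_change(wage_growth, price_growth):
    """Relative change in Big Macs per Hour, given relative growth
    in the hourly McWage and in the Big Mac price."""
    return (1 + wage_growth) / (1 + price_growth) - 1


# US, 2007 to 2011: McWage up 26%, Big Mac price up 38%.
us_change = bmph_change(0.26, 0.38)
us_bmph_2011 = 2.4 * (1 + us_change)

print(f"change in BMPH: {us_change:.1%}")     # -8.7%
print(f"implied 2011 BMPH: {us_bmph_2011:.2f}")  # 2.19
```

So the drop is about 8.7%, which rounds to the 9% quoted, and the 2.4 Big Macs an hour of 2007 become roughly 2.2.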

window-of-opportunity

ural-falls (part graphic)

ural-falls (part graphic2)

c. Now the most interesting set of graphics, which suggests that Gunvor’s trading was either a signal for, or was simply driving, the price of Urals (a heavier crude with more sulphur) relative to Brent. It does not have the Brent data, but Urals data from Jan 2005 to May 2009 is plotted with the periods of Gunvor trading marked. Most of the time the pattern one sees is that prices fall. But we need to temper it with Brent data to see any correlation. The conspiracy theory of course is that Gunvor bought the oil at a lower price and sold it to Europe at a higher price. Now, isn’t this great detective work? Apparently this study was done over 2 years.

One common thing you will notice is the red bar on every graphic. Sometimes, if an article has multiple graphics, they also carry a # at top right. Consistent and concise, and all that data science work proves that it is the question which matters, not knowledge of the tool/infrastructure; those can be developed or leveraged. Right now many folks are struggling with why/when to use BigData (how to get it, store it, analyze it).

Books on visualization -

  • Show me the Numbers, by Stephen Few (Amazon)
  • The Visual Display of Quantitative Information, by Edward R. Tufte (Amazon)
  • The Wall Street Journal Guide to Information Graphics, by Dona M. Wong (Amazon)
  • Visualize This!, by Nathan Yau (Amazon)
  •  Information Visualization, by Colin Ware (Amazon)
  •  How Maps Work, by Alan M. MacEachren (Amazon)
  • The Back of the Napkin, by Dan Roam (Amazon)

At times visualization clutters up the information and leaves the reader confused. This is one such case from the NYTimes: how do the middle and top quintile persons compare? Sometimes the plain representation of data is simply wrong: how can accuracy be 120%, as in a paper on the analysis of liver issues? So overdoing it, or doing it for the sake of it, is not advisable. That is why I like the Economist’s use of plain graphics. They get the point across without cluttering things up.

Although there are many visualization websites, this one is neat.

Data Quality Services(DQS) – deduplication/matching Hindi data with SQL Server

This is an attempt to do a 101 of 101 (Vinod and Pinal know the reason why we have to do this :( ). I have been interested in de-duplication for the longest time, and wished DQS had been released earlier so that with 2012 they could have had a “fresh” start.

In any case, here we are looking at data from http://fcs.up.nic.in/upfood/helpline/ReportRegidWise.aspx. The idea is to find potential duplicates using a set of attributes like card holder name, father/husband name and mother name. The data is held in Excel here (it could be in SQL Server too).

To do matching, a DQS project (matching policy) needs to be created in SQL Server.

We need to choose the data source, which is the Excel file in our case.

Now we need to choose the sheet where the data resides. The row header option needs to be chosen to indicate that the first row has the names of the columns.

An important element of matching is defining the domain, which is essentially defining data about the data: in this case, metadata about the columns we are going to examine. At minimum it involves giving the name and datatype (date/integer/decimal/string). For the name columns we are going to use string data with the language set to “other”. This ensures that Hindi etc. can be compared.

Once the domains are defined, we need to map the columns to the domains, which is a pretty straightforward process.

Now we need to define the rules for matching. In our case we are going to compare the rows based on 3 attributes (card holder name, father/husband name, mother name). One can do an exact comparison, but we will choose the similarity-based one. We also need to give each column a weight, which determines how much it dominates the comparison. We have chosen 70, 20 and 10 for the card holder, father and mother names respectively.
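
To make the weighting idea concrete, here is a rough Python sketch of a weighted similarity score using the 70/20/10 split. It is not how DQS computes its score internally (that algorithm is not documented here); the similarity measure is swapped for difflib’s SequenceMatcher from the standard library, and the field names and sample names are made up for illustration.

```python
from difflib import SequenceMatcher

# Weights from the matching rule above; field names are hypothetical.
WEIGHTS = {"card_holder": 70, "father_or_husband": 20, "mother": 10}


def similarity(a, b):
    """Similarity in [0, 1]; a missing value never counts as a match."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def match_score(rec1, rec2, weights=WEIGHTS):
    """Weighted similarity of two records on a 0 to 100 scale."""
    return sum(
        weight * similarity(rec1.get(field, ""), rec2.get(field, ""))
        for field, weight in weights.items()
    )


a = {"card_holder": "कैलाश", "father_or_husband": "रामलाल", "mother": "सीता"}
b = dict(a, card_holder="कैलास")  # one character differs
c = dict(a, mother="")            # mother's name missing

print(match_score(a, a))  # identical records
print(match_score(a, b))  # near match on the dominant column
print(match_score(a, c))  # missing value costs exactly its weight
```

Because the card holder name carries 70 of the 100 points, a small spelling difference there moves the score far more than a missing mother’s name does, which is the behaviour the weights are meant to produce.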

After this it is simply a matter of starting the profiling.

Once you look at the matched results, there is a Mr. Kailash with a similar-sounding father’s name in two villages: potentially a duplicate.

Then we have exact matches.

The profiler can show the need for data cleansing/massaging. Honestly, every de-duplication exercise should first go through a normalization/cleansing process, followed by a physical verification process, as data can indicate only so much. It is also better to have more columns to compare, for more granular control.

Nearly 40% of the data for mother’s name is missing, as is the card # :( . Aadhar could simplify enrollment by cleaning up the duplicates here and comparing the data with others; unfortunately they wanted a “completely new” enrollment etc.

Background - 

De-duplication and cleansing are required while creating a DW or just cleaning the data. Imagine creating a customer 360 view where the data for a customer comes from multiple systems. Unfortunately, as can happen, information about the customer can be represented slightly differently in different systems. One needs to cleanse and de-duplicate this data before using it to make decisions.

String comparison functions in SQL Server -
SQL Server has the = operator, which compares values for exact equality. The LIKE operator goes a step further and matches patterns, but misspellings are not its strength. SOUNDEX/DIFFERENCE compare two strings phonetically. Master Data Services adds a Similarity function. This function is unique in the sense that it offers the Levenshtein edit distance, Jaccard similarity coefficient, Jaro-Winkler distance and longest common subsequence algorithms.

Jaro-Winkler is usually the best algorithm, but this should be verified with your own data.
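
For readers who have not met these measures, here is a small Python sketch of two of them, Levenshtein edit distance and the Jaccard coefficient. These are textbook versions written for illustration only; they are not the MDS implementation, and the choice of character bigrams for Jaccard is my assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaccard(a, b, n=2):
    """Jaccard coefficient over the sets of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # nothing to compare: treat as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(levenshtein("kitten", "sitting"))  # 3
print(jaccard("night", "nacht"))         # 1 shared bigram out of 7
```

Edit distance copes well with typos, while the n-gram Jaccard measure is more forgiving of reordered parts, which is one reason to try more than one measure on your own data.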

Hopefully the DQS/MDS/phonetic pieces will be integrated in future for simpler usage.

Use Case Scenarios
Banks/insurance agencies and government programs (PDS/election card) all have data which needs cleansing and scalable de-duplication services. Sometimes data entry errors, like the reversal of first and last names, result in names not matching.
Generally de-duplication involves a little more complexity, where other attributes like age, location, address etc. also play a role. Checking whether ages agree to within months rather than years makes for an expensive query. Addresses are an issue where they are not standardized as they are in the US.

Name matching by itself can be applied to fraud detection, law enforcement or counter-terrorism.

It is a hard problem because of
- misspellings, variations, cross-language bias, nicknames, name omission (English middle names are omitted; in Spain the maternal name can come before the first name) and data entry errors.

Comparison of methods

http://secondstring.sourceforge.net/doc/iiweb03.pdf

Advanced methods

Many of the algorithms/techniques come from the biomedical world, especially the world of gene sequence matching. Most of them focus on gaps/transitions. One of them is Smith-Waterman-Gotoh: http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/gap-penalty-gotoh.pdf. For examples of sequence similarity search, see http://biochem218.stanford.edu/07Rapid.pdf or http://www.bioinf.uni-freiburg.de/Lehre/Courses/2012_SS/V_Bioinformatik_1/lecture3.pdf

Good reference, for when you have a lot of time: A Guided Tour to Approximate String Matching, by Navarro (ACM paper)

Phonetic methods apart from soundex are
NYSIIS
Phonix
Editex
Metaphone
Double Metaphone
Phonetex

Not all of them are applicable to all languages and available in all databases.

SSIS component for use

Look out for MelissaData, which does matching on multiple columns, works from within SSIS and has various algorithms implemented as part of it.

OSS

SecondString.  (sourceforge)

Desired Features

Need for reports/API
- DQS needs to support offline, API-based interaction to generate reports, without requiring a login to the application.

Need for more transparency about algorithms/pluggability for different languages
- The MDS algorithms should be combined here, with a choice of the relevant algorithm. Similarly, stemming algorithms should be pluggable and available for “choice”. More phonetic algorithms should be added.

Performance guide

-http://www.microsoft.com/download/en/details.aspx?id=29075

One stop guide for DQS

-http://technet.microsoft.com/en-us/sqlserver/hh780961

 

 

Session summary at TechED 2012 – NoSql(Non Relational Store) for relational person

Like every year, I opted for session delivery. One of the sessions was picked up at the last minute, and the organizers (Harish/Saranya) approved it immediately. Unfortunately I was a little late; anyway, I scrambled a bit and got the deck and demos up. But I did not pray to the demo gods. Right from the display to connectivity to the remote server, everything played up, and this after checking the display 2 days earlier and with connectivity working till the last minute. I had a backup connection too.

Big learning

The backup for a network is another network, but a backup for the machine is needed too, if possible a local one. Just don’t remote into different server machines for a demo, however comfortable it is at other times.

So anyway, this session is a 101 for people who are comfortable with relational databases and want to understand why/when/what to use in their scenarios. I chose Redis, Riak, MongoDB and Azure storage/SQL Azure to showcase, with a 2-minute exploration of each, as this could not have been a tutorial on them. I did not have time to explore the MPP/columnar stores or get deep into the how; the idea was to convey the why/when and the possible impact. I also did not get into the Amazon store(s), Coherence, in-memory DBs etc.

I chose Redis as it brings the familiarity of memcached/Velocity, and Riak because it is sort of everything, from KV and document store to search store, and its ability to add/delete nodes is simple and powerful. I chose MongoDB just to demystify the storage and access via indexes/MapReduce in JavaScript.

The whole idea was to help folks contrast these stores with a relational store from the point of view of what they are comfortable with (indexes, joins, ACID, schema, monitoring, management). One of the simplest ways to skin this is to ask questions about range queries and indexes/updates.

Here is the link to the presentation - http://speakerdeck.com/u/govind/p/non-relational-storage-for-relational-person

SSIS continues to SUCK

It forgets the settings on the data sources. It forgets that just a second back it could access the database without any problem. It forgets to remove old metadata. It is the messiest tool ever.

I worked with it on SQL 2005, and it was absurd then. It continues to scale new heights even now in 2008. It is not worth spending time on.

I am going back to bcp. It is a pain, but at least I get reasonable answers.

On Windows 2008 it requires you to execute the package as admin, as otherwise it can’t access perfmon counters (no, don’t ask which ones). The execute package utility is a sham: it can’t remember that it needs to set the /X86 option when it sees the Jet OLEDB provider and either the source or the target is on a 64-bit machine. Now don’t even get me started on the 64-bit provider option. It is a shame that most people get/accumulate data through mdb, yet a 64-bit provider does not exist.

The errors thrown are gems: CANTACQUIRE…connection. But pray, a few minutes ago you did it without breaking a sweat.

No, I am not the only one.

Accessing Jet DB on 64 bit windows – “Microsoft.jet.OLEDB .4.0 provider is not registered on the local machine”

Update (5th Feb 2010, 19 Aug 2011) - There is a beta driver download available: OLEDB provider, Microsoft Access Database Engine 2010 Redistributable – http://www.microsoft.com/downloads/details.aspx?familyid=C06B8369-60DD-4B64-A44B-84B371EDE16D&displaylang=en


Simple answer: there is no 64-bit version of the DAO provider, Jet OLEDB provider or ODBC driver. There is just the 32-bit version of the provider, which runs in WOW64 mode.

Scenarios

1. 64 bit SSIS/ETL tool package trying to access Access/Jet database natively using 64 bit drivers.

2. 64 bit native applications (forms/console apps) natively accessing the database.

3. 64-bit native open source frameworks (Rails/Python-based stuff) trying to access the database through the regular ADO/ODBC provider/driver (look up your favourite support DL; sorry, I do not have answers for adodbapi, odbcmx etc.)

4. For pure ASP.NET applications, follow the advice of Ken Tucker (MVP).

Workaround: one needs to create/use a 32-bit application in WOW64 mode to ensure the 32-bit Jet OLEDB provider can be used. In simple words, just change the compile target to x86 in Visual Studio or the makefile.

Project Settings -> Build -> Platform target -> change the target to x86 (you can accomplish the same on the command line with the /platform option).

[update] To deal with SSIS’s rude behavior on 64-bit, check out the /X86 option for executing a package on the command line. Also ensure Run64BitRuntime is set to false in the metadata.

Brute force alternative (Update): use Corflags.exe as a last option in dire circumstances, for “just trying it out”. This only works for .NET images/assemblies. No hex editing please :) , it is not supported.

 

Reason :

There is no 64-bit version of the Jet driver (hopefully they build one: as most machines will be 64-bit, it will look pretty odd to have to use this trampoline and not be able to do a true migration; not everybody will want to migrate to SQL Express).

via http://www.tech-archive.net/Archive/Access/microsoft.public.access.dataaccess.pages/2006-11/msg00013.html

A detailed x64/x86/WOW explanation is available here.

Official word (well, blogs are as official as they come) on the MDAC roadmap.

Update(5th Feb 2010) - There is beta driver download available – http://www.microsoft.com/downloads/details.aspx?familyid=C06B8369-60DD-4B64-A44B-84B371EDE16D&displaylang=en 

Official statement(5th Feb 2010) – http://blogs.msdn.com/psssql/archive/2010/01/21/how-to-get-a-x64-version-of-jet.aspx

OpenID, OAuth and securing the citadel

Will the web become closed in future, with no way to access data unauthenticated at all? A sort of driving license for the internet? Will the whole swarm of protocols/agreements like OpenID, OAuth, DataPortability and APML not drive content generators to enforce an “id” which can be tracked, with the information used everywhere?

Nope, I do not have a problem with all these nanoformats trying to get at the semantics of the “stuff on the page”. I am just afraid of the “market forces which will ensure that unauthenticated access to information is not allowed”. I am just worried about losing access to data/information without ever needing to identify yourself.

Update - I got a comment from a lawyer type (? did not leave an email address/blog) that Citadel is a “popular” open source email and groupware platform and that it now supports OpenID authentication.

To be fair, I had never heard of Citadel before, and the context too was completely different. I love these subtle guerrilla ploys. This adds spice, in an ironic way, to the current post.