Respecting a schemaless (dynamic schema) store’s strength

Recently, several people running a NoSQL/data structure store contacted me for guidance about daily operational issues, bad data and a lot of finger pointing.

In one case, the main product manages ad click impressions and tries to give “perspective” on the views of an ad. Unfortunately it is a “mixture of solutions”, each with its own application life cycle and ownership.

Issues

1. Mismatched data types
Incident – Application 2 started storing arrays where a string was expected. The likely cause was a change in another application, made when the requirement moved from a “single value” to a multi-valued item. The error surfaced only intermittently, in a few filter queries and in reporting, and it was hard to catch because it happened only for certain filters. The other impact is increased storage: if the information is small, a simpler type could have represented the data.

2. Subtle change in the name of a document element, resulting in missing data
Incident – An application started reporting incomplete data. The cause was a mistaken name change that created a new field, so a small edit turned into a bigger bug.

3. Change in data type precision
Incident – Suddenly values did not match up in reconciliation: a float had been changed to an integer (gasp).
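
To see why the third incident breaks reconciliation, here is a small sketch with made-up cost values (the numbers and names are illustrative, not taken from the incident). One application keeps the fractional value, the other truncates it to an integer before storing it.

```python
# Hypothetical per-impression costs; chosen so they are exact in binary floating point.
costs = [0.75, 1.25, 2.5]

# Application A stores the values as floats.
total_as_float = sum(costs)

# Application B (after the change) stores integers, silently dropping the fraction.
total_as_int = sum(int(c) for c in costs)

# The gap that shows up when the two totals are reconciled.
drift = total_as_float - total_as_int

print(total_as_float, total_as_int, drift)  # 4.5 3 1.5
```

No single record looks wrong on its own; the difference only becomes visible once totals from the two sides are compared.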

In a document store, by its very nature, it is very easy to make simple mistakes.

e.g.,

> db.test.save( { x: 1 } )

> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }

Now store a string for the same key, without an issue:

> db.test.save( { x: "str" } )

> db.test.find()
{ "_id" : ObjectId("508b3d791a1d9f06773cb597"), "x" : 1 }
{ "_id" : ObjectId("508b3d791a1d9f06773cb598"), "x" : "str" }

> db.campaigndetails.save(
{ _id: "xkcd", "details" : { campaign_id: 1, inventory_id: 2, audience_attr: 3 }, rev: 10 })

> db.campaigndetails.find( { "_id": "xkcd" } )
{ "_id" : "xkcd", "details" : { "campaign_id" : 1, "inventory_id" : 2, "audience_attr" : 3 }, "rev" : 10 }

A mistyped key (“id” instead of “_id”) returns nothing, and no error either:

> db.campaigndetails.find( { "id": "xkcd" } )

Is that the store’s issue? No. It is the responsibility of the user, in this case a developer, to take the right measures so that these situations are avoided. What are the other repercussions? Say there is a “flow of data” to another place: all ETL/messaging style processing has a greater chance of failing if a type changes or an extra field gets added.

How can one overcome it?
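
One way to keep a downstream flow from failing halfway is to parse each document into a fixed record at the boundary and set aside whatever does not fit. The sketch below uses plain dictionaries in place of store documents. `CampaignDetails`, `parse_campaign` and `split_batch` are hypothetical names, and the field list is assumed from the campaigndetails example above.

```python
from dataclasses import dataclass

# Assumed layout, taken from the campaigndetails example; not an official schema.
TOP_FIELDS = {"_id", "details", "rev"}
DETAIL_FIELDS = {"campaign_id", "inventory_id", "audience_attr"}


@dataclass(frozen=True)
class CampaignDetails:
    doc_id: str
    campaign_id: int
    inventory_id: int
    audience_attr: int
    rev: int


def _check_keys(where, actual, expected):
    # Catches both renamed fields (missing + unexpected) and extra fields.
    actual = set(actual)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"{where}: missing={missing} unexpected={extra}")


def _require_int(name, value):
    # Exact type check: rejects lists, strings, floats and bools.
    if type(value) is not int:
        raise ValueError(f"{name}: expected int, got {type(value).__name__}")
    return value


def parse_campaign(doc):
    """Turn a raw document into a CampaignDetails, or raise ValueError."""
    _check_keys("document", doc, TOP_FIELDS)
    details = doc["details"]
    if not isinstance(details, dict):
        raise ValueError(f"details: expected sub-document, got {type(details).__name__}")
    _check_keys("details", details, DETAIL_FIELDS)
    return CampaignDetails(
        doc_id=str(doc["_id"]),
        campaign_id=_require_int("campaign_id", details["campaign_id"]),
        inventory_id=_require_int("inventory_id", details["inventory_id"]),
        audience_attr=_require_int("audience_attr", details["audience_attr"]),
        rev=_require_int("rev", doc["rev"]),
    )


def split_batch(docs):
    """Return (parsed records, rejected (doc, reason) pairs) without stopping the batch."""
    good, rejected = [], []
    for doc in docs:
        try:
            good.append(parse_campaign(doc))
        except ValueError as err:
            rejected.append((doc, str(err)))
    return good, rejected


batch = [
    {"_id": "xkcd", "details": {"campaign_id": 1, "inventory_id": 2, "audience_attr": 3}, "rev": 10},
    # audience_attr turned into an array (issue 1)
    {"_id": "a2", "details": {"campaign_id": 1, "inventory_id": 2, "audience_attr": [3, 4]}, "rev": 11},
    # campaign_id was renamed (issue 2)
    {"_id": "a3", "details": {"campaignId": 1, "inventory_id": 2, "audience_attr": 3}, "rev": 12},
]
good, rejected = split_batch(batch)
for doc, reason in rejected:
    print(doc["_id"], "->", reason)
```

The rejected list gives the owning team something concrete to look at, instead of a filter query that fails only now and then.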

1. Use common data repository access code
2. Use common data repository validation code
3. Use common messaging transformation code
4. Use basic layout verification code
5. Since many document databases do not support joins, people create either an embedded/hierarchical or a linked structure. An embedded schema has implications when the “value” of a related item changes, and many of the issues show up here.

6. Reconciliation of the catalog (schema) with the data values: this can be run once in a while to look for any deviation from the agreed-upon schema.
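
Points 1, 2, 4 and 6 can share one piece of code: a catalog that describes the agreed layout, a repository wrapper that checks documents before they are written, and a reconciliation pass that applies the same check to data already stored. The following is a minimal sketch under stated assumptions: an in-memory list stands in for the collection, and `CATALOG`, `find_problems`, `Repository` and `reconcile` are hypothetical names, not any store's API.

```python
# Agreed layout: dotted field path -> allowed Python types. Illustrative only.
CATALOG = {
    "_id": (str,),
    "details.campaign_id": (int,),
    "details.inventory_id": (int,),
    "details.audience_attr": (int,),
    "rev": (int,),
}


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) for every leaf of a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def find_problems(doc):
    """Return the deviations from CATALOG; an empty list means the document conforms."""
    problems = []
    seen = set()
    for path, value in flatten(doc):
        seen.add(path)
        allowed = CATALOG.get(path)
        if allowed is None:
            problems.append(f"unknown field {path}")
        elif type(value) not in allowed:
            problems.append(f"{path}: {type(value).__name__} not allowed")
    for path in sorted(CATALOG.keys() - seen):
        problems.append(f"missing field {path}")
    return problems


class Repository:
    """Single write path shared by all applications (in-memory stand-in)."""

    def __init__(self):
        self.docs = []

    def save(self, doc):
        problems = find_problems(doc)
        if problems:
            raise ValueError("; ".join(problems))
        self.docs.append(doc)


def reconcile(docs):
    """Map _id -> problems for stored documents that deviate from the catalog."""
    report = {}
    for doc in docs:
        problems = find_problems(doc)
        if problems:
            report[doc.get("_id")] = problems
    return report


repo = Repository()
repo.save({"_id": "xkcd", "details": {"campaign_id": 1, "inventory_id": 2, "audience_attr": 3}, "rev": 10})

# Documents written earlier by code that bypassed the repository (made up for the sketch).
legacy = [
    {"_id": "old1", "details": {"campaign_id": 1, "inventory_id": 2, "audience_attr": 3}, "rev": 9.0},
    {"_id": "old2", "details": {"campaign_id": 1, "inventry_id": 2, "audience_attr": 3}, "rev": 9},
]
report = reconcile(repo.docs + legacy)
for doc_id, problems in report.items():
    print(doc_id, problems)
```

Keeping the catalog in one shared module means the write path and the periodic reconciliation cannot drift apart from each other.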

Another example, from a data structure store, showing dynamic schema

redis 127.0.0.1:6379> set car.make "Ford"
OK
redis 127.0.0.1:6379> get car.make
"Ford"
redis 127.0.0.1:6379> get car.test
(nil)
redis 127.0.0.1:6379> get car
(nil)
redis 127.0.0.1:6379> get car.Make
(nil)
redis 127.0.0.1:6379> set u '{ "n":"f_name"}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\":\"f_name\"}"
redis 127.0.0.1:6379> set u '{ "n": 1}'
OK
redis 127.0.0.1:6379> get u
"{ \"n\": 1}"

Sadly, in their hurry people are throwing the baby out with the bathwater and forgetting that somebody will need to maintain the data and derive value from it. Data is sacred and its sanctity should always be maintained.

Schema design also has an influence on some of the issues above. Rather than blindly “embedding” documents, a simple check should be done for “immutability”. If the embedded value can change, a lot of data gets duplicated as the store grows, and it is better to keep it separate. Another suggestion is to keep field names extra-small, since a store that keeps the names inside each document stores them again with every document.
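
The trade-off between embedding and linking can be sketched with plain dictionaries. In this made-up example (the advertiser and campaign fields are assumptions, not the schema from the incidents), an embedded copy of a mutable value has to be updated in every document that carries it, while a linked value is changed once.

```python
import copy
import json

advertiser = {"_id": "adv1", "name": "Acme", "tier": "gold"}

# Embedded: every campaign carries its own copy of the advertiser.
embedded = [
    {"_id": f"c{i}", "advertiser": copy.deepcopy(advertiser)} for i in range(3)
]

# Linked: campaigns hold only the advertiser's key.
advertisers = {"adv1": dict(advertiser)}
linked = [{"_id": f"c{i}", "advertiser_id": "adv1"} for i in range(3)]


def rename_embedded(campaigns, advertiser_id, new_name):
    """Update every embedded copy; returns how many documents were touched."""
    touched = 0
    for campaign in campaigns:
        if campaign["advertiser"]["_id"] == advertiser_id:
            campaign["advertiser"]["name"] = new_name
            touched += 1
    return touched


def rename_linked(advertisers_by_id, advertiser_id, new_name):
    """Update the single shared record; always one write."""
    advertisers_by_id[advertiser_id]["name"] = new_name
    return 1


writes_embedded = rename_embedded(embedded, "adv1", "Acme Corp")
writes_linked = rename_linked(advertisers, "adv1", "Acme Corp")
print(writes_embedded, writes_linked)

# Field names travel with every document, so shorter names shrink each one.
# JSON text length is only a rough proxy for the store's on-disk size.
long_names = json.dumps({"campaign_id": 1, "inventory_id": 2, "audience_attr": 3})
short_names = json.dumps({"c": 1, "i": 2, "a": 3})
print(len(long_names), len(short_names))
```

The embedded form reads faster because no second lookup is needed, so it suits values that never change; the linked form costs a lookup but has one place to update. Very short names save space at the price of readability, which makes a shared catalog of field names even more useful.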

 

Session summary at TechED 2012 – NoSQL (Non Relational Store) for relational person

Like every year, I opted to deliver a session. One of my sessions was picked up at the last minute and the organizers (Harish/Saranya) approved it immediately. Unfortunately I was a little late; anyway, I scrambled a bit and got the deck and demos up. But I did not pray to the demo gods. Everything played up, from the display to the connectivity to the remote server, and this after checking the display two days earlier and with the connectivity working until the last minute. I had a backup connection too.

Big learning

A backup for the network is another network, but a backup for the machine is needed as well, locally if possible. Just don’t remote into different server machines for a demo, however comfortable it is at other times.

So anyway, this session is a 101 for people who are comfortable with relational databases and want to understand why/when/what to use in their scenarios. I chose Redis, Riak, MongoDB and Azure storage/SQL Azure to showcase, with a 2 minute exploration of each, as this could not have been a tutorial on them. I did not have time to explore the MPP/columnar stores or to get deep into the how; the idea was to convey the why/when and the possible impact. I also did not get into the Amazon store(s), Coherence, in-memory databases etc.

I chose Redis as it brings the familiarity of memcached/Velocity, and Riak because it is sort of everything, from key-value and document to search store, and its ability to add/delete nodes is simple and powerful. I chose MongoDB just to demystify the storage and the access via indexes and map/reduce in JavaScript.

The whole idea was to help folks contrast these stores with a relational store from the point of view of what they are comfortable with (indexes, joins, ACID, schema, monitoring, management). One of the simplest ways to go about it is to ask questions about range queries and index/updates.

Here is the link to the presentation - http://speakerdeck.com/u/govind/p/non-relational-storage-for-relational-person