Making sense of a public org on Facebook.

Simple task – Find out whom Delhi Traffic Police has investigated and how long each case took.

Where is the information – Right now on FB.

How can one get it

It should be simple; it is public data. You get all the posts of Delhi Traffic Police and spend a few minutes formulating a regular expression to extract the number plates. Then you correlate the time a violation was reported with the time it was posted as “action taken”. You can do fancier things by extracting entities and plotting them on a map. But to start with, access to the reports and the corresponding action-taken posts will be enough.
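A minimal sketch of that idea in Python, assuming the posts are already in hand as a list of timestamped texts. The plate pattern, the "action taken" marker phrase and the sample posts are all my assumptions for illustration, not the page's actual format.

```python
import re
from datetime import datetime

# Assumed plate shape: state code, district number, up to three series
# letters, four digits, with optional spaces or hyphens ("DL 3C AB 1234").
PLATE_RE = re.compile(r"\b[A-Z]{2}[ -]?\d{1,2}[ -]?(?:[A-Z][ -]?){0,3}\d{4}\b")


def extract_plates(text):
    """Return the plates found in text, with separators stripped."""
    return [re.sub(r"[ -]", "", plate) for plate in PLATE_RE.findall(text)]


def response_times(posts):
    """Map each plate to the delay between its first report and first action post."""
    reported, resolved = {}, {}
    for post in sorted(posts, key=lambda p: p["time"]):
        # Hypothetical convention: action posts contain the words "action taken".
        is_action = "action taken" in post["text"].lower()
        for plate in extract_plates(post["text"]):
            if not is_action:
                reported.setdefault(plate, post["time"])
            elif plate in reported and plate not in resolved:
                resolved[plate] = post["time"] - reported[plate]
    return resolved


# Made-up sample posts, for illustration only.
posts = [
    {"time": datetime(2010, 7, 1, 9, 0),
     "text": "Car DL 3C AB 1234 jumped the red light at ITO"},
    {"time": datetime(2010, 7, 3, 15, 30),
     "text": "Action taken: challan issued to DL3CAB1234"},
]

for plate, delay in response_times(posts).items():
    print(plate, delay)
```

Stripping spaces and hyphens before comparing is what lets a report and its action-taken post match even when the plate is typed differently in each.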

Challenge – FB’s dev policy unfortunately does not allow it :( or at least I could not make enough sense of the policy doc. I asked them, waited a reasonable time, and had no response before posting this.

Conclusion

Twitter is a better friend here.

Search firms should be allowed to gather the posts of “recognized” public entities – in this case Delhi Traffic Police. There need not be any access to the entity’s network of people who are “friends” with it or “like” it.


RTI (KILL OR ATTACK) – Cluster search results? Google Predict/Apache Mahout?

Heard another RTI activist was shot in Ahmedabad. I wanted to look up the number of people killed or attacked for RTI activity. Unfortunately a lot of the results talk about the same incident. I am interested in unique incidents, and I got them by changing the date range – a kludge. Crawling should not be the only thing search engines do; they have enough information to cluster results into a “bunch” that refers to the same incident.

I can see the challenge though: an article may refer to an older incident, which affects the ability to cluster. Another challenge would be an article summarizing the attacks on RTI applicants over the years, but these will be very few anyway.

Apparently a Google Predict account or Apache Mahout should enable this. What is involved?

1. Identifying groups/clusters

2. Cluster identification on common name, date and location: extract the date of crawl, the date of publishing of the article, the source, and the name of the person involved in the attack or death. Name extraction can be accomplished by phrase analysis of the sentence mentioning the two search words (RTI and kill/attack) near each other and looking up the subject. A similar thing needs to be done for the Delhi Traffic Police tweets. Location extraction is also part of the earlier process. If trusted sources provide the data, a pattern can be assumed: Location. Date. “Phrase describing the incident”, followed by a lot of extraneous information.

3. Disambiguating based on exclusive notification of a single incident rather than a summary/article on scores of similar incidents

Trusted source information – rather than looking at all results, the search should be directed at 2-3 places and the information extracted from them: Reuters/BBC/PTI/Xinhua.
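The three steps above can be sketched with plain rules, without Mahout or Google Predict. This is only an illustration: it assumes the name, location and publishing date have already been extracted from each article, and the field names, the 7-day window and the placeholder articles are all hypothetical.

```python
from datetime import date


def cluster_incidents(articles, window_days=7):
    """Group articles that appear to report the same incident."""
    clusters = []
    for art in sorted(articles, key=lambda a: a["published"]):
        # Step 3: an article naming several people is treated as a summary
        # of many incidents and is left out of the clustering.
        if len(art["names"]) != 1:
            continue
        key = (art["names"][0].lower(), art["location"].lower())
        for cluster in clusters:
            age = (art["published"] - cluster["first_seen"]).days
            if cluster["key"] == key and age <= window_days:
                cluster["articles"].append(art)
                break
        else:
            # No existing cluster matched: this is a new incident.
            clusters.append(
                {"key": key, "first_seen": art["published"], "articles": [art]}
            )
    return clusters


# Placeholder articles, not real reports.
articles = [
    {"names": ["Person A"], "location": "City X", "published": date(2010, 7, 1)},
    {"names": ["Person A"], "location": "City X", "published": date(2010, 7, 2)},
    {"names": ["Person B"], "location": "City Y", "published": date(2010, 7, 5)},
    {"names": ["Person A", "Person B"], "location": "City X",
     "published": date(2010, 7, 10)},
    {"names": ["Person A"], "location": "City X", "published": date(2010, 9, 15)},
]

for cluster in cluster_incidents(articles):
    print(cluster["key"], cluster["first_seen"], len(cluster["articles"]))
```

The date window is what replaces my date-range kludge. An article that merely refers back to an older incident would still start a new cluster here, which is the challenge mentioned earlier.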

Must mention – YQL/Yahoo Pipes are like dynamic languages – instant gratification and visible work.

BTW must mention – YouTube’s “featured video” suggested a “cannibalism video from BBC” while I was watching videos on Vitthal (God). Could not explain to mom exactly what the science behind that one was :) .

[Screenshot: cannibalism video featured while watching god's video. Caption: Screwed up "featured video"]


Irreverent post of irreverent task – tweet analysis

This post was planned at the end of last year, with some weekend hours spent on NLTK and the intention of comparing it with LingPipe/GATE. Unfortunately the latter exercise never finished.

In general I have gone ahead and cleaned up some adjectives, articles, prepositions etc. which do not add any value to the analysis. The idea was to see if it is possible to find out more about a person from their tweets. This was done at the end of 2009, but I got around to writing up the analysis just now.
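For readers who want to try the word-pair counting, here is a minimal sketch using only the Python standard library. It is a stand-in, not the NLTK code used for this post; the stop-word list and the sample tweets are made up.

```python
from collections import Counter

# A tiny illustrative stop-word list; the real cleanup removed more words.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "at"}


def top_pairs(tweets, n=5):
    """Count adjacent word pairs after dropping stop words."""
    pairs = Counter()
    for tweet in tweets:
        # Whitespace tokenization keeps @mentions, #tags and URLs intact.
        words = [w for w in tweet.split() if w.lower() not in STOPWORDS]
        pairs.update(zip(words, words[1:]))
    return pairs.most_common(n)


# Made-up tweets, for illustration only.
sample_tweets = [
    "Looking forward to the weekend",
    "Looking forward to Scala Days",
    "happy birthday to you",
]

for (first, second), count in top_pairs(sample_tweets, n=3):
    print(first, second, count)
```

Note that dropping stop words before pairing makes words adjacent that were not adjacent in the tweet, so "forward weekend" counts as a pair here.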

Which pairs of words get the most mentions?

Tweetid Word pair
Gulpanag & please visit; visit http://gulpanag.net; Col Shamsher; thank you:; happy birthday; Sub Himalayan; Shamsher Singh; Whatever happened; welcome aboard; long term; highly motivated; #skoda #yeti; cotton farmers; district #Fatehgarhsaheb; films please; #ndtvtechlife awards; sub committee; right wing; Looking forward; last night; #mulitpolar #balanceofpower; ‘Chad Gaya; @pavanduggal: #indiancyberlaw; @viveksmisra HTC; Berlin wall; Gaya upar; Hindi-vear; Hot Tub; Rajya Sabha

Iamsrk * insha allah; dont know; looking forward; red chillies; feel awful; low fat; human being.; take care; six pack; sleep now.; really nice.; access version; good enuff.; happy birthday; really nice; many things; keep awake; mani sirs; medicinal properties; pradeep dhoot; push ups
Shashitharoor Lok Sabha; Rajya Sabha; Chandran Tharoor; copy jacob@tharoor.in; 140 characters; Fareed Zakaria; last night; old friend; don’t know; Irfan Pathan; Passport Seva; yrs ago; Shihab Thangal; Foreign Minister; Indian community; Rashtrapati Bhavan; Seva Kendras; next year; Kapil Sibal; Indira Gandhi; Excellent mtg; Indian Ocean; haven’t seen; latest story; Tharoor Foundation; long day; roster duty; Youth Congress; pls write; foreign policy

Virsanghvi # via @addthis; Many thanks; Rude Food; @addthis Rude; @addthis Counterpoint; Bal Thackeray; Parallax View; couldn’t agree; Vir Sanghvi:;; Manu Sharma; Shiv Sena; Raj Thackeray;Amar Singh; agree entirely; good one; foreign policy; @addthis Pursuits; Discovery Travel; Buck Stops; don’t think; next time; Fair enough.; Air India; Shashi Tharoor; Indira Gandhi; Custom Made; much. Glad; upset stomachs; good idea.
Sardesairajdeep~ cnn ibn; write in.; shiv sena; amar singh; don’t miss; per cent; karan thapar; devils advocate; shashi tharoor; aam admi; sri lanka; lalit modi; news docu; ibn lokmat; prime time; real heroes.; three idiots; womens bill; big story:; must watch; looks like; breaking news:; big story; must confess; indo-pak talks; udhav thackeray; mani shankar; phone tapping
Rajivmakhni gadget guru; Kareena Kapoor; ndtv 24×7; Steve Waugh; Xperia X10; ndtv profit; iPhone 3GS; notion ink; size zero; next week; Gadget Guru; much appreciated; NDTV 24×7.; Windows Phone; #ndtvtechlife awards; number portability; sea link; NDTV 24×7; Newsnet today; quick review; @ankitv @abhishektelang; proud moment; ink adam; awards jury; ndtv india; newsnet today; Apple iPad; 100 invites; @RahulDX @ankitv; super thin
Lintool #hadoop #mapreduce; Maryland: http://bit.ly/caJi3h; looking forward; @brandynwhite talking; compute vision; #Hadoop #MapReduce; vision applications; language processing; natural language; object creation; #hadoop #MapReduce; #mapreduce HUG; “explode” mapper; @cloudera @philz42; Marginal Relevance; Maximal Marginal; [pwsim algo]; interactive Maximal; key space; scale natural; 8GB input; FAST 2010; large scale; @kevinweil @abdur; input key; seem like; #Hadoop #MapReduce:; @deliprao @ChrisDiehl; it’s like

Jboner$ Looking forward; Scala Days; available: http://bit.ly/9HzbUn; ‘receive’ method.; @RayRoestenburg: Blogged; Message Routing:; Patterns’: http://is.gd/c5itt; Routing: Part; art: http://bit.ly/civmvk; highly concurrent; main diff; last night; Great talk; Actor’s ‘receive’; Actor’s logic; Stability Patterns’:; That’s true.; @remeniuk Akka; Akka  HotSwap; real state; @pavlobaron @mknittig; @ManoMarks thanks; #Akka Message; ; @djspiewak Multiverse; good idea.; much better; would love
Tunkuv Carl Levin; David Frum; Daily Beast; Charles Leerhsen; Hurt Locker; Parties http://www.thedailybeast.com/blogs-and-stories/2010-04-15/at-last-the-truth-about-tea-partiers/; Tea Parties; all-volunteer army; Paul Stevens; dear old; foreign policy; illegal immigration; turns readers; vituperative morons; online medium; John Batchelor; New York; Milton Friedman; doesn’t mean; #healthcare reform; John Paul; can’t decide; Show tonight

Abdur% Red Dwarf; Anyone else; Looking forward; next time; pretty cool; Happy Birthday; Indian food; new years; pretty sure; taxi magic; Pretty cool; sit next; cant wait; late night; Jack Falstaff; Steve Blank; Stinson Beach; Web 2.0; fresh wasabi; pie chart; last night; San Francisco; Wine tasting; alarm clock; dim sum; kitty hospital; Great dinner; screwing us.; rental cars;

& removed – SJOBA Sub

* removed – kwon doh; lahe wah; minal lahe; nasrun minal; taekwon; wah fatahun; artemis fowl; dash hulk; hulk kai;

# removed – @addthis Parallax

~ removed – cnn ibn:; cnn ibn.;

% removed – “The Art

$ removed – I’ll add; I’m sure; blog post

Strawman Analysis :

Abdur and Lintool sort of belong to opposite extremes of the folks analyzed. They could not be more different. Abdur, coming from Twitter, posts very little information processing/analysis stuff; it is mostly about food, wine and personal stuff. Lintool clearly comes out as the Hadoop/MapReduce man. Abdur also comes out as harassed by United and taxis.

Jboner’s tweets undoubtedly give credence to his Scala/Akka background. He is passionate about Akka, Scala and his friends @debashishg, @viktorlang, @sbtourist. Akka goes nicely with scalability, availability, actors and concurrency. He also points to a lot of presentations and slide decks in general.

Tunkuv comes out swinging as a conservative who is hooked on cricket and still has the patience to go through Cricinfo and blast them for shoddy writing. Being a conservative, he also has views on healthcare and the policies of the US administration. Of course the Tea Party is the slogan of the opposition, so it makes an appearance.

Rajiv Makhni is no doubt “the gadget man”. He is also part of the media crowd who leave “breadcrumbs” to their TV shows, writings or fellow folks in general. There are a lot of links back to the parent firm, NDTV, and its various shows. He pretty much looks obsessed with phones, devices like the iPad/tablets, dimensions and features. The good thing about Rajiv is that he does most of the hard work and in general is not immediately wowed by Apple or Microsoft. His latest coverage of “other” brands of phones is a genuine attempt to give a voice to the alternatives (Micromax).

Rajdeep Sardesai too talks about his channel and upcoming shows (requests to watch, with timings and all), and then makes some comments on current happenings involving Amar Singh, the Thackerays and Shashi Tharoor (man, did the media guys milk him?). He also points to Sagarika’s articles from time to time. This is apparent in the URLs that he points to. I have not shared them here, but extracting the URLs and crawling them provided the basis for that conclusion.

Virsanghvi is unique. He has his hands full with pointers to his articles via @addthis. He does cover a wider array of topics, from the Padma Bhushan fiasco, Headley and China/Pakistan to his love of the Thackeray/Pawar families. I wish I could do sentiment analysis of those tweets, most of which are full of sarcasm. He made it a point during Tharoor-gate to pin Tharoor’s followers with his unique insights on the folly of being Tharoor and his followers. He followed that with a one-page writeup in one of the magazines, basically saying Tharoor was pretty much a product of vanity. He also made it personal with respect to Mumbai vs Bombay, and did not give two hoots about the fact that he and his sort speak French or Italian on the 2-day visits they make to those countries but consider it beneath themselves to learn a local language (maybe he will pick up a cue from Aamir Khan, who is learning it, or Rajnikant, who learnt/mastered the local language). In the case of food, HOMP on “ndtv good times”, guys take his happiness in terms of genuine freshness.

ShashiTharoor – the man who was brought down by Twitter was, unfortunately, a judicious user of Twitter. He posted his meetings, addresses and visits to his constituency and state in his tweets. One can get a glimpse of his work just through his tweets. The man was transparent. He did have cricket on his radar though. Unlike the media folks, he did not pre-judge and pass comments, or forward to his blog/writing etc. The only time he opened up about his personal life was about his son and his achievements, just as any other father would. His ministerial work too gets prominence, though he did keep his boundaries with respect to foreign affairs. One can clearly see the man had India on his mind (this is without stemming, so it could really be a lot more). He too comes out as a pretty savvy re-tweeter. But his focus is clear – India, Kerala, meetings, visits and foreign affairs. Cricket does pop up but is just engulfed by all the meetings, briefings and mentions of the issues he is tackling or the good things he sees.

Shahrukh comes out as a genuine person who wants to connect with his fans and shares tips with youngsters about good/bad habits and the need to focus on work. He is extremely dedicated to his kids and family, and that comes out pretty strongly. His faith also makes an appearance through his invocation of the phrase insha allah (basically, as/if God wishes). His sleep pattern makes an appearance too: a tweet will be about the need for sleep in the early morning, when he goes to sleep. Exercise, healthy food and fitness make their way into the tweets at times. His firm Red Chillies and his team KKR make brief seasonal appearances. He is also the only one who dwells on his past – mother and father – again re-affirming his attachments. His language also comes out as pretty dominated by verbs and action, reflecting his persona? Shahrukh, as mentioned earlier, comes out swinging for family (kids), friends (Karan), cities (Mumbai, Delhi) and his present disposition (Kolkata Knight Riders or the movie ra.one).

Gulpanag comes across as a refreshing change, with an interest in riding, biking, automobiles and the issues of the time. These range from the “army scam”, “farmer plight” and “girl child” to eco-friendliness and fitness in general (running, biking).

There is a gentleman whose tweet analysis I have not done in detail here. But Pankaj Pachauri comes across as a genuine person with nuanced interests across varied subjects. His tweets and his programmes are both high quality. He is one of the very few journalists who present facts and analyze situations such as price increases, local manufacturing, Africa mining, gas price increases, and the cost of education and its relation to entrepreneurship. But unfortunately, as it would happen, his shows are mistimed and, like the regional newspapers and the good folks there, he is ignored.

Word collocation, as the above exercise is called, can easily be combined with actual word frequency across tweets. In this case I have not taken the trouble to normalize the data for synonyms, homonyms or polynyms. I just intersected across the WordNet and Brown corpora. There is a slight difference between the frequencies from regular text tokenization and custom tokenization. For our purpose we will ignore regular text tokenization. Name and place extraction via Alchemy and Evri did not work very well, nor did sentiment analysis of single tweets. The other challenge was the many new “SMS/Twitter” words like hv, 4, lrnt, mv etc. These words need to be replaced with the actual words via a dictionary. Another idea would be to do a common-factor search across a bunch of folks (via # tags first and then other simpler things).
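The dictionary replacement of SMS-style words could look like the sketch below. The expansions are my guesses at what hv, lrnt and mv stand for, and the sample tweet is made up.

```python
# Assumed expansions; "mv" in particular is a guess.
SMS_WORDS = {"hv": "have", "4": "for", "lrnt": "learnt", "mv": "move"}


def normalize(tweet, mapping=SMS_WORDS):
    """Replace whole tokens that appear in the SMS dictionary."""
    return " ".join(mapping.get(token.lower(), token) for token in tweet.split())


print(normalize("hv lrnt a lot 4 sure"))
```

A plain lookup like this also rewrites a genuine number 4 (as in "4 pm") to "for", so a real dictionary would need some context check for tokens like that.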

NLTK is a useful kit for a quick 2 hrs of work. It has its quirks, and I have not exploited the whole NLP side of its world. One of these days :)

Tweets downloaded in Dec 2009 and appended in May 2010.

Twitter api – http://code.google.com/p/python-twitter/ (python 2.6)
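Per-user word counts like the ones listed in this post can be produced along these lines. This is a standard-library sketch, not the original script. Entries such as "time." and "sorry," in the counts suggest plain whitespace tokenization, so the sketch does the same; the stop-word list, the user name and the tweets are made up.

```python
from collections import Counter

# Tiny illustrative stop-word list.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "at"}


def word_frequencies(tweets_by_user, top_n=3):
    """Return the top_n most frequent non-stop words for each user."""
    result = {}
    for user, tweets in tweets_by_user.items():
        counts = Counter(
            word
            for tweet in tweets
            for word in tweet.lower().split()
            if word not in STOPWORDS
        )
        result[user] = counts.most_common(top_n)
    return result


def print_table(freqs):
    """Print counts in the same 'word -> count' layout used in this post."""
    for user, rows in freqs.items():
        print(user)
        for word, count in rows:
            print(f"{word:<26}->{count:>26}")


# Made-up tweets for a hypothetical user.
sample = {"demo_user": ["great talk on #hadoop", "#hadoop is great", "more #hadoop"]}
print_table(word_frequencies(sample, top_n=2))
```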

How many times word was tweeted (rough and not complete for space)

Tweet Id Word frequency
Gulpanag

@gulpanag                 ->                        39

please                    ->                        38

http://gulpanag.net       ->                        37

don’t                     ->                        35

about                     ->                        32

visit                     ->                        30

think                     ->                        26

being                     ->                        25

should                    ->                        25

india                     ->                        24

check                     ->                        22

thanks                    ->                        44

women                     ->                        21

twitter                   ->                        18

world                     ->                        17

always                    ->                        16

happy                     ->                        16

welcome                   ->                        16

great                     ->                        15

lol!!                     ->                        14

films                     ->                        13

indian                    ->                        13

bombay                    ->                        12

chandigarh                ->                        12

#fail                     ->                        11

#fatehgarhsaheb           ->                        11

#iphone                   ->                        10

#punjab                   ->                        10

phone                     ->                        10

really                    ->                        10

riding                    ->                        10

Iamsrk

kids                      ->                        101

thanx                     ->                        70

life                      ->                        69

think                     ->                        69

sleep                     ->                        61

should                    ->                        53

great                     ->                        48

being                     ->                        47

really                    ->                        45

happy                     ->                        42

films                     ->                        36

world                     ->                        36

write                     ->                        36

always                    ->                        35

ra.one                    ->                        35

shoot                     ->                        35

people                    ->                        31

start                     ->                        31

everyone                  ->                        30

never                     ->                        30

still                     ->                        29

better                    ->                        26

friend                    ->                        26

without                   ->                        25

friends                   ->                        24

later                     ->                        23

reading                   ->                        23

thought                   ->                        23

doing                     ->                        22

sometimes                 ->                        22

time.                     ->                        22

believe                   ->                        21

film.                     ->                        21

karan                     ->                        21

looking                   ->                        21

@kkriders                 ->                        20

family                    ->                        20

kolkata                   ->                        20

mumbai                    ->                        20

didnt                     ->                        19

please                    ->                        19

sleep.                    ->                        19

sorry                     ->                        19

them.                     ->                        19

things                    ->                        19

anyone                    ->                        18

insha                     ->                        18

thing                     ->                        18

today                     ->                        18

twitter                   ->                        18

allah                     ->                        17

early                     ->                        17

makes                     ->                        17

match                     ->                        17

right                     ->                        17

trying                    ->                        17

watch                     ->                        17

which                     ->                        17

ShashiTharoor

indian                    ->                       124

india                     ->                       164

kerala                    ->                       113

@shashitharoor            ->                        82

delhi                     ->                        64

great                     ->                        64

tweet                     ->                        60

visit                     ->                        55

addressed                 ->                        46

minister                  ->                        45

write                     ->                        44

dinner                    ->                        43

foreign                   ->                        43

always                    ->                        42

people                    ->                        42

world                     ->                        41

today                     ->                        40

cricket                   ->                        39

public                    ->                        37

state                     ->                        36

sorry,                    ->                        35

spoke                     ->                        35

official                  ->                        34

speech                    ->                        32

congress                  ->                        29

excellent                 ->                        29

lunch                     ->                        28

visited                   ->                        27

office@tharoor.in         ->                        26

VirSanghvi

thanks                    ->                       349

@addthis                  ->                       133

think                     ->                        69

agree                     ->                        64

indian                    ->                        45

counterpoint              ->                        43

india                     ->                        43

liked                     ->                        36

thackeray                 ->                        30

point                     ->                        24

enjoyed                   ->                        23

guess                     ->                        23

media                     ->                        23

really                    ->                        22

against                   ->                        21

pakistan                  ->                        21

still                     ->                        19

@pritishnandy             ->                        18

always                    ->                        18

pilots                    ->                        18

@vinkaycee                ->                        17

delhi                     ->                        17

times                     ->                        17

legal                     ->                        16

parallax                  ->                        16

sorry                     ->                        16

26/11                     ->                        15

@thyagu2009               ->                        15

bombay                    ->                        15

fight                     ->                        15

india’s                   ->                        15

support                   ->                        15

state                     ->                        14

watching                  ->                        14

@gulpanag                 ->                        13

entirely                  ->                        13

foreign                   ->                        13

government                ->                        13

great                     ->                        13

police                    ->                        13

politicians               ->                        13

absolutely                ->                        12

action                    ->                        12

anyone                    ->                        12

certainly                 ->                        12

channels                  ->                        12

headley                   ->                        12

padma                     ->                        12

problem                   ->                        12

public                    ->                        12

twitter                   ->                        12

china                     ->                        11

happy                     ->                        11

Sardesairajdeep

india                     ->                        107

indian                    ->                        53

watch                     ->                        53

tonight                   ->                        47

story                     ->                        43

mumbai                    ->                        41

today                     ->                        41

cricket                   ->                        36

great                     ->                        33

tharoor                   ->                        33

write                     ->                        33

special                   ->                        31

breaking                  ->                        30

hockey                    ->                        30

political                 ->                        29

channel                   ->                        28

years                     ->                        28

world                     ->                        25

media                     ->                        24

debate                    ->                        23

against                   ->                        22

sachin                    ->                        22

still                     ->                        21

11.30                     ->                        20

report                    ->                        20

truly                     ->                        20

guess                     ->                        19

needs                     ->                        19

three                     ->                        18

watching                  ->                        18

indo-pak                  ->                        17

terror                    ->                        17

twitter                   ->                        17

ibnlive.com               ->                        16

justice                   ->                        16

power                     ->                        16

prices                    ->                        16

sunday                    ->                        16

indians                   ->                        15

journalism                ->                        15

stories                   ->                        15

womens                    ->                        15

karan                     ->                        14

killed                    ->                        14

pakistan                  ->                        14

singh                     ->                        14

thought                   ->                        14

because                   ->                        13

delhi                     ->                        13

india’s                   ->                        13

parliament                ->                        13

party                     ->                        13

rahul                     ->                        13

violence                  ->                        13

attack                    ->                        12

budget                    ->                        12

chief                     ->                        12

RajivMakhni

cellguru                  ->                        68

gadget                    ->                        44

phone                     ->                        43

newsnet                   ->                        39

today                     ->                        38

about                     ->                        33

thank                     ->                        32

great                     ->                        31

review                    ->                        26

first                     ->                        25

6.30pm                    ->                        24

iphone                    ->                        24

kareena                   ->                        23

phones                    ->                        23

india                     ->                        21

watch                     ->                        21

coming                    ->                        20

kapoor                    ->                        18

apple                     ->                        17

steve                     ->                        17

10.30pm                   ->                        16

mobile                    ->                        16

price                     ->                        16

people                    ->                        15

12.30                     ->                        13

twitter                   ->                        13

24×7.                     ->                        12

android                   ->                        12

nokia                     ->                        12

pretty                    ->                        12

profit                    ->                        12

@vikramchandra            ->                        11

market                    ->                        11

better                    ->                        10

content                   ->                        10

notion                    ->                        10

samsung                   ->                        10

tablet                    ->                        10

details                   ->                         9

phone,                    ->                         9

shows                     ->                         9

start                     ->                         9

xperia                    ->                         9

amazing                   ->                         8

garmin                    ->                         8

microsoft                 ->                         8

questions                 ->                         8

quite                     ->                         8

8.30pm                    ->                         7

Lintool

#hadoop                   ->                        63

#mapreduce                ->                        57

about                     ->                        19

dryadlinq                 ->                        11

@kevinweil                ->                         9

@abdur                    ->                         8

@deliprao                 ->                         7

@ian_soboroff             ->                         7

paper                     ->                         7

@brandynwhite             ->                         5

parallel                  ->                         5

processing                ->                         5

university                ->                         5

vision                    ->                         5

Jboner

#akka                     ->                        54

scala                     ->                        23

great                     ->                        21

#scala                    ->                        12

clojure                   ->                        12

concurrency               ->                        11

presentation              ->                         9

better                    ->                         8

cool.                     ->                         8

doing                     ->                         8

looking                   ->                         8

availability              ->                         6

awesome.                  ->                         6

forward                   ->                         6

interesting               ->                         6

looks                     ->                         6

support                   ->                         6

actors.                   ->                         5

after                     ->                         5

scalability               ->                         5

slides                    ->                         5

stability                 ->                         5

starting                  ->                         5

still                     ->                         5

Tunkuv

#cricket                  ->                        14

#goldman                  ->                        11

#healthcare               ->                        11

piece                     ->                        11

@telegraphnews            ->                         9

about                     ->                         8

david                     ->                         8

indian                    ->                         8

#india                    ->                         6

beast                     ->                         6

cricket                   ->                         6

right                     ->                         6

@guardianbooks            ->                         5

comment                   ->                         5

human                     ->                         5

interview                 ->                         5

Abdur

great                     ->                        81

twitter                   ->                        91

about                     ->                        76

thanks                    ->                        59

should                    ->                        46

happy                     ->                        40

@kevinweil                ->                        39

people                    ->                        37

going                     ->                        34

anyone                    ->                        27

reading                   ->                        25

think                     ->                        25

there                     ->                        23

pretty                    ->                        22

still                     ->                        22

watching                  ->                        22

@goldman                  ->                        21

because                   ->                        21

check                     ->                        20

today                     ->                        19

Which people get a lot of mentions in tweets?

Twitter id | Top mentions
Gulpanag | @vkaul, @nitinsgr, @nithinkd, @angadc, @achitnis, @reallybuffalo, @sonaliranade, @rwac48, @arifone, @madhulata, @maheshmurthy, @ssarbjit, @acorn…@sherbir
iamsrk | @kkriders, karan
virsanghvi | @addthis, @vikram_sood, @pritishnandy, @vinkaycee, @thyagu2009, @itssotweet, @kanchangupta, @gulpanag
Sardesairajdeep | @imangy, @visaraj, @arunraveen, @RohanBhade, @swapsdailydose, @jinglebells27, @sidharth_madhav, @Varunrd, @St_Hill, @aurodip, @MirzaSania, @jemin_p, @santheepnair, @bhogleharsha
Rajivmakhni | @vikramchandra, @ankitv, @gulpanag, @sachin_malhotra, @achitnis, @mariagorettiz
shashitharoor | @shashitharoor, @23jacob, @ashwinsid, @ramgandhi52, @jaipurprince, @karmadude, @chrisbrogan, @cricketwallah, @arpitamgupta, @josephseb, @arungiri, @khalidalkhalifa, @PARVEZ89
Jboner | @jboner, @djspiewak, @debasishg, @viktorklang, @sbtourist, @pavlobaron
Lintool | @kevinweil, @abdur, @deliprao, @ian_soboroff, @brandynwhite
abdur | @kevinweil, @goldman, @gregpass, @elizabeth, @jayvirdy, @jess, @evan, @pankaj
tunkuv | @telegraphnews, @guardianbooks, @prempanicker, @saliltripathi, @ultrabrown

Is the user Twitter-savvy, using hashtags/lists?

Twitter id | Top hashtags
Abdur | #sgu, #justreturnzero, #bestConfLunchOfAllTime, #chirp, #whereisbiz, #SS., #sfcabbies, #sfcabsareslow, #spoiledbynorthercalweather, #copeyesightfail, #notquittingdayjobforstandup, #conspiracytheories, #2, #boycotunited, #tsafail, #istlecture, #tokyoiswaycool, #SantaAbuse, #ihatedish, #faileddonotcalllist, #endphonemarketers, #zeroinboxfail, #mylaptopisscrewed, #3, #mustfix, #goodfoodforthought, #foo09, #lifeleasons, #Retarded, #ssm09, #sfheatfail, #fixfuckedsfcabcompanies, #TED, #7, #SQLputdowns, #cikm2008, #ceas
Gulpanag | #quote, #shatabdi, #CWG, #Chandigarh, #amplifier, #trek, #Fatimabhutto, #ipad, #Hockey, #t20, #delhi, #triathlon, #race, #Womenempowerment, #BRICK, #lyrics, #jetlag, #thiruvananthapuram, #RoyalEnfield, #Kasauli, #MAC, #iPhone, #Twitter, #iPhone, #Delhi, #AlQaeda, #jem, #LET, #roadrunning, #strike, #HT, #Mumbai, #respect, #Airtel, #fail, #sin, #sine, #indianexpress, #kasab, #newlylearntfact, #orkutification, #olaytotaleffects, #Ladakh, #raiddehimalya, #NationalGeographic, #migraine, #wolframalpha, #LED, #Pvr, #reliance, #iPhone, #blackberry, #Hot, #tatasky, #Surya, #ipl, #wonderwhy, #ndtvtechlife, #unitedindiapak, #kindness, #siemens, #stalker, #democracy, #china, #harleydavidson, #indiancyberlaw, #KingOfGoodTimes, #harassment, #childmarriage, #zeroaccuracy, #capitalstagnation, #africa., #reva, #hockey
iamsrk | n/a
Jboner | #Scala, #Akka, #NodeJS, #MongoDB, #AMQP, #REST, #UnitTesting, #javaone, #Terrastore, #jax2010, #clojure, #playframework, #camel, #github, #assembla, #maven
Lintool | #SIGIR2010, #WWW2010, #MapReduce, #Hadoop, #1, #2, #cloudcomputing, #nlp, #LHC, #pig, #goog, #aws
rajivmakhni | #nowwatching, #iPhone, #BBM, #Appworld, #tech-fairs, #ndtvtechlife, #MycelebList, #ndtvgreenathon
Shashitharoor | #TWITronym, #India, #HiFlyers2009, #TED, #US, #nuclear, #NMST, #awesomeindianthings, #TEDIndia, #beatcancer, #Nobel, #Gandhi, #uighurs, #Jet2Kerala
tunkuv | #iPad, #DavidHockney, #WorldT20, #cricket, #Scotus, #Euro, #pakistan, #Beatles, #leadersdebate, #GordonBrown, #Goldman, #UKelections, #Arizona, #Belgium, #SpongebobSquarepants, #Heimlich, #Britain, #WallStreet, #Facebook, #Samaranch, #TeaParty, #NBC, #Pulitzer
virsanghvi | n/a

The code for this analysis may be shared on git a little later, or on request.
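
For anyone who wants to try something similar, here is a minimal sketch of how counts like the ones in the tables above could be produced. It is not the original analysis code. It assumes tweets are already available as plain strings, that words are split on whitespace and lowercased, and that only tokens of 5 or more characters are counted (a guess based on the tables, where punctuation such as "phone," is left attached). The function names and the sample tweets are made up for illustration.

```python
import re
from collections import Counter

MENTION_RE = re.compile(r"@\w+")   # @handle
HASHTAG_RE = re.compile(r"#\w+")   # #tag


def top_terms(tweets, min_len=5, n=20):
    """Count lowercased whitespace-separated tokens with at least min_len characters."""
    counts = Counter()
    for tweet in tweets:
        for token in tweet.lower().split():
            if len(token) >= min_len:
                counts[token] += 1
    return counts.most_common(n)


def top_mentions(tweets, n=10):
    """Most frequent @mentions across all tweets."""
    counts = Counter(m.lower() for t in tweets for m in MENTION_RE.findall(t))
    return counts.most_common(n)


def top_hashtags(tweets, n=10):
    """Most frequent #hashtags across all tweets."""
    counts = Counter(h.lower() for t in tweets for h in HASHTAG_RE.findall(t))
    return counts.most_common(n)


# Made-up sample tweets, not real data.
sample = [
    "Great talk on #hadoop and #mapreduce by @kevinweil",
    "Reading a paper about #hadoop with @kevinweil and @abdur",
    "Parallel processing at the university, thanks @abdur",
]

# Print in the same "term -> count" layout as the tables.
for term, count in top_terms(sample, n=5):
    print(f"{term:<25} -> {count:>5}")
```

A stop-word list would probably serve better than the length cut-off, but the cut-off keeps the sketch short.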


Picasa’s face detection is eerie

Having used Picasa’s face detection tool since it was unveiled, I have always wondered why there is no simple API for using it across URLs and local storage.

The way it extracts features and clusters faces, even when a picture is rotated or fuzzy, is just awesome. There are very few false positives. How it extracts features and normalizes across rotation and fuzziness is something I would love to understand. That all of this works without ever being trained on the “best picture(s)” is very refreshing. If this API, which I am sure all search engines have to some degree, were made public, it would help solve a lot of the challenges that exist today. I have tried it on a few X-rays and MRI scans, though without much success.

If you want to try/understand face detection/clustering and related stuff:

- OpenCV tool -  http://www.cognotics.com/opencv/servo_2007_series/part_2/sidebar.html

One of the many research papers behind the technology in OpenCV:

http://research.microsoft.com/~viola/Pubs/Detect/violaJones_CVPR2001.pdf
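
The Viola-Jones paper linked above builds its detector on an "integral image", which lets the sum of any rectangle of pixels be read off in four lookups, and on simple rectangle ("Haar-like") features computed from those sums. Below is a minimal pure-Python sketch of just those two pieces, assuming a grayscale image given as a list of rows. It leaves out the boosting and cascade stages of the paper, the function names are my own, and it says nothing about how Picasa works internally.

```python
def integral_image(img):
    """Return the summed-area table, padded with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle feature: left-half sum minus right-half sum."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy 4x4 "image": bright left half, dark right half.
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
ii = integral_image(image)
print(rect_sum(ii, 0, 0, 4, 4))          # total of all pixels: 80
print(two_rect_feature(ii, 0, 0, 4, 4))  # strong vertical edge response: 64
```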

At some point Photosynth too has to do something like this to stitch images together and create walkthroughs and point clouds. It is a little different from face detection and matching, but a similar problem nonetheless.

http://photosynth.net/view.aspx?cid=28d131ab-4cc6-4702-9278-7790f2c33cb4

Update: 16 July 2010:

Paper – http://portal.acm.org/citation.cfm?id=1290121 (Eigenfaces representation)


20 min review of latest lookup engines

This post is the result of a review of lookup engines that took 20 minutes at most, and should be taken with a load of salt.

Both Wolfram|Alpha and Google Squared (G^2) are great attempts at compiling “data” and doing something useful with it.  Wolfram leaves the better impression because of its focused-answer approach: you do not see link after link of information to sift through.

Wolfram – It faced the challenge of how to build up a massive index like the others to answer search queries, so it went for specialized datasets that allow certain kinds of computation on them. Process: crunch the data, hand-categorize it and create charts/visuals where possible. Then metamodel the “known stuff”, for which they, being in a specialized field, have a whole lot of algorithms/formulae to back them up. So they cover domains like science, weather and geography, where a lot of data exists today. Humans will definitely be required to massage/clean the data so that the right “inference” can take place later in the engine. This is apparent from http://www63.wolframalpha.com/participate/participate.html

I was really skeptical of their “computational” claim, but a scoped query shows that computation might be happening. Try a query like “mars” and look at the answers; now try “distance between mars and Jupiter”, and it actually computes it. It can do this for known entities with allowed operations. (Bing and Google try to point to Wikipedia; G^2 does not understand the query at all.)

So if Wolfram is actually doing computation, that is a big thing. But also look at their history: they are a computation software firm. They would love everything neatly categorized, and would love information from the search vendors about “data” queries to see where they can do “computation”.

The saving grace: if you notice, it is not a great generic search engine. No ego surfing; Wolfram does not know you :)

Wolfram is open about where it sources its information. It also shares how it is interpreting the query – pretty much a first among major search engines.

G^2 – Google already calculates how popular a given item is by building up PageRank and refreshing it for generic phrase search. The challenge was how to utilize the existing PageRank and attempt categorization. It is not exactly a search engine but a comparison engine (PriceGrabber etc. have done this much better for much more tangible items). Given a query, it breaks it down to known facts which can be compared across certain dimensions. This is where the magic might come in, if they are doing it purely off the massive index and related URL-based information. (This is not apparent from a simple search like “pakistan india china”: the query gets interpreted as “country” and the three are compared.) It adds interactivity by letting you add extra dimensions. If you add each of the items separately, you start seeing the source of the information: Wikipedia.

Computation – of course it does 1123*2 etc., but computation like Wolfram’s is not possible.

But this could be great stuff for verticals like legal, pharma etc.: the ability to compare information such as witnesses/dates for legal, or chemicals/tests/results for drugs. If they are serious about it, they need to open it up for verticals – both the process and the API.

Queries

To see the difference, type “Bank of America stock price” into both. What do you see?

Wolfram – does extra work by looking up the “data” and plotting it. I really wonder whether they have a static plot of the data across an x range.

G^2 – just goes through existing pages, finds the “best bet – company!” and “tries” to categorize the available information (page data) into semantic meaning of some sort.

Now just type “Bank of America” and compare the results.

Queries which humans really want to do

  1. “When was MSFT more than 40 $” – there is data, but who will interpret it? (Wolfram, Google Squared)
  2. But you can do basic queries like “rainfall in Delhi this month” (Wolfram is better; Google Squared does not know what to do here)
  3. But not “what is the average rainfall in Delhi this month”.  (Google Squared has more resources to refer to than it had at launch; Wolfram attempts to break down the location)

Again, try putting these queries through all the engines and see the results. Wolfram comes out strong wherever structured data is available.  It is worth looking at Wolfram’s “input interpretation” tag for clues about what they are doing and how granular they can go.
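
To make the point about structured data concrete: once the numbers sit in a clean time series, the queries listed above become one-liners, and the hard part is interpreting the question. A minimal sketch follows. All the values are made up for illustration (they are not real prices or rainfall measurements), and the function names are my own.

```python
from datetime import date
from statistics import mean

# Hypothetical closing prices; not real market data.
msft_close = {
    date(2010, 1, 4): 38.5,
    date(2010, 1, 5): 40.2,
    date(2010, 1, 6): 41.0,
    date(2010, 1, 7): 39.9,
}

# Hypothetical daily rainfall in mm; not real measurements.
delhi_rain_mm = {
    date(2010, 7, 1): 0.0,
    date(2010, 7, 2): 12.0,
    date(2010, 7, 3): 6.0,
    date(2010, 8, 1): 30.0,
}


def days_above(series, threshold):
    """'When was X more than N': dates whose value exceeds the threshold."""
    return sorted(day for day, value in series.items() if value > threshold)


def monthly_average(series, year, month):
    """'Average X this month': mean over one calendar month, None if no data."""
    values = [v for d, v in series.items() if d.year == year and d.month == month]
    return mean(values) if values else None


print(days_above(msft_close, 40))
print(monthly_average(delhi_rain_mm, 2010, 7))
```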

Bing

Bing’s embedded engine, Powerset, hopefully does more than drive the left-side filters for queries in domains like music/movies/travel; as it stands it certainly does not do justice to its fame. Hopefully it can combine with FAST for interesting work in future.

Where Bing shines at present -

1. Entertainment search (look at the left-side filters, which provide instant access to the right information).

2. Travel – This is visible when using the US as location and executing searches like “flights from seattle to san francisco”.

See the answer at the top, which indicates how fares are expected to behave. Click on that link and you get the mashup (integration from Farecast etc.).

3. Quality of information – Both Bing and Yahoo provide top links from well-vetted sources, compared to Google, which provides information from user-generated content. This is more relevant for recent news and healthcare-related searches, where the correct source is important.

4. Local search is where it shines, especially when used from a mobile. I use it all the time now. It provides locally relevant information.

IMHO – Machine learning is not yet advanced enough to make sense of presentation data (locked in JS/HTML) unless that data is freed up and ontology prevails over the content.

The background information for the fact-based engines is below.

  1. http://start.csail.mit.edu/
  2. http://www.trueknowledge.com/ – look at the kind of searches/Q&A you can do, and at how narrow the domain is.  Try their Firefox add-in for Bing and explore it.
    1. http://blog.trueknowledge.com/2009/05/how-to-build-a-universal-answer-engine-ten-vital-principles.html
    2. http://www.cyc.com – wolfram and other q/a engine inspiration.
    3. http://www.twine.com/technology – to see attempts of crowdsourcing the existing information and hiding rdf/owl details.
      1. If the data is present in RDF/OWL format, we could possibly use SPARQL – http://www.w3.org/TR/rdf-sparql-query/ . But this is a long way off. Check out http://sindice.com/map and you can see some of the “categorization” happening.
      2. Just a side note – oracle decided to support triples natively – http://www.oracle.com/technology/tech/semantic_technologies/index.html
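
To give a feel for what querying triples means, here is a toy sketch of the kind of pattern matching that a single SPARQL triple pattern expresses: facts are stored as (subject, predicate, object), and a query is a triple in which terms starting with "?" are variables. This is not SPARQL itself and not how any of the engines above are implemented. The predicate names are made up and the handful of facts are only illustrative.

```python
# Illustrative facts as (subject, predicate, object); predicate names are made up.
TRIPLES = [
    ("Rajiv_Gandhi", "type", "Person"),
    ("Rajiv_Gandhi", "date_of_death", "1991-05-21"),
    ("Delhi", "type", "City"),
    ("Delhi", "located_in", "India"),
]


def match(pattern, triples=TRIPLES):
    """Yield variable bindings for one triple pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    """
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within the pattern must bind to the same value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


# "When did Rajiv Gandhi die?" as a pattern with one variable.
answers = list(match(("Rajiv_Gandhi", "date_of_death", "?when")))
print(answers)

# A question the fact base knows nothing about simply returns no answer.
print(list(match(("INS_Khukri", "date_sunk", "?when"))))
```

The second query shows the trade-off discussed in this post: a fact engine can only answer from the facts it holds, whereas a phrase search will always return links.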

Queries to try (Across search engines)

Note how different search engines behave –

Wolfram tries to identify temporal information, location, color etc. based on the sources/domains it knows about.

Google/Bing bring up a number of pages containing the phrase.

TK’s answer provides a clue to how it classifies the query itself across its known ontological buckets.

  1. Who was ruler of China in 1000 AD – (click on More… in Wolfram to see/guess at what is really happening.  You can only query what you have :) )
  2. Children population in China – qualify with a year
  3. When did Rajiv Gandhi die?  Only TK attempts an answer (http://www.trueknowledge.com/q/when_did_Rajiv_Gandhi_die) – see the behind-the-scenes working, where it shows the question/fact etc.
  4. When did INS Khukri sink (again Google/Bing bring up tons of links) – no answer (TK is honest – no database of facts)

So it boils down to what we are looking for: “facts/answers”, or just “information from which we want to draw answers”.  True answers/facts are possible when there is valid data (birthday/location/event etc.) or when a set of people organize incoming data into a massive fact engine (TK/Wolfram). Then an inference engine can sift through the facts, or a computation engine (Wolfram) can try to provide absolute answers/facts. “Phrase lookup” will be well served by the “Googles and Bings”.

It is all good competition for the end consumer. These will be complementary offerings, and they need a lot of polish. The dream would be a SPARQL application on existing data: your data from LinkedIn could be picked up and associated with a publication, an opinion on a forum or a photo on Flickr/Picasa in a more authoritative way. But most probably it will remain a dream. Chetan Kunte is mighty impressed with Wolfram, and that prompted me to look at it in this detail.

Interesting facts -

Sergey Brin interned at Wolfram.

Wolfram  generates strong opinions.

http://www.americanscientist.org/issues/id.3261,y.0,no.,content.true,page.2,css.print/issue.aspx (more balanced)

http://www.cscs.umich.edu/~crshalizi/reviews/wolfram/ (crude)

http://chrishecker.com/Kurt_Gödel_is_Laughing_His_Ass_Off_Right_Now

