RTI (KILL OR ATTACK) – Cluster search results? Google Predict/Apache Mahout?

Heard that another RTI activist was shot in Ahmedabad. I wanted to look up the number of people killed or attacked for RTI activity. Unfortunately, a lot of the results talk about the same incident. I am interested in unique incidents; I got them by changing the date range, which is a kludge. Crawling should not be the only thing search engines do: they have enough information to cluster results that refer to the same incident into a “bunch”.

I can see the challenge, though: an article that refers to an older incident makes clustering harder. Another challenge is an article summarizing the attacks on RTI applicants over the years, but these will be very few anyway.

Apparently a Google Predict account or Apache Mahout should enable this. What is involved?

1. Identifying groups/clusters

2. Cluster identification on common name, date, and location: extract the date of crawl, the date of publication of the article, the source, and the name of the person attacked or killed. Name extraction can be done by phrase analysis of the sentence that mentions the two words near each other, then looking up the subject. Something similar needs to be done for the Delhi traffic police tweets. Location extraction is part of the same process. If trusted sources provide the data, a pattern can be assumed: Location. Date. “Phrase describing the incident”, followed by a lot of extraneous information.

3. Disambiguating based on an exclusive notification rather than a summary article on scores of similar incidents
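
A rough sketch of steps 1 and 2, assuming the name, location, and publishing date have already been extracted from each article. The field names, the 7-day window, and the sample records are all made up for illustration; this is not how Google or Mahout would do it, just the simplest grouping that matches the idea above.

```python
from collections import defaultdict
from datetime import date, timedelta

def cluster_incidents(articles, window_days=7):
    """Group articles that share a name and location and were published
    within `window_days` of the previous article in the same group."""
    by_key = defaultdict(list)
    for art in articles:
        # Normalize case and whitespace so trivial differences do not split a cluster.
        key = (art["name"].strip().lower(), art["location"].strip().lower())
        by_key[key].append(art)

    clusters = []
    for group in by_key.values():
        group.sort(key=lambda a: a["published"])
        current = [group[0]]
        for art in group[1:]:
            gap = art["published"] - current[-1]["published"]
            if gap <= timedelta(days=window_days):
                current.append(art)
            else:
                # A long gap suggests a separate incident (or a later retrospective).
                clusters.append(current)
                current = [art]
        clusters.append(current)
    return clusters

# Hypothetical records, not real incidents.
articles = [
    {"name": "A. Example", "location": "Ahmedabad", "published": date(2010, 7, 20)},
    {"name": "a. example", "location": "ahmedabad ", "published": date(2010, 7, 21)},
    {"name": "B. Sample", "location": "Pune", "published": date(2010, 1, 13)},
]
clusters = cluster_incidents(articles)
print(len(clusters), "unique incidents")
```

A summary article that names many people would need to be filtered out before this step (point 3), since it carries no single name/location key.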

Trusted source information: rather than looking at all results, the search should be directed at 2-3 places and the information extracted from them, e.g. Reuters, BBC, PTI, Xinhua.

Must mention: YQL and Yahoo Pipes are like dynamic languages, with instant gratification and visible work.

BTW, must mention: YouTube’s “featured video” suggested a “cannibalism video from bbc” while I was watching videos on Vitthal (God). Could not explain to mom exactly what the science behind that one was :) .

[Screenshot: screwed up "featured video", a cannibalism video featured while watching god's video]


Delhi traffic police tweet analysis

Unfortunately there is no normalized data for locations, places, and landmarks. This prevents getting the geodata for points of interest (POI) and plotting them on a map, which would give a better picture of accident-prone locations and the locations with the most diversions over time. Results can be a little better with analysis of sentences of the form verb (obstruction/diversion) at location. One can also find out when things cleared and look at the average time taken to clear incidents.
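
The "verb at location" idea can be sketched with a regular expression. The pattern below is an assumption based only on the tweet fragments quoted in this post ("Traffic obstruction at X due to Y", "Traffic obstruction has been cleared at X"); real tweets will have more variants, and `parse_tweet` is just an illustrative name.

```python
import re

# Assumed phrasing, taken from the tweet fragments in this post.
TWEET_RE = re.compile(
    r"Traffic\s+(?P<kind>obstruction|heavy|jam)\s+"
    r"(?:has\s+been\s+(?P<status>cleared|removed)\s+)?"
    r"(?:at|from|between)\s+(?P<location>.+?)"
    r"(?=\s+(?:due\s+to|has\s+been|towards)\b|\s*$)",
    re.IGNORECASE,
)

def parse_tweet(text):
    """Return kind, status and raw location for one tweet, or None if no match."""
    m = TWEET_RE.search(text)
    if m is None:
        return None
    return {
        "kind": m.group("kind").lower(),
        # No explicit "has been cleared/removed" means the incident is being reported.
        "status": (m.group("status") or "reported").lower(),
        "location": m.group("location").strip(),
    }

print(parse_tweet("Traffic obstruction at SHASTRI PARK red light due to a truck broken down"))
print(parse_tweet("Traffic obstruction has been cleared at Dhaula Kuan"))
```

With tweet timestamps (not available in this data yet), pairing a "reported" and a "cleared" record for the same location would give the time taken to clear an incident.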

Words which come together :

Traffic obstruction; broken down.; may contact; specific suggestions/complaints; broke down.; area specific; Raja Garden; Range

having…; DHAULA KUAN; red light; truck broke; concerned DCP/Traffic; given below:…; DTC bus; one truck; HTV broke; bus broke; flyover towards; website http://delhitrafficpolice.nic.in; Traffic heavy;  floor Bus; towards Gurgaon; towards Ashram; 1700 Hrs; Akshardham mandir; Budha Jayanti; CHOWKI NO.; G.C. Dwivedi; Jal Board
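
A list like this can be approximated by counting adjacent word pairs. This is only a sketch using raw counts; a proper collocation finder would score pairs statistically rather than by frequency alone, and the sample tweets are stitched together from fragments shown later in this post.

```python
from collections import Counter

def top_bigrams(texts, n=5):
    """Count adjacent word pairs across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # zip(words, words[1:]) yields each word with its right-hand neighbour.
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)

tweets = [
    "Traffic obstruction at Raja Garden chowk",
    "Traffic obstruction at ASHRAM RED light due to a LGV broke down",
    "Traffic obstruction at kalandi kunj red light",
]
for pair, count in top_bigrams(tweets, 3):
    print(" ".join(pair), count)
```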

Word frequency

traffic                   ->                       157
obstruction               ->                       122
towards                   ->                        99
down.                     ->                        63
broken                    ->                        46
truck                     ->                        44
broke                     ->                        42
there                     ->                        31
flyover                   ->                        30
heavy                     ->                        17
nagar                     ->                        12
chowk                     ->                        11
contact                   ->                        11
garden                    ->                        11
specific                  ->                        11
suggestions/complaints,   ->                        11
container                 ->                        10
dhaula                    ->                        10
wazirabad                 ->                        10
ashram                    ->                         9
delhi                     ->                         9
vihar                     ->                         9
naraina                   ->                         8
cleared                   ->                         7
gurgaon                   ->                         7
hospital                  ->                         7
airport                   ->                         6
between                   ->                         6
having…                   ->                         6
mayapuri                  ->                         6
range                     ->                         6
staff                     ->                         6
aiims                     ->                         5
below:…                   ->                         5
break                     ->                         5
cleared.                  ->                         5
concerned                 ->                         5
dcp/traffic               ->                         5
e-mail                    ->                         5
flyover.                  ->                         5
given                     ->                         5
lajpat                    ->                         5
light                     ->                         5
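
A minimal sketch of how such a table can be produced. Entries like "down." and "flyover." above suggest the words were split on whitespace with the punctuation left on, so this version strips it, which would merge "cleared" and "cleared." into one entry. The minimum length of 5 is a guess from the fact that every word in the table has at least five characters.

```python
from collections import Counter
import string

def word_frequencies(texts, min_len=5):
    """Count lowercase words of at least min_len characters, ignoring edge punctuation."""
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            word = token.strip(string.punctuation)
            if len(word) >= min_len:
                counts[word] += 1
    return counts

sample = [
    "Traffic obstruction has been cleared.",
    "Obstruction cleared at Wazirabad flyover.",
]
for word, count in word_frequencies(sample).most_common():
    print(f"{word:<26}->{count:>26}")
```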

Tweets with “red light”: 7 matches (this shows I need to use the new API to get the time and complete text with the help of the tweet ID)

has been cleared at Kalindi kunj red light   Traffic obstruction at kalandi kun
ffic obstruction at kalandi kunj red light due to one truck broken down   Traff
rk    Traffic heavy at Wazirabad red light from Burari towards ISBT due to heav
ffic obstruction at SHASTRI PARK red light due to a truck broken down  http //b
J  Traffic obstruction at ASHRAM RED light due to a LGV broke down  http //bit
both carriageway at Subroto park red light  http //bit ly/d8Wilc  Traffic jam a
Rohini West Metro station due to red light point fault  http //bit ly/c0fXIg  T

Tweets with “obstruction” 25 of 122 matches:
rgaon  has been removed   Traffic obstruction at MADHUBAN CHOWK TOWARDS PEERA G
GAS TANKER broken down   Traffic obstruction between naraina flyover to mayapu
ue to a HTV broken down   Traffic obstruction at Subroto Park towards Gurgaon a
crane has been directed   Traffic obstruction at Ladosarai T-point towards Khan
oved traffic normalized   Traffic obstruction at Ladosarai T-point towards Khan
due to a tree uprooted   Traffic obstruction from Naraina to mayapuri has been
oved traffic normalized   Traffic obstruction at Majnu ka tilla towards Wazirab
irabad  has been removed  Traffic obstruction from Naraina to mayapuri due to a
a container broken down   Traffic obstruction at Majnu ka tilla towards Wazirab
a container broken down   Traffic obstruction has been cleared at Dhaula Kuan
n  21 19 PM 07-July-2010  Traffic obstruction at RTR flyover underpass towards
irport has been cleared   Traffic obstruction at  RTR vacant vihar  IIT towards
rport  has been cleared   Traffic obstruction at savitri flyover towards Nehru
u place has been removed  Traffic obstruction at RTR flyover underpass towards
damper has broken down   Traffic obstruction at  RTR vacant vihar  IIT towards
damper has broken down   Traffic obstruction at savitri flyover towards Nehru
ntainer has broken down   Traffic obstruction at Birtanya choke towards Punjabi
ue to one truck has broken down   obstruction has been cleared at Wazirabad fly
ed at Wazirabad flyover   Traffic obstruction at Wazerabad flyover due to one t
e truck has broken down   Traffic obstruction at Raja Garden chowk from Tilak n
Tanker has broken down   Traffic obstruction has been cleared at Kalindi kunj
Kalindi kunj red light   Traffic obstruction at kalandi kunj red light due to
over construction work    Traffic obstruction at NAGIA PARK  CHOWKI NO  2 SHAKT
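
Keyword-in-context listings like the two above can be reproduced with a few lines of plain Python. This is a sketch, not the tool that generated the output here; it assumes all the tweets have been joined into one long string, and the 25-character window is arbitrary.

```python
def concordance(text, keyword, width=25):
    """Return each occurrence of keyword with `width` characters of context on both sides."""
    lines = []
    lower, key = text.lower(), keyword.lower()
    start = lower.find(key)
    while start != -1:
        end = start + len(key)
        # Pad the left context so the keyword lines up in a column.
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width]
        lines.append(left + text[start:end] + right)
        start = lower.find(key, end)
    return lines

text = ("Traffic obstruction at ASHRAM RED light due to a LGV broke down. "
        "Traffic obstruction at kalandi kunj red light due to one truck broken down")
for line in concordance(text, "red light"):
    print(line)
```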

The challenge is getting normalized names (no spelling differences etc.), then pulling their geolocation from one of the APIs (mostly Google) and plotting them on something like a heat map to point out bottlenecks across a day.
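
For the spelling differences (e.g. "kalandi kunj" vs "Kalindi kunj", "Wazerabad" vs "Wazirabad"), one cheap option is fuzzy matching against a hand-made list of canonical names using Python's standard `difflib`. The list and the 0.8 cutoff below are assumptions for illustration; geocoding would be a separate step and is not shown.

```python
import difflib

# Hypothetical hand-maintained gazetteer of canonical place names.
CANONICAL = ["Kalindi Kunj", "Wazirabad", "Dhaula Kuan", "Raja Garden", "Naraina", "Mayapuri"]

def normalize_location(raw, canonical=CANONICAL, cutoff=0.8):
    """Map a raw place name to its closest canonical spelling, or None if nothing is close."""
    lookup = {name.lower(): name for name in canonical}
    matches = difflib.get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

for raw in ["kalandi kunj", "Wazerabad", "DHAULA KUAN"]:
    print(raw, "->", normalize_location(raw))
```

Counting the normalized names per hour of day would then give the numbers needed for a heat map.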

Looks like Naraina, Dhaula Kuan, RTR flyover, Wazirabad red light, Kalindi Kunj red light, Raja Garden, and NAGIA PARK are choke points. A more detailed analysis will follow another day, with locations, kinds of obstruction, accidents, and hopefully geolocations on a map.

Finally, I must get the Blogger crawler for Solr out :) what else. I should also finish getting the data on DTP’s FB postings about people who are booked and, hopefully, for what offence. Delhi police could post this data far more easily themselves in the form of OData for future use, rather than just as FB posts or tweets.

Overall, Delhi police must be congratulated for following up on citizen requests and posting the action taken back to FB. There is nothing expensive to maintain :) or buy, except mobile phones equipped with an FB/Twitter app. Surely there is some NIC app which stores the offenders etc., but interaction through Twitter/FB is outright simple.

