Heard another RTI activist was shot in Ahmedabad. Wanted to look up the number of people killed or attacked for RTI activity. Unfortunately, a lot of the results talk about the same incident. I am interested in unique incidents, and I got there by changing the date range, which is a kludge. Crawling should not be the only thing search engines do; they have enough information to cluster results into a “bunch” that refers to the same incident.
I can see the challenge, though: an article may refer to an older incident, which affects the ability to cluster. Another challenge would be an article summarizing the attacks on RTI applicants over the years, but these will be very few anyway.
Apparently a Google Predict account or Apache Mahout should enable this. What is involved?
1. Identifying groups/clusters
2. Cluster identification on common name, date, location: extract the date of crawl, date of publishing of the article, source, and name of the person involved in the attack or death. Name extraction can be accomplished by phrase analysis of the sentence mentioning the two words near each other and looking up the subject. A similar thing needs to be done for Delhi traffic police tweets. Location extraction is also part of the earlier process. If trusted sources provide the data, a pattern can be assumed: Location. Date. “Phrase describing the incident”, followed by a lot of extraneous information.
3. Disambiguating based on exclusive notification rather than a summary/article on scores of similar incidents
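To make the steps above a bit more concrete, here is a rough Python sketch. It assumes every report starts with the “Location. Date. Phrase” pattern and that the date is written as YYYY-MM-DD, which is my simplification; real datelines vary a lot. The names `parse_report`, `cluster_reports` and the 3-day window are made up for illustration, name extraction is left out, and the sample reports are placeholders, not real incidents.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed dateline layout: "Location. YYYY-MM-DD. Phrase describing the incident ..."
DATELINE = re.compile(
    r"^(?P<location>[^.]+)\.\s*(?P<date>\d{4}-\d{2}-\d{2})\.\s*(?P<phrase>.+)$"
)


@dataclass
class Report:
    location: str
    day: date
    phrase: str


def parse_report(text):
    """Return a Report if the text follows the assumed dateline, else None."""
    match = DATELINE.match(text.strip())
    if match is None:
        return None
    return Report(
        location=match.group("location").strip().lower(),
        day=date.fromisoformat(match.group("date")),
        phrase=match.group("phrase").strip(),
    )


def cluster_reports(reports, window_days=3):
    """Greedy clustering: a report joins the first cluster that has the same
    location and whose latest report is at most window_days older."""
    clusters = []
    for report in sorted(reports, key=lambda r: r.day):
        for cluster in clusters:
            same_place = cluster[0].location == report.location
            close_in_time = report.day - cluster[-1].day <= timedelta(days=window_days)
            if same_place and close_in_time:
                cluster.append(report)
                break
        else:
            clusters.append([report])
    return clusters


# Placeholder reports, not real incidents.
texts = [
    "CityA. 2010-07-20. Activist attacked near court.",
    "CityA. 2010-07-21. Follow-up on the attack on the activist.",
    "CityB. 2010-01-13. Activist attacked.",
    "An article without the assumed dateline",
]
reports = [r for r in map(parse_report, texts) if r is not None]
clusters = cluster_reports(reports)
```

Each list in `clusters` is then one “bunch”; counting unique incidents is just `len(clusters)`. Articles that summarize many incidents (point 3) would still need separate handling, since they do not fit the dateline pattern.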
Trusted source information: rather than looking at all results, search should be directed across 2-3 places and information extracted from them, e.g. Reuters/BBC/PTI/Xinhua.
Must mention: YQL/Yahoo Pipes are like dynamic languages, with instant gratification and visible work.
BTW must mention: YouTube’s “featured video” suggests “cannibalism video from bbc” while watching videos on Vitthal (God). Could not explain to mom exactly what the science behind that one was :).
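A minimal sketch of the trusted-source idea, assuming the search results come back as plain URLs. The domain list is my guess at the sites for these agencies and should be checked before use; `is_trusted` is a hypothetical helper, not part of any search API.

```python
from urllib.parse import urlparse

# Assumed domains for the trusted sources; verify before relying on them.
TRUSTED_DOMAINS = {"reuters.com", "bbc.co.uk", "bbc.com", "ptinews.com", "xinhuanet.com"}


def is_trusted(url):
    """True if the URL's host is a trusted domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)


def keep_trusted(urls):
    """Filter a list of result URLs down to the trusted sources."""
    return [u for u in urls if is_trusted(u)]
```

Matching on the parsed hostname rather than on the raw string avoids being fooled by a trusted name that only appears in the path of some other site.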