Scraper Breakers

The name of an asset does not have any sanctity. No, seriously.

Following are the file names of the missing-persons reports for the last 2 years in Karnataka.

http://www.ksp.gov.in/download/Feb_2015_Missing_report.pdf
http://www.ksp.gov.in/download/Jan_2015_Missing_report.pdf
http://www.ksp.gov.in/download/Dec_2014_Missing_report.pdf
http://www.ksp.gov.in/download/Nov_2014_Missing_report.pdf
http://www.ksp.gov.in/download/Oct_2014_Missing_report.pdf
http://www.ksp.gov.in/download/SEP_2014_Missing_report.pdf
http://www.ksp.gov.in/download/Missing_Report_Aug_2014.pdf
http://www.ksp.gov.in/download/July_Missing_Report_2014.pdf
http://www.ksp.gov.in/download/June_Missing_Report_2014.rar
http://www.ksp.gov.in/download/May_Missing_Report_2014.pdf
http://www.ksp.gov.in/download/April_Missing_Report_2014.pdf
http://www.ksp.gov.in/download/Missing_Report_March_2014.pdf
http://www.ksp.gov.in/download/Missing_Report_Feb_2014.pdf
http://www.ksp.gov.in/download/Missing_Report_Jan_2014.pdf
http://www.ksp.gov.in/download/Missing_Report_December_2013.pdf
http://www.ksp.gov.in/download/Old%20Missing%20Data%20of%20UDR%20&%20Missing%20GZT.rar
http://www.ksp.gov.in/download/Missing_Report_November_2013.pdf
http://www.ksp.gov.in/download/Missing_Report_October_2013.pdf
http://www.ksp.gov.in/download/Missing_Report_September_2013.pdf
http://www.ksp.gov.in/download/Missing_Report_August_2013.pdf
http://www.ksp.gov.in/download/Missing_Report_July_2013.pdf
http://www.ksp.gov.in/download/Missing_Report_June_2013.pdf
http://www.ksp.gov.in/download/May_missing_13.pdf
http://www.ksp.gov.in/download/April_Missing_2013.pdf
http://www.ksp.gov.in/download/March_Missing_2013.pdf

They all hang off the URL http://www.ksp.gov.in/home/crime/udr.php, with the structure shown in the pic. Note the names and the “links” to “missing”. Sadly, the actual file names do not match each other. Big deal?

Absolutely not, for a person who loves cleaning data, a sort of OCD. This is like a godsend. “One thing you had to do right”.

So what is broken, the process or the tool? There is a definite issue with keeping a “naming convention” simple and following it. Why do people forget it? Because they are evil? No, because our tools make it difficult for them to contextualize the work at hand and follow all the “implied” rules.
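To make the mismatch concrete, here is a minimal Python sketch that tries to recover the year and month from the file names listed above. The function names (`month_number`, `parse_report_name`) and the rule that a two-digit year means 20xx are my own assumptions for illustration, not anything the site documents.

```python
import re
from urllib.parse import unquote, urlsplit

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]


def month_number(token):
    """Return 1-12 if token is a month name or a 3+ letter prefix of one."""
    if len(token) < 3:
        return None
    for number, name in enumerate(MONTH_NAMES, start=1):
        if name.startswith(token):
            return number
    return None


def parse_report_name(url):
    """Guess (year, month) from a report URL; return None if either is missing."""
    file_name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    stem = file_name.rsplit(".", 1)[0]
    # Split on anything that is not a letter or digit (underscores, spaces, "&").
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", stem) if t]

    month = None
    year = None
    for token in tokens:
        if token.isalpha() and month is None:
            month = month_number(token)
        elif token.isdigit() and year is None:
            if len(token) == 4:
                year = int(token)
            elif len(token) == 2:
                # Assumption: two-digit years belong to the 2000s.
                year = 2000 + int(token)

    if month is None or year is None:
        return None
    return (year, month)
```

With this, `Feb_2015_Missing_report.pdf`, `Missing_Report_December_2013.pdf` and `May_missing_13.pdf` all land in the same (year, month) shape, while the "Old Missing Data" archive comes back as `None` and still needs a human to look at it.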

Missing-report

There is the PDF, DOC, XLS world, which hides the data, and then there is data in HTML files. Absolutely priceless. Thanks to import.io, I can at least do these things in a jiffy to identify what I am getting out and the pattern.
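If you would rather see the pattern without a hosted tool, a rough sketch with Python's standard `html.parser` can pull the link text and the target out of a page. The class and function names here (`LinkCollector`, `extract_links`) are hypothetical, and I am not assuming anything about the real markup of the page beyond it containing ordinary `<a href>` links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the visible label.
            label = " ".join("".join(self._text).split())
            self.links.append((label, urljoin(self.base_url, self._href)))
            self._href = None


def extract_links(html, base_url):
    """Return every (label, absolute URL) pair found in the HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Printing the label next to the file name for each pair is usually enough to spot where the visible name and the actual file name drift apart.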

Update – PDF extractors I use or try:

Excel and Word – first choice. Excel for structured data and Word otherwise.

Apache PDFBox – Download page: http://pdfbox.apache.org/downloads.html

Tabula – Download page: http://tabula.nerdpower.org

PDF Extraction Toolkit – Download page: http://tamirhassan.com/pdfxtk.html

Poppler – Download page: http://poppler.freedesktop.org/

PDF2XML – Download page: http://sourceforge.net/projects/pdf2xml/

Xpdf – Download page: http://www.foolabs.com/xpdf/
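As an example of wiring one of these into a script, here is a small sketch that shells out to `pdftotext`, the command-line tool that ships with Poppler (Xpdf has one by the same name). It assumes `pdftotext` is installed and on the PATH; the helper names (`build_pdftotext_command`, `pdf_to_text`) are mine, and `-layout` is the option that tries to keep the original column layout, which helps with tabular reports.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext call."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical layout of the page.
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def pdf_to_text(pdf_path):
    """Convert one PDF to text next to it and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install the Poppler utilities")
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling `pdf_to_text("Feb_2015_Missing_report.pdf")` on a downloaded report would write `Feb_2015_Missing_report.txt` beside it. Scanned PDFs will come back mostly empty, since this does no OCR.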
