The name of an asset has no sanctity. No, seriously.
Here are the file names of the missing-persons reports for the last two years in Karnataka.
Feb_2015_Missing_report.pdf
Jan_2015_Missing_report.pdf
Dec_2014_Missing_report.pdf
Nov_2014_Missing_report.pdf
Oct_2014_Missing_report.pdf
SEP_2014_Missing_report.pdf
Missing_Report_Aug_2014.pdf
July_Missing_Report_2014.pdf
May_Missing_Report_2014.pdf
April_Missing_Report_2014.pdf
Missing_Report_March_2014.pdf
Missing_Report_Feb_2014.pdf
Missing_Report_Jan_2014.pdf
Missing_Report_December_2013.pdf
Missing_Report_November_2013.pdf
Missing_Report_October_2013.pdf
Missing_Report_September_2013.pdf
Missing_Report_August_2013.pdf
Missing_Report_July_2013.pdf
Missing_Report_June_2013.pdf
May_missing_13.pdf
April_Missing_2013.pdf
March_Missing_2013.pdf
They are all reached from the URL http://www.ksp.gov.in/home/crime/udr.php, with the structure shown in the pic. Note the names and the “links” to “missing”. Sadly, the actual file names do not follow one pattern. Big deal?
Absolutely not, for a person who loves cleaning data, a sort of OCD. This is like a godsend. “The one thing you had to do right.”
So what is broken: the process or the tool? There is a definite issue with keeping a “naming convention” simple and then following it. Why do people forget it? Because they are evil? No, because our tools make it difficult for them to contextualize the work at hand and follow all the “implied” rules.
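To make the mismatch concrete, here is a minimal Python sketch that pulls the month and year out of file names like the ones listed above and rewrites them to a single pattern. The target pattern `YYYY-MM_missing_report.pdf` and the function names are my own assumptions for illustration, not anything the site uses.

```python
import re

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]


def month_number(token):
    """Return 1-12 if token is a month name or a 3+ letter prefix of one."""
    if len(token) < 3:
        return None
    for number, name in enumerate(MONTH_NAMES, start=1):
        if name.startswith(token):
            return number
    return None


def parse_report_name(filename):
    """Return (year, month) found in a report file name, or None."""
    stem = filename.rsplit(".", 1)[0]
    tokens = [t.lower() for t in re.split(r"[_\W]+", stem) if t]
    year = month = None
    for token in tokens:
        if token.isdigit():
            if len(token) == 4:
                year = int(token)
            elif len(token) == 2:
                # Assumption: two-digit years such as "13" mean 20xx.
                year = 2000 + int(token)
        elif month is None:
            month = month_number(token)
    if year is None or month is None:
        return None
    return year, month


def canonical_name(filename):
    """Rewrite a file name to the hypothetical YYYY-MM_missing_report.pdf form."""
    parsed = parse_report_name(filename)
    if parsed is None:
        return None
    year, month = parsed
    return f"{year:04d}-{month:02d}_missing_report.pdf"
```

For example, `canonical_name("May_missing_13.pdf")` gives `2013-05_missing_report.pdf`. Names sorted this way also make a gap in the series easy to spot.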
There is the PDF, DOC, XLS world, which hides the data, and then there is data in HTML files. Absolutely priceless. Thanks to import.io, I can at least do these things in a jiffy to identify what I am getting out and the pattern.
Update: PDF extractors I use/try.
Excel and Word first: Excel for structured data, Word otherwise.
Apache PDFBox – Download page: http://pdfbox.apache.org/downloads.html
Tabula – Download page: http://tabula.nerdpower.org
PDF Extraction Toolkit – Download page: http://tamirhassan.com/pdfxtk.html
Poppler – Download page: http://poppler.freedesktop.org/
PDF2XML – Download page: http://sourceforge.net/projects/pdf2xml/
Xpdf – Download page: http://www.foolabs.com/xpdf/
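As a rough sketch of how one of these can be scripted: both Poppler and Xpdf ship a `pdftotext` command-line tool, and its `-layout` option tries to keep the physical layout of the page, which helps with tabular reports. The Python wrapper below is my own illustration and assumes `pdftotext` is installed and on the PATH; the function names are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the pdftotext argument list without running anything."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the original physical layout.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path):
    """Run pdftotext on one file; raises if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler or Xpdf first")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
```

Calling `extract_text("Feb_2015_Missing_report.pdf", "feb_2015.txt")` would write the extracted text next to the PDF. How usable that text is depends on how the PDF was produced; scanned pages will need OCR instead.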