Scraper Breakers

Name of an asset does not have any sanctity. No seriously.

Following are the file names for missing folks for last 2 years in Karanataka.

They all respond to the url with structure in pic . Note the names and the “links” to “missing”. Sadly actually file names mismatch, Big Deal ?

Absolutely not for a person who loves cleaning the data , a sort of OCD. This is like God Sent. “One thing you had to do right”.

So what is broken? Process or Tool. There is definite issue of simplicity of “naming convention” and following it. Why do people people forget it, because they are evil? no because our tools make it difficult for them to contextualize the work at hand and follow all “implied” rules.


There is PDF, DOC, XLS world which hides the data and then there is data in html files. Absolutely priceless. Thanks to – I can at least do these things in jiffy to identify what I am getting out and the pattern.

Update – pdf extractors I use/try.

Excel and word – 1st. Excel for structured data and Word otherwise.

 Apache PDFBox – Download page:

Tabula – Download page:

PDF Extraction Toolkit – Download Page:

Poppler –  Download page:

PDF2XML – Download Page:

Xpdf  – Download Page:

Scraper Breakers

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s