Fun scraping with CasperJS and PhantomJS

Recently, I have been playing around with CasperJS and PhantomJS for web scraping. I always find screen scraping fun and fascinating. I mean there are just so many applications:

  1. We have bills/accounts all over the place on different websites. Scraping tools can be used to build a program for personal use that combines the results in a single place. They can also be used to trigger notifications, e.g. bill payment reminders, new manga chapters, movie releases. The possibilities are just endless :)

  2. We want to find and compare prices across different websites quickly. It would be wonderful to find the cheapest air/hotel fare or holiday package without going through multiple sites like Zuji, Agoda, Expedia, etc.

In an ideal world, all websites would offer clean APIs and integrate with each other. But most of the time, this is not the case. And the data is surely legal for personal, non-commercial use :)

Ok, back to CasperJS and PhantomJS. PhantomJS is basically a headless WebKit browser without a UI. It is not exactly like Chrome though, as it uses the Qt WebKit codebase. As the library is still at an early stage, there are some bugs/inconsistencies across platforms. I had problems generating screenshots on CentOS 6 while the same code works absolutely fine on my MacBook Pro. Below is the screenshot issue I ran into when running the tech news example.
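For reference, a minimal PhantomJS screenshot script is only a few lines; something roughly like this (the URL and output file name are placeholders):

```javascript
// screenshot.js -- run with: phantomjs screenshot.js
var page = require('webpage').create();

// Fix the viewport so the render size is predictable across platforms
page.viewportSize = { width: 1024, height: 768 };

page.open('http://example.com/', function (status) {
    if (status !== 'success') {
        console.log('Failed to load the page');
        phantom.exit(1);
        return;
    }
    // Render the current page state to a PNG file
    page.render('example.png');
    phantom.exit();
});
```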

I see 3 major applications for PhantomJS: screen scraping, generating screenshots of portions of your website, and designing auto-bots :) While PhantomJS offers plenty of useful APIs including cookie management (WOW!! you can now easily manage logins for multiple websites with different cookies), CasperJS simplifies a lot of the work if we want to develop a screen-scraping bot that simulates the behavior of a real person.
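As a rough sketch of the cookie handling (the cookie name, value and domain here are made up; the `--cookies-file` flag persists cookies between runs):

```javascript
// cookies.js -- run with: phantomjs --cookies-file=cookies.txt cookies.js
var page = require('webpage').create();

// Pre-seed a session cookie before opening the page
phantom.addCookie({
    name: 'sessionid',
    value: 'some-session-token',   // placeholder value
    domain: 'example.com'
});

page.open('http://example.com/account', function (status) {
    // Cookies visible to this page, including any the site just set
    console.log(JSON.stringify(page.cookies, null, 2));
    phantom.exit();
});
```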

CasperJS offers a JavaScript steps library (similar to Async.js) that makes our code more readable. In addition, it simplifies form filling, button clicking and so on. As PhantomJS is actually a browser, we can design/inject scripts that manipulate the page HTML and wait for the browser to finish loading all its JavaScript before parsing. It is impossible to screen scrape certain AJAX sites with curl alone, as there is no JavaScript interpreter. The cool thing is that we can even inject the socket.io client script into the pages to communicate with and pass data back to our server if necessary!
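A sketch of what such a step-based script might look like, assuming a hypothetical login form and an AJAX-rendered `#bills` table:

```javascript
// bills.js -- run with: casperjs bills.js (URL and selectors are hypothetical)
var casper = require('casper').create();

casper.start('http://example.com/login');

// Step 1: fill and submit the login form (the trailing true submits it)
casper.then(function () {
    this.fill('form#login', {
        username: 'me',
        password: 'secret'
    }, true);
});

// Step 2: wait for the AJAX-rendered table before trying to parse it
casper.waitForSelector('#bills', function () {
    var rows = this.evaluate(function () {
        // This runs inside the page context, after its JavaScript has executed
        return [].map.call(document.querySelectorAll('#bills tr'), function (row) {
            return row.textContent.trim();
        });
    });
    this.echo(JSON.stringify(rows));
});

casper.run();
```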

At this point, my only complaint is that CasperJS/PhantomJS take too much memory. It took about 100 MB to load a simple site that I tested, even though I had already disabled image/script loading. This overhead is unacceptable and makes it very hard to run many scraping/testing jobs simultaneously. To work around this, we would need to inject iframe scripts to load multiple pages per instance. Alternatively, we can load the pages, then pass the cookies and HTML back to an external server (PHP or Node.js) for further processing. Overall, I love this library :) It is probably one of the most interesting projects I have seen since Node.js & Socket.io.
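For completeness, the image/plugin-disabling settings I mentioned map to CasperJS's `pageSettings`; a minimal sketch (example.com is a placeholder, and how much memory this actually saves is another question):

```javascript
// A lighter-weight Casper instance: skip images and plugins
var casper = require('casper').create({
    pageSettings: {
        loadImages: false,   // do not download images
        loadPlugins: false   // disable Flash and other plugins
    }
});

casper.start('http://example.com/', function () {
    this.echo(this.getTitle());
});

casper.run();
```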
