Posts tagged with web crawling

  • List of open source screen scraping tools

    I love data scraping :) So far I have heard of the following scraping tools, and I hope to have time to look at more of them in the future:

    PHP Scraping Tools

    Java Scraping Tools


    • CasperJS, PhantomJS
    • Readability
    • Node.js request
    • Cheerio

    Screen Scraping APIs

    I have only used curl, simplehtmldom, tidy, and casperjs/phantomjs. They are quite easy and straightforward to use. Diffbot...

  • Dealing with dynamic dropdowns with CasperJS

    In my experiment with CasperJS to extract data from an aspx page, I ran into some issues with dynamic drop-downs. A page can have 2-3 dropdown boxes that depend on each other, e.g. when the user selects a category in dropdown1, an AJAX request is triggered to create and populate the sub-categories in dropdown2.

    My first reaction to this problem was to use the Chrome Network tool to capture the POST request sent when the form is submitted and find out all the parameters. Then, I attempted to simulate this by filling the form with all the...
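
    As a rough illustration of the "capture the POST parameters and replay them" idea, here is a minimal sketch in plain JavaScript. It assumes the aspx page is a standard ASP.NET WebForms page, where hidden state fields such as __VIEWSTATE and __EVENTVALIDATION have to be sent back unchanged and __EVENTTARGET names the control that triggered the postback. The helper buildPostBody, the control name ddlCategory, and all values are hypothetical and not taken from the original page.

```javascript
// Hypothetical sketch: buildPostBody, ddlCategory and all values are
// made up for illustration; real names come from the captured request.
function buildPostBody(hiddenFields, selections, eventTarget) {
  const params = new URLSearchParams();

  // Echo back the hidden state fields scraped from the page
  // (e.g. __VIEWSTATE, __EVENTVALIDATION) exactly as received.
  for (const [name, value] of Object.entries(hiddenFields)) {
    params.append(name, value);
  }

  // __EVENTTARGET tells the server which control caused the postback.
  params.set('__EVENTTARGET', eventTarget);

  // Add the dropdown values the user would have selected.
  for (const [name, value] of Object.entries(selections)) {
    params.append(name, value);
  }

  // URL-encoded string suitable for an
  // application/x-www-form-urlencoded request body.
  return params.toString();
}

const body = buildPostBody(
  { __VIEWSTATE: 'abc/123=', __EVENTVALIDATION: 'xyz' },
  { ddlCategory: '5' },
  'ddlCategory'
);
console.log(body);
```

    The resulting string can be compared against the body shown in the Network tool to check that no parameter is missing before replaying the request.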

  • Fun scraping with casperjs and phantomjs

    Recently, I have been playing around with CasperJS and PhantomJS for web scraping. I always find screen scraping fun and fascinating. I mean there are just so many applications:

    1. We have bills/accounts scattered across different websites. Scraping tools can be used to develop a program for personal use that combines the results in a single place. They can also be used to trigger notifications, e.g. bill payment reminders, manga notifications, movie notifications. The possibilities are just endless :)

    2. We want to find and compare the...

  • My first spider :)

    I just wrote my very first web spider, which first crawls the mangafox and mangastream websites. Then, it emails me automatically about new releases of the manga that I am currently following. This is fun :)

    I basically make use of simple parsing functions and the cURL and tidy extensions:

    1) Get the HTML using cURL

    public static function http($target, $ref, $method, $data_array, $incl_head)
    {
        # Initialize PHP/CURL handle
        $ch = curl_init();

        # Process data, if present
        if(is_array($data_array))
        {
    ...