Friday, April 30, 2010

Web automation and HTML parsing with Python - Part 1: Fetching

Automating web browsing is useful for downloading, processing, or simply accessing information without having to click through a large number of pages, ads, etc.

The Python standard library offers many useful classes for accessing the web over HTTP(S).

In this article we present simple examples of web automation.

Fetching a web page

Accessing a web page can be done using the urllib2 module, which is included in the Python standard library.
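A minimal sketch of such a fetch might look like the following. It assumes Python 3, where the functionality of urllib2 lives in urllib.request; the helper name fetch, the timeout and the User-Agent value are illustrative assumptions, not part of the library.

```python
import urllib.request  # Python 3 home of what urllib2 provided in Python 2


def fetch(url, timeout=30):
    """Return the whole body of `url` as bytes (hypothetical helper)."""
    # Sending a User-Agent is optional; this value is only an illustration.
    request = urllib.request.Request(
        url, headers={"User-Agent": "python-fetch-example"}
    )
    # The response object is a context manager, so it is closed automatically.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://www.example.com/") (a placeholder URL) would return the page content as bytes, which can then be decoded and parsed.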



Handling cookies


The standard library module cookielib handles cookies. To illustrate this use case, we will use a stateful Python object which stores cookies in a file and loads them if the file exists.
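One possible shape for such an object is sketched below. It assumes Python 3, where cookielib is named http.cookiejar; the class name Browser, the default file name cookies.txt and the choice of LWPCookieJar as the on-disk format are assumptions made for illustration.

```python
import os
import urllib.request
from http.cookiejar import LWPCookieJar  # named cookielib in Python 2


class Browser:
    """Stateful fetcher that keeps cookies between runs (hypothetical class)."""

    def __init__(self, cookie_file="cookies.txt"):
        self.cookie_file = cookie_file
        self.jar = LWPCookieJar(cookie_file)
        # Reload the cookies saved by a previous run, if the file exists.
        if os.path.exists(cookie_file):
            self.jar.load(ignore_discard=True)
        # Requests made through this opener send and record cookies.
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar)
        )

    def fetch(self, url):
        """Fetch `url`, then persist the current cookies to disk."""
        with self.opener.open(url) as response:
            body = response.read()
        # ignore_discard=True also saves session cookies, which would
        # otherwise be skipped when writing the file.
        self.jar.save(ignore_discard=True)
        return body
```

Because the jar is saved after every fetch, a second script run that creates Browser() with the same file starts with the cookies of the first run, for example a login session.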



What about big files?


In the examples above, we download a web page and store it in a variable. This approach is not suitable for big files because the whole response content is held in memory: if you download a 500 MB file, your script will use 500 MB of memory.

To avoid this, we use another method for big files:
  1. Open the URL
  2. Read a fixed amount of data from the file descriptor
  3. Write those bytes to a file
  4. Return to step 2 until no data is left
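The steps above can be sketched as follows, assuming Python 3 (urllib.request instead of urllib2). The function name download and the 64 KiB chunk size are arbitrary illustrative choices.

```python
import urllib.request

CHUNK_SIZE = 64 * 1024  # bytes read per iteration; an arbitrary value


def download(url, destination, chunk_size=CHUNK_SIZE):
    """Stream `url` into the file `destination`; return the byte count."""
    total = 0
    # The URL is opened once; only the read/write steps are repeated.
    with urllib.request.urlopen(url) as response, \
            open(destination, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the response is exhausted
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

At any moment only one chunk is held in memory, so memory use stays roughly constant regardless of the file size.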



OK, cool: now we can download 1 TB files without having 1 TB of memory on our machine or swapping forever. Another useful feature is a progress bar that shows how far along we are.

Progress can be displayed with a print statement each time we fetch some data. However, printing progress every time we fetch a few bytes is useless and CPU-consuming.

One smart solution consists of starting a watcher which knows where we are and displays the progress. To do so, we implement a watcher class.
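A watcher of this kind could be sketched as a background thread, as below. This is an assumption-laden illustration in Python 3, not the original listing: the names Watcher, render, stop and download_with_progress, the refresh interval and the bar width are all hypothetical, and the total size is assumed to come from the Content-Length header.

```python
import datetime
import sys
import threading
import time
import urllib.request


class Watcher(threading.Thread):
    """Background thread that periodically prints download progress."""

    def __init__(self, total, interval=1.0, width=60):
        super().__init__(daemon=True)
        self.total = total        # expected size in bytes
        self.fetched = 0          # updated by the download loop
        self.interval = interval  # seconds between two refreshes
        self.width = width        # characters available for the bar
        self.started_at = time.time()
        self._stop_event = threading.Event()

    def render(self):
        """Build one progress line from the current counters."""
        ratio = self.fetched / self.total if self.total else 0.0
        bar = ("=" * int(ratio * self.width) + ">").ljust(self.width + 1)
        elapsed = time.time() - self.started_at
        if self.fetched:
            # Remaining time estimated from the average speed so far.
            remaining = elapsed * (self.total - self.fetched) / self.fetched
            eta = datetime.timedelta(seconds=remaining)
        else:
            eta = "unknown"
        return "[%s] %8dk/%8dk %f%% ETA %s" % (
            bar, self.fetched // 1024, self.total // 1024, ratio * 100, eta)

    def run(self):
        # wait() acts as the sleep and returns True once stop() is called.
        while not self._stop_event.wait(self.interval):
            sys.stdout.write("\r" + self.render())
            sys.stdout.flush()

    def stop(self):
        """Stop the thread and print the final state of the bar."""
        self._stop_event.set()
        if self.is_alive():
            self.join()
        sys.stdout.write("\r" + self.render() + "\n")


def download_with_progress(url, destination, chunk_size=64 * 1024):
    """Chunked download that reports progress through a Watcher."""
    with urllib.request.urlopen(url) as response, \
            open(destination, "wb") as out:
        # Assumes the server announces the size; 0 if it does not.
        total = int(response.headers.get("Content-Length", 0))
        watcher = Watcher(total)
        watcher.start()
        try:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                watcher.fetched += len(chunk)  # cheap update, no printing
        finally:
            watcher.stop()
    return watcher.fetched
```

The download loop only increments a counter, while the thread decides how often to redraw, so the display cost no longer depends on the chunk size.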



And then a fancy progress bar appears:
python bigfile.py 
[======>                                                      ]   168003k/ 1606432k 10.458146% ETA 0:01:17.143839


In this example, the watcher thread is launched just before the download of the big file starts. Each time we pass through the loop, we update the watcher thread's fetched variable, and the watcher thread updates its estimates: percentage, estimated download time, etc.