The Python standard library offers many useful classes for accessing the web over the HTTP(S) protocol.
In this article we will present simple examples of web automation.
Fetching a web page
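A minimal sketch of fetching a page into a variable (shown with urllib.request, the Python 3 successor of the urllib2 module this article uses; the function name and URL are illustrative):

```python
import urllib.request  # Python 3 successor of the urllib2 module

def fetch_page(url):
    # Open the URL and read the whole response body into memory
    with urllib.request.urlopen(url) as response:
        return response.read()  # bytes of the complete page

# Example (any reachable URL works):
# html = fetch_page("http://www.example.com/")
```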
Accessing a web page can be done with the urllib2 module, which is included in the standard Python library.

Handling cookies
The standard Python library cookielib allows cookies to be handled. To illustrate this use case we will use a stateful Python object which stores cookies in a file and loads them if they exist.
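A sketch of such a stateful object, using http.cookiejar (the Python 3 name of the cookielib module named above); the class name CookieSession and the cookie-file path are illustrative:

```python
import os
import urllib.request
from http.cookiejar import MozillaCookieJar  # Python 3 home of cookielib's jars

class CookieSession:
    """Stateful helper that persists cookies to a file between runs.
    (Hypothetical name; a sketch of the pattern, not the article's code.)"""

    def __init__(self, cookie_file="cookies.txt"):
        self.jar = MozillaCookieJar(cookie_file)
        if os.path.exists(cookie_file):
            # Reload cookies saved by a previous run
            self.jar.load(ignore_discard=True, ignore_expires=True)
        # All requests made through this opener share the jar
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar))

    def fetch(self, url):
        response = self.opener.open(url)
        data = response.read()
        # Persist any cookies the server set during this request
        self.jar.save(ignore_discard=True, ignore_expires=True)
        return data
```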
What about big files?
In the examples above we download web pages and store them in a variable. This approach is not suitable for big files, because the entire response body is held in memory: downloading a 500 MB file makes the script use 500 MB of memory.
To avoid this, we use another method for big files:
- Open the URL
- Read a fixed amount of bytes from the response
- Write those bytes to a file
- Return to step 2 until the response is exhausted
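The steps above can be sketched as follows (the function name and chunk size are illustrative):

```python
import urllib.request

def download(url, dest, chunk_size=64 * 1024):
    """Stream url to the file dest in fixed-size chunks, so memory
    use stays bounded no matter how large the download is."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        while True:
            chunk = response.read(chunk_size)  # read a bounded amount
            if not chunk:                      # empty read means end of stream
                break
            out.write(chunk)                   # append the bytes to the file
```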
OK, cool: now we can download a 1 TB file without having 1 TB of memory on our machine or swapping forever. But another useful feature is a progress bar, so we know how far along we are.
Progress could be displayed with a print statement each time we fetch some data. However, printing progress every time we fetch a few bytes is useless and CPU-consuming.
One smart solution consists in invoking a watcher which knows how far along we are and displays the progress. To do so we implement a watcher class.
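A sketch of such a watcher class (names are illustrative and the article's own class also draws the [=====> ] bar, which this sketch omits):

```python
import threading
import time

class Watcher(threading.Thread):
    """Background thread that reports download progress once per
    interval. The download loop only has to update `self.fetched`;
    the watcher computes the percentage and an ETA on its own."""

    def __init__(self, total_bytes, interval=1.0):
        super().__init__(daemon=True)
        self.total = total_bytes
        self.fetched = 0
        self.interval = interval
        self._stopped = threading.Event()
        self._start_time = time.monotonic()

    def run(self):
        # wait() doubles as the sleep and the stop signal
        while not self._stopped.wait(self.interval):
            print(self.status())

    def status(self):
        elapsed = time.monotonic() - self._start_time
        pct = 100.0 * self.fetched / self.total if self.total else 0.0
        rate = self.fetched / elapsed if elapsed > 0 else 0.0
        remaining = (self.total - self.fetched) / rate if rate > 0 else 0.0
        return "%dk/%dk %.6f%% ETA %.0fs" % (
            self.fetched // 1024, self.total // 1024, pct, remaining)

    def stop(self):
        self._stopped.set()

# Typical wiring inside the chunked-download loop:
#     watcher = Watcher(total_bytes=content_length)
#     watcher.start()                # launched just before the download
#     ...
#     watcher.fetched += len(chunk)  # updated on every pass of the loop
#     ...
#     watcher.stop()
```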
And then a fancy progress bar appears:
python bigfile.py [======> ] 168003k/ 1606432k 10.458146% ETA 0:01:17.143839
In this example the watcher thread is launched just before downloading the big file. Each time we pass through the loop we update the watcher thread's fetched variable, and the watcher thread updates its estimates: percentage, estimated remaining time, etc.