Friday, April 30, 2010

Web automation and HTML parsing with Python - Part 1: Fetching

Automating web browsing is useful to download, process, or simply access information without having to click through a large number of pages, ads, and so on.

The Python standard library offers many useful modules for accessing the web over HTTP(S).

In this article we present simple examples of web automation.

Fetching a web page

A web page can be fetched with the urllib2 module, which is part of the Python standard library.
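
As a minimal sketch of such a fetch: the example below assumes Python 3, where urllib2 was merged into urllib.request. The function name fetch and the timeout value are illustrative choices, not part of the original article.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw body of `url` as bytes (hypothetical helper)."""
    # urlopen returns a file-like response object; the `with` block
    # closes the connection once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://www.example.com/") would return the page content as a bytes object, which can then be decoded or passed to a parser.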



Handling cookies


The standard library module cookielib handles cookies. To illustrate this use case, we will use a stateful Python object which stores cookies in a file and loads them if the file exists.
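
A possible shape for such an object is sketched below. It assumes Python 3, where cookielib is named http.cookiejar; the class name CookieFetcher and the choice of the LWP cookie file format are assumptions made for illustration.

```python
import http.cookiejar
import os
import urllib.request


class CookieFetcher:
    """Fetch pages while keeping cookies in a file (hypothetical helper)."""

    def __init__(self, cookie_file):
        self.jar = http.cookiejar.LWPCookieJar(cookie_file)
        # Reload cookies saved by a previous run, if any.
        if os.path.exists(cookie_file):
            self.jar.load(ignore_discard=True)
        # The opener sends stored cookies and records new ones.
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar))

    def fetch(self, url):
        with self.opener.open(url) as response:
            body = response.read()
        # Persist cookies, including session cookies, after each request.
        self.jar.save(ignore_discard=True)
        return body
```

Creating a second CookieFetcher with the same file name picks up the cookies written by the first one, so a login session can survive between script runs.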



What about big files?


In the examples above we download a web page and store it in a variable. This approach is not suitable for big files because the whole response is held in memory: if you download a 500 MB file, your script will use 500 MB of memory.

To avoid this, we use another method for big files:
  1. Open the URL
  2. Read a fixed amount of data from the file descriptor
  3. Write those bytes to a file
  4. Return to step 2 until no data is left
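
The steps above can be sketched as follows, assuming Python 3 and urllib.request. The function name download and the 64 KB chunk size are illustrative assumptions.

```python
import urllib.request

CHUNK_SIZE = 64 * 1024  # bytes read per iteration (arbitrary choice)


def download(url, destination, chunk_size=CHUNK_SIZE):
    """Stream `url` into the file `destination`; return the byte count."""
    fetched = 0
    with urllib.request.urlopen(url) as response, \
            open(destination, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the body is exhausted
                break
            out.write(chunk)
            fetched += len(chunk)
    return fetched
```

Memory use stays bounded by the chunk size, whatever the size of the downloaded file.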



OK, cool: now we can download 1 TB files without having 1 TB of memory on our machine or swapping forever. Another useful feature is a progress bar that shows where we are.

Progress can be displayed with a print statement each time we fetch some data. However, printing progress each time we fetch a few bytes is useless and CPU consuming.

One smart solution consists in starting a watcher which knows where we are and displays the progress at regular intervals. To do so we implement a watcher class.
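
A watcher along these lines could look like the sketch below. It assumes Python 3; the class name Watcher, the attribute fetched, the refresh interval and the bar layout are assumptions chosen to resemble the output shown in this article, not the original listing.

```python
import datetime
import sys
import threading
import time


class Watcher(threading.Thread):
    """Periodically print download progress (hypothetical helper)."""

    def __init__(self, total, interval=0.5, width=40, out=sys.stderr):
        super().__init__(daemon=True)
        self.total = total        # expected size in bytes
        self.fetched = 0          # updated by the download loop
        self.interval = interval  # seconds between refreshes
        self.width = width
        self.out = out
        self.started_at = time.monotonic()
        self._stop_event = threading.Event()

    def render(self):
        ratio = self.fetched / self.total if self.total else 0.0
        bar = ("=" * int(ratio * self.width) + ">").ljust(self.width + 1)
        elapsed = time.monotonic() - self.started_at
        rate = self.fetched / elapsed if elapsed > 0 else 0.0
        if rate > 0:
            seconds = int((self.total - self.fetched) / rate)
            eta = str(datetime.timedelta(seconds=seconds))
        else:
            eta = "unknown"
        return "[%s] %8dk/%8dk %6.2f%% ETA %s" % (
            bar, self.fetched // 1024, self.total // 1024, ratio * 100, eta)

    def run(self):
        # wait() returns True as soon as finish() is called.
        while not self._stop_event.wait(self.interval):
            self.out.write("\r" + self.render())
            self.out.flush()
        self.out.write("\r" + self.render() + "\n")

    def finish(self):
        self._stop_event.set()
        self.join()
```

The download loop only has to assign watcher.fetched after each chunk; the display cost is paid once per interval instead of once per read.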



And then a fancy progress bar appears:
python bigfile.py 
[======>                                                      ]   168003k/ 1606432k 10.458146% ETA 0:01:17.143839


In this example the watcher thread is launched just before downloading the big file. Each time we pass through the loop we update the watcher thread's fetched variable, and the watcher thread updates its estimates: percentage, estimated download time, and so on.

Tuesday, December 15, 2009

Monday, December 7, 2009

Getting started with openFrameworks on OS X

openFrameworks is a wonderful API for creative computing. Getting started with OF can be hard because there is no standard way to build the framework.

I worked on the openFrameworks build system by adding CMake build support. The result can produce a standard OS X framework and makes the build easy to set up.

The first thing to do is to install the openFrameworks dependencies.

Installing dependencies

Set up the openFrameworks install prefix:

sudo mkdir -p /usr/local/include
sudo mkdir -p /usr/local/lib

RtAudio

RtAudio is a cross-platform API to deal with sound. Download RtAudio from http://www.music.mcgill.ca/~gary/rtaudio/release/rtaudio-4.0.6.tar.gz

Compile RtAudio:
 
./configure
make CFLAGS=-m32
sudo cp librtaudio.a /usr/local/lib
sudo cp RtAudio.h  RtError.h /usr/local/include

Poco

Download Poco from http://pocoproject.org/download/

Before compiling Poco, edit ./build/config/Darwin and add -m32 to CXXFLAGS:

--------------------------------------
CXXFLAGS = -Wall -Wno-sign-compare -m32
--------------------------------------

Then compile Poco:

./configure --no-tests --no-samples Darwin
export POCO_BASE=$PWD

for i in CppUnit Foundation XML Net Util; do
    (cd $i ; make static_release)
done 
 
 
Install poco:

for i in Foundation XML Net Util; do
    sudo cp -rf $i/include/* /usr/local/include/
done
sudo cp lib/Darwin/i386/lib*.a /usr/local/lib

FreeImage

Download FreeImage from http://freeimage.sourceforge.net/download.html

cd FreeImage

# edit Makefile.osx to change paths to SDK. For example on my 10.6 (snow leopard):
-------------------------------------
INCLUDE_PPC = -isysroot /Developer/SDKs/MacOSX10.6.sdk
INCLUDE_I386 = -isysroot /Developer/SDKs/MacOSX10.6.sdk 
--------------------------------------
make 

sudo cp Source/FreeImage.h /usr/local/include/
sudo cp libfreeimage.a /usr/local/lib/libFreeImage.a

Fmodex

Download Fmodex from http://www.fmod.org/index.php/download

Install the package.

After installation, the package should be located in /Developer/FMOD Programmers API Mac.
Copy the libraries and include files to /usr/local:

cd /Developer/FMOD\ Programmers\ API\ Mac/
sudo cp api/inc/* /usr/local/include
sudo cp api/lib/* /usr/local/lib

GLee

Download GLee from http://www.opengl.org/sdk/libs/GLee/

mkdir GLee
cd GLee
tar zxvf ../GLee-5.4.0-src.tar.gz
./configure CXXFLAGS="-m32 -framework CoreFoundation -framework OpenGL"
make
sudo cp GLee.h /usr/local/include
sudo cp libGLee.so /usr/local/lib 

Building openFrameworks

Download openFrameworks from my unofficial GitHub repository. The repository includes CMake files and a patch that makes openFrameworks compatible with the latest RtAudio API.

git clone git://github.com/dopuskh3/openFrameworks.git
cd openFrameworks
cmake -DCMAKE_CXX_FLAGS=-m32 -DCMAKE_INSTALL_PREFIX=/usr/local
make
sudo make install 

Starting a sample project

Now we can start a new project in Xcode.

 



 
Drag the sample files into the project to add the sample source code. Take these source files from the openFrameworks example advancedGraphicExample:
 

Add a new build target:
 





Setup include paths:
  Add openFrameworks.framework and OpenGL.framework to your project:
 
 




Drag the project source code into the target:


Click "build and run":  
 
 

 

Monday, October 5, 2009

Using pylucene to index audio files

Lucene is a quite efficient full-text indexing solution. I used it to index my audio file tags so that I can launch mplayer or another command-line audio player without resorting to complex and time-consuming 'find' commands to build playlists.
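
To illustrate the idea of indexing tags for playlist building, here is a small stand-in written in plain Python. It is not PyLucene and does not reproduce the script described in this post: it builds a tiny inverted index by hand, and the class name TagIndex, its methods and the sample field names are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a tag value into lowercase words."""
    return re.findall(r"\w+", text.lower())


class TagIndex:
    """Toy inverted index: term -> set of file paths (not Lucene)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, path, tags):
        # Index every word twice: bare ("love") and per field ("title:love").
        for field, value in tags.items():
            for term in tokenize(value):
                self.postings[term].add(path)
                self.postings[field + ":" + term].add(path)

    def search_any(self, terms):
        """Return the sorted paths matching at least one term (an OR query)."""
        hits = set()
        for term in terms:
            hits |= self.postings.get(term.lower(), set())
        return sorted(hits)
```

Printing the result of search_any(["love", "hate"]) one path per line gives a file that a player can read as a simple playlist. A real full-text engine such as Lucene adds query parsing, ranking and an on-disk index on top of this basic structure.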

Here is a quick'n'dirty solution:


The search function is quite simple too:



A quick demo:
Indexing music database:

time ./tsearch.py index /home/fv/music/SANE /home/fv/musicindex
16701 files indexed 
Done 16702
./tsearch.py index /home/fv/music/SANE /home/fv/musicindex  21,67s user 10,18s system 3% cpu 14:14,36 total

Searching:

time ./tsearch.py search "love OR hate" /home/fv/musicindex > playlist.m3u
./tsearch.py search "love OR hate" /home/fv/musicindex  0,52s user 0,09s system 14% cpu 4,310 total

A more complex search:


time ./tsearch.py search "love OR hate OR (title: rain in blood AND artist: slayer)" /home/fv/musicindex > playlist.m3u
./tsearch.py search  /home/fv/musicindex  0,48s user 0,07s system 74% cpu 0,730 total

This code sample is not perfect; consider it a proof of concept rather than a ready-to-use solution.

The full code:



The next step is to clean up my audio file tags by retrieving tags from the Last.fm web service and putting them in the "genre" tag, then to automatically retrieve song lyrics and index them using the same method.

Monday, September 21, 2009

Reading RSS feeds in GNU screen

Like a lot of system administrators, I use GNU screen most of the time.

GNU screen is a really wonderful tool to keep your working sessions alive for days.

Screen can be configured to have a status line which can display various standard format strings. A particularly useful configuration directive is the backtick directive. This directive runs a custom command and updates the hardstatus line each time the command emits a new line.

Using this feature I developed screenfeeder: an RSS reader which displays RSS feed entries in my hardstatus line.
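
The kind of command a backtick can run is sketched below: a loop that writes one feed title per line and flushes after each one, so that screen sees every new line immediately. This is only an illustration under stated assumptions (Python 3, RSS 2.0 input already available as a string); the function names item_titles and emit are hypothetical and this is not the screenfeeder source.

```python
import sys
import time
import xml.etree.ElementTree as ET


def item_titles(rss_text):
    """Return the titles of all <item> elements of an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="").strip()
            for item in root.iter("item")]


def emit(titles, out=sys.stdout, delay=10.0, rounds=None):
    """Write one title per line, pausing `delay` seconds between lines.

    With rounds=None the loop never ends, which suits a long-running
    backtick command.
    """
    done = 0
    while rounds is None or done < rounds:
        for title in titles:
            out.write(title + "\n")
            out.flush()  # without flushing, output may sit in a buffer
            time.sleep(delay)
        done += 1
```

The explicit flush matters because standard output is usually block-buffered when it is not attached to a terminal.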

Configuring your ~/.screenrc

Here is my hardstatus and backtick configuration:

#Run the rss reader as #42 backtick 
backtick 42 0 0 "/home/fv/.screenfeeder/screenfeeder" "/home/fv/.screenfeeder/feeds"
# CTRL+A f open the webbrowser to the current feed entry url
bind f screen -t 'rss_feed' 10 /home/fv/.screenfeeder/screenfeeder
# display the #42 backtick in the hardstatus line 
hardstatus alwayslastline "%{+b kw}%H%{kg}|%c|%42`|%{ky}%d.%m.%Y|%{kr}(load:%l)%-0=%{kw}"

Copy the screenfeeder script to ~/.screenfeeder/screenfeeder.
Put your favorite feeds into ~/.screenfeeder/feeds, one feed per line:

http://www.openframeworks.cc/forum/rss.php
http://www.lemonde.fr/rss/une.xml
http://linuxfr.org/backend/news/rss20.rss

The script works by saving the current URL into a file whose name starts with ~/currenturl- and ends with the parent process ID (the ID of the current screen process). When screenfeeder is invoked without arguments from the same screen process, it looks up its parent process ID to find this file and calls the webbrowser module to open the URL stored in it.

As a result, when you use the CTRL-a f shortcut, the browser should open the current item.
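
The mechanism can be sketched as follows, assuming Python 3. The function names and the optional directory argument are hypothetical; only the currenturl- prefix, the use of the parent process ID and the webbrowser module come from the description above.

```python
import os
import webbrowser


def url_file(directory=None):
    """Path of the per-session file, keyed by the parent process ID."""
    directory = directory or os.path.expanduser("~")
    return os.path.join(directory, "currenturl-%d" % os.getppid())


def save_current_url(url, directory=None):
    """Called by the feed loop each time a new entry is displayed."""
    with open(url_file(directory), "w") as handle:
        handle.write(url)


def load_current_url(directory=None):
    with open(url_file(directory)) as handle:
        return handle.read().strip()


def open_current_url(directory=None):
    """Called when the script is started without arguments."""
    return webbrowser.open(load_current_url(directory))
```

Both invocations are children of the same screen process, so os.getppid() yields the same value in each and they agree on the file name without any other communication.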


screenfeeder demo:


Screenfeeder from dopuskh3 on Vimeo.


References:
screen manual page
screenfeeder project page on github

Saturday, September 19, 2009

Building OpenFrameworks with CMake

Why?

openFrameworks programs can be hard to build, particularly when you are not using Xcode or Code::Blocks.

I personally use Linux (Ubuntu) and prefer to program with my favorite editor (vim).


I decided a long time ago to create a CMake skeleton to build openFrameworks easily. I chose CMake because I thought it could one day be useful for the ofx community, where most people use Xcode or Code::Blocks. CMake can generate Code::Blocks and Xcode project files.

Prerequisites

You can download the openFrameworks CMake skeleton from my GitHub page.


git clone git://github.com/dopuskh3/ofx-cmake-build.git  
  

Edit CMakeBase.txt

This file provides all necessary checks to build openframeworks source code:

  • finds necessary include files
  • finds libraries (fmodex, poco, unicap...)
Edit this file to reflect your system configuration and the location of the openFrameworks source code. Here is my configuration:
#######################################################################################################
########### Configuration Vars #########################################################################
########################################################################################################

# The path where the openFrameworks sources are located
set ( ofx_sources_directory "/home/fv/Dev/openFrameworks/ofx-dev/libs/openFrameworks" )

# Additional include directories 
set ( custom_include_dirs "/usr/include/libavformat;/usr/include/libavcodec;/usr/include/libswscale")

# Poco include path where to find Poco/Poco.h and libPocoFoundation                                                            
set (poco_includes "/usr/include" )
set (poco_libdir " ")

# GLee include path 
set (glee_includes "/home/fv/Dev/openFrameworks/ofx-dev/libs/GLee/include" )
set (glee_libdir   "/home/fv/Dev/openFrameworks/ofx-dev/libs/GLee/lib" )

# FModex 
set (fmodex_includes "/home/fv/Dev/openFrameworks/ofx-dev/libs/fmodex/inc" )
set (fmodex_libdir   "/home/fv/Dev/openFrameworks/ofx-dev/libs/fmodex/lib/linux/" )

# RtAudio
set (rtaudio_includes "/home/fv/Dev/openFrameworks/ofx-dev/libs/rtAudio/include")
set (rtaudio_libdir   "/home/fv/Dev/openFrameworks/ofx-dev/libs/rtAudio/lib")

# FreeImage 
set ( freeimage_includes "/home/fv/Dev/openFrameworks/ofx-dev/libs/freeimage/include" )
set ( freeimage_libdir   "/home/fv/Dev/openFrameworks/ofx-dev/libs/freeimage/lib" )

# For linux only 
set (unicap_includes "/home/fv/Dev/openFrameworks/ofx-dev/libs/unicap/include")
set (unicap_libdir   "/home/fv/Dev/openFrameworks/ofx-dev/libs/unicap/lib")

set (asound_includes "")
set (asound_libdir "")

set (raw1394_includes "")
set (raw1394_libdir "")

....
  

Creating a project from scratch

Create a subdirectory to store your project:
cd ofx-cmake-build/
mkdir myOfxProject/
mkdir myOfxProject/src
  
Copy the provided CMakeFiles.txt into your project root directory. Edit this file to change the project name and add some dependencies:
cd myOfxProject/
cp /path/to/ofx-cmake-build/sampleProgram/CMakeFiles.txt . 
 
Edit your CMakeFiles.txt to suit your freshly created project.

CMakeFiles.txt:
cmake_minimum_required(VERSION 2.6)                                                                                            

# project name
project(myOfxProject)

# path to CMakeBase.txt file
include ( ../CMakeBase.txt ) 

# add ofx includes directories for dependencies defined in CMakeBase.txt
include_directories ( ${ofx_includes} ) 

file ( GLOB_RECURSE app_sources_files src/*)

add_executable( myOfxProject
    ${app_sources_files}
    ${OFX_SOURCE_FILES} ) # Defined in CMakeBase.txt

set ( libs ${ofx_libs}) # Defined in CMakeBase.txt

target_link_libraries(myOfxProject ${libs} )
  
You're now ready to build your project.
mkdir build
cd build
cmake ../
make 
  
By default CMake generates Unix makefiles. You can use the -G switch to select another generator (Xcode, Code::Blocks...).

A more complicated example: Using addons

You may want to use available addons to build more complicated projects. I personally use ofxOpenCv for one of my projects. The ofxOpenCv addon depends on the ofxVectorMath addon. You can copy the source code of both addons into your project root. Your project layout should look like this:
ofxSampleProjectWithAddons
|-- ofxOpenCv
|   `-- src
|       |-- ofxCvBlob.h
|       |-- ofxCvColorImage.cpp
|       |-- ofxCvColorImage.h
|       |-- ofxCvConstants.h
|       |-- ofxCvContourFinder.cpp
|       |-- ofxCvContourFinder.h
|       |-- ofxCvFloatImage.cpp
|       |-- ofxCvFloatImage.h
|       |-- ofxCvGrayscaleImage.cpp
|       |-- ofxCvGrayscaleImage.h
|       |-- ofxCvImage.cpp
|       |-- ofxCvImage.h
|       |-- ofxCvMain.h
|       |-- ofxCvShortImage.cpp
|       |-- ofxCvShortImage.h
|       `-- ofxOpenCv.h
|-- ofxVectorMath
|   `-- src
|       |-- ofxMatrix3x3.cpp
|       |-- ofxMatrix3x3.h
|       |-- ofxPoint2f.cpp
|       |-- ofxPoint2f.h
|       |-- ofxPoint3f.cpp
|       |-- ofxPoint3f.h
|       |-- ofxPoint4f.cpp
|       |-- ofxPoint4f.h
|       |-- ofxVec2f.cpp
|       |-- ofxVec2f.h
|       |-- ofxVec3f.cpp
|       |-- ofxVec3f.h
|       |-- ofxVec4f.cpp
|       |-- ofxVec4f.h
|       `-- ofxVectorMath.h
`-- src
    |-- main.cpp
    |-- testApp.cpp
    `-- testApp.h

  


You can add your include and library checks to the project.

CMakeFiles.txt:
cmake_minimum_required(VERSION 2.6)                                                                                            

project(ofxSampleProjectWithAddons)

include ( ../CMakeBase.txt ) 

# add ofx includes directories for dependencies 
include_directories ( ${ofx_includes} ) 

file ( GLOB_RECURSE app_sources_files src/*)

# VectorMath addon ##############################
# add VectorMath addon sourcecode
file ( GLOB_RECURSE ofxVectorMath ofxVectorMath/src/* )
# add includes path for this addon
include_directories ( ofxVectorMath/src/ )

# OpenCv addon ##################################
# add ofxOpenCv source code
file ( GLOB_RECURSE ofxOpenCv ofxOpenCv/src/* )
# add include path for this addon
include_directories ( ofxOpenCv/src/ )

# search for opencv library and includes using pkg-config 
include(FindPkgConfig)
pkg_search_module(cv opencv)
include_directories ( ${cv_INCLUDE_DIRS} )
#################################################

# create an executable with all source files
add_executable( ofxSampleProjectWithAddons
    ${app_sources_files}
    ${ofxOpenCv}
    ${ofxVectorMath}
    ${OFX_SOURCE_FILES} )

# link with cv libraries and openFrameworks dependencies
set ( libs ${ofx_libs} ${cv_LIBRARIES})
target_link_libraries(ofxSampleProjectWithAddons ${libs} )