Tech Memo Blog

Saturday, January 27, 2007

Setting User-agent and other headers with urllib2

>>> import urllib2
>>> opener = urllib2.build_opener()
>>> opener.addheaders = [('User-agent', 'Python2.4.3'),('Accept-Language','ja,en-us;q=0.7,en;q=0.3')]
>>> url = 'http://www.example.com/'
>>> f = opener.open(url)
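
For reference, the same opener setup carries over to Python 3, where urllib2 became urllib.request. A minimal sketch, assuming Python 3; the helper name make_opener is hypothetical and not from the original memo. No request is sent until open() is called.

```python
import urllib.request


def make_opener(user_agent, accept_language):
    """Build an opener whose default headers go out with every request."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples. Assigning a new list
    # replaces the default User-agent entry that build_opener() sets.
    opener.addheaders = [
        ('User-agent', user_agent),
        ('Accept-Language', accept_language),
    ]
    return opener


opener = make_opener('Python2.4.3', 'ja,en-us;q=0.7,en;q=0.3')
# f = opener.open('http://www.example.com/')  # this line performs the request
```

Headers set this way apply to every URL opened through this opener, which is handy when crawling several pages with the same settings.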

Tuesday, January 23, 2007

Getting the content-type with Python urllib2

>>> from urllib2 import urlopen
>>> url1 = 'http://www.example.jp'
>>> f = urlopen(url1)
>>> f.info().get('content-type')
'text/html;charset=utf-8'

>>> f.info().get('content-type', 'text/html;charset=utf-8').split(';')[0]
'text/html'
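
Splitting on ';' works for simple headers, but the charset parameter is often wanted too. A minimal sketch, assuming Python 3 and the standard email.message.Message class; the function name parse_content_type is hypothetical. In Python 3 the object returned by urlopen(...).info() is itself a Message subclass, so the same two methods can be called on it directly.

```python
from email.message import Message


def parse_content_type(header_value, default='text/html'):
    """Return (media_type, charset) from a Content-Type header value.

    charset is None when the header carries no charset parameter.
    """
    msg = Message()
    # Fall back to the default when the server sent no Content-Type at all.
    msg['Content-Type'] = header_value or default
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type('text/html;charset=utf-8'))  # ('text/html', 'utf-8')
print(parse_content_type(None))
```

Both values come back lowercased, so they can be compared without worrying about how the server capitalized them.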

Saturday, January 20, 2007

Installing PyLucene

Download
http://downloads.osafoundation.org/PyLucene/src/
Latest version at the time of writing: PyLucene-src-2.0.0-5.tar.gz

$ tar zxvf PyLucene-src-2.0.0-5.tar.gz
$ cd PyLucene-src-2.0.0-5

Move the db-4.4.20 directory into PyLucene-src-2.0.0-5.

Edit the Makefile
$ vi Makefile

DB_VER=4.4.20

# Linux (with gcc 3.4.4 and libgcj statically linked)
PREFIX=/usr/local
PREFIX_PYTHON=$(PREFIX)
#GCJ_HOME=/usr/local/gcc-3.4.4
GCJ_HOME=/usr
GCJ_LIBDIR=$(GCJ_HOME)/lib
GCJ_STATIC=0
LIB_INSTALL=libstdc++.so.6 libgcc_s.so.1
DB=$(PYLUCENE)/db-$(DB_VER)
PREFIX_DB=$(PREFIX)/BerkeleyDB.$(DB_LIB_VER)
ANT=ant
PYTHON=$(PREFIX_PYTHON)/bin/python


$ make
# make install

Installing Berkeley DB 4.4.20

Download
http://www.oracle.com/technology/software/products/berkeley-db/db/index.html

$ cd db-4.4.20/build_unix
$ ../dist/configure
$ make
# make install

This installs into /usr/local/BerkeleyDB.4.4.

Add the following line to /etc/ld.so.conf:
/usr/local/BerkeleyDB.4.4/lib

# ldconfig
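
When repeating this setup, it helps to avoid adding the same directory to ld.so.conf twice. A minimal sketch, assuming Python 3; the function name ensure_lib_dir and the sample file contents are hypothetical. It only works on a string, so writing the result back to /etc/ld.so.conf (as root) and running ldconfig are still separate steps.

```python
def ensure_lib_dir(conf_text, lib_dir):
    """Return conf_text with lib_dir appended unless it is already listed."""
    entries = [line.strip() for line in conf_text.splitlines()]
    if lib_dir in entries:
        return conf_text  # already present, nothing to change
    # Make sure the new entry starts on its own line.
    if conf_text and not conf_text.endswith('\n'):
        conf_text += '\n'
    return conf_text + lib_dir + '\n'


# Hypothetical existing contents of ld.so.conf, for illustration only.
sample = 'include ld.so.conf.d/*.conf\n'
updated = ensure_lib_dir(sample, '/usr/local/BerkeleyDB.4.4/lib')
print(updated, end='')
```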

Monday, January 08, 2007

Parsing HTML with Python

BeautifulSoup

Usage
>>> import urllib2
>>> html=urllib2.urlopen('http://www.XXXXXXX.jp')
>>> from BeautifulSoup import BeautifulSoup as BF
>>> s = BF(html)
>>> links = s.body.findAll('a')  # avoid shadowing the built-in list
>>> links[3].string
u'hogehogehogehogehoge'
>>> links[3]['href']
u'http://XXXXXX/hoge/hoge'
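
When BeautifulSoup is not installed, the same link extraction can be approximated with the standard library alone. A minimal sketch, assuming Python 3 and html.parser in place of BeautifulSoup; the class name LinkCollector and the sample HTML are hypothetical. It is far less forgiving of broken markup than BeautifulSoup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in the input."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._in_a = False  # True while inside an open <a> element
        self._href = None   # href of the <a> currently open
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            self.links.append((self._href, ''.join(self._text)))
            self._in_a = False


# Hypothetical sample page, for illustration only.
sample_html = ('<html><body><a href="http://example.com/hoge">hoge</a> '
               '<a href="/fuga">fuga</a></body></html>')
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Each entry corresponds to what `.string` and `['href']` return for a simple anchor in the BeautifulSoup session above.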


Notes
>>> s.contents[2].contents[3].contents[1].contents[1].contents[1].name
u'a' #Tag name
>>> s.contents[2].contents[3].contents[1].contents[1].contents[1].string
u'hogehoge' #text
>>> s.h1
>>> s.a
>>> s.head
>>> s.body.next
>>> s.body.next.next.next.next.findAll('ul')