Python – Scraping JavaScript-Driven Web Pages

Hi, I am migrating!

Because LaTeX support on the official WordPress is so weak, I am moving to community WordPress.

For this post visit: http://learningnotes.fromosia.com/index.php/2017/04/06/python-scrapping-javascript-driven-web/

Required Packages

dryscrape

Note that this package has no official Windows release. This post will be based on Ubuntu.

Installation

sudo apt-get install qt5-default qt5-qmake libqt5webkit5-dev xvfb
sudo -H pip install webkit-server
sudo -H pip install dryscrape

Tutorial

Using XPath to locate web content

Commonly used syntax:

Syntax            Effect
//                Select matching nodes at any depth below the current node (all descendants)
/                 Select matching nodes among the direct children of the current node
tag[@att='val']   Select every 'tag' element whose 'att' attribute equals 'val'

Examples

XML Content

<div><span id="DecentTag">First content to scrape </span>
<span class="Distraction"><span class="Distraction">
<span class="DecentClass"> Second content to scrape</span></span></span>
<div><span class="InnerSelf">Nope, nope, nope</span></div>
</div>

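Assuming this snippet sits directly inside the <body> of an HTML page, you can reach the three pieces of content with the following expressions. Note that the second one needs // after div, because the target span is nested inside two other spans rather than being a direct child of the div.

id('DecentTag')
/html/body/div//span[@class='DecentClass']
/html/body//span[@class='InnerSelf']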
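
The difference between / and // can be checked directly against the sample snippet. The sketch below is only an illustration, and it uses the standard library's xml.etree.ElementTree rather than a full XPath engine: ElementTree supports only a subset of XPath (relative paths, no id() function), so the expressions are written relative to the outer div and the id lookup is done with an attribute predicate. The variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample snippet from above, parsed as a standalone XML fragment.
SNIPPET = """
<div><span id="DecentTag">First content to scrape </span>
<span class="Distraction"><span class="Distraction">
<span class="DecentClass"> Second content to scrape</span></span></span>
<div><span class="InnerSelf">Nope, nope, nope</span></div>
</div>
""".strip()

root = ET.fromstring(SNIPPET)  # root is the outer <div>

# A single '/' step only looks at direct children of the current node.
direct_spans = root.findall("./span")

# '//' looks at every descendant, no matter how deeply nested.
all_spans = root.findall(".//span")

# Attribute predicates pick out one specific element.
first = root.find(".//span[@id='DecentTag']")
second = root.find(".//span[@class='DecentClass']")
third = root.find(".//span[@class='InnerSelf']")

# The nested span is not a direct child, so a '/' step does not find it.
missed = root.find("./span[@class='DecentClass']")

print(len(direct_spans), len(all_spans))
print(first.text.strip(), "|", second.text.strip(), "|", third.text.strip())
print(missed)
```

With lxml, which implements full XPath 1.0, the same checks can be made with tree.xpath() and the expressions listed above.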

Using Python to scrape web content

If your target data doesn't require JavaScript to run on the client, you can simply use the third-party requests package to download the page and lxml to parse it, as in the example below.

import lxml.html
import requests

url = "http://stackoverflow.com/help"
xpath = "id('help-index')/div[2]"

# Send a browser-like User-Agent; some sites reject the default one
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
tree = lxml.html.fromstring(r.content)

# xpath() returns a list of matching elements, so take the first match
elements = tree.xpath(xpath)
content = elements[0].text_content() if elements else None

Using Python to scrape JavaScript-driven pages

If the content you want is generated or updated by JavaScript, a plain HTTP request will not return it, because the scripts never run. Here we introduce dryscrape, a Python package for Linux that loads the page in a headless WebKit browser. A simple example is given below:

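import dryscrape

# Start a virtual X server (xvfb) so WebKit can run without a display
dryscrape.start_xvfb()
sess = dryscrape.Session()
sess.visit("http://stackoverflow.com/help")

# at_xpath() returns the first node matching the XPath expression
q = sess.at_xpath("some path")
content = q.text()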

It's as simple as that.
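
One practical wrinkle: when a page fills in its content with JavaScript after loading, the element you want may not exist yet at the moment you query it. A common workaround is to poll until it appears. The helper below is a hypothetical sketch, not part of dryscrape; the name wait_until and its parameters are made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or None if the timeout expires first.
    (Hypothetical helper for illustration; not a dryscrape API.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With the session from the example above, a call might look like wait_until(lambda: sess.at_xpath("some path")), assuming at_xpath returns None while nothing matches yet.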
