Python offers many powerful, easy-to-use tools for scraping websites. One useful module for the job is Beautiful Soup.

In this tutorial we’ll build a simple ‘web scraper’ with Beautiful Soup. It will pull data from a Yahoo Finance page about stock options. It’s alright if you don’t know anything about stock options; the important thing is that the page contains a table of information that we’d like to use in our program. The page we’ll use is the listing of Apple Computer stock options.

First we need to get the HTML source for the page. Beautiful Soup won’t download the content for us, but we can do that with Python’s urllib module, one of the libraries that comes standard with Python.

### Fetching the Yahoo Finance Page

Python 3.x

from urllib.request import urlopen

optionsUrl = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
optionsPage = urlopen(optionsUrl)

Python 2.x

from urllib import urlopen

optionsUrl = 'http://finance.yahoo.com/q/op?s=AAPL+Options'
optionsPage = urlopen(optionsUrl)

This code retrieves the Yahoo Finance HTML and returns a file-like object.
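As a rough sketch of the same fetch on Python 3, the call can be wrapped in a small helper that closes the connection when it is done and gives up after a timeout. The helper name `fetch_html` and the 10-second default are assumptions made for illustration; they are not part of urllib.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes served at url (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Example call (needs network access, so it is left commented out):
# html = fetch_html('http://finance.yahoo.com/q/op?s=AAPL+Options')
```

The function returns bytes rather than a file-like object; BeautifulSoup accepts either.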

If you go to the page we opened with Python and use your browser’s “view source” command, you’ll see that it’s a large, complicated HTML file. It will be Python’s job to simplify it and extract the useful data using the BeautifulSoup module. BeautifulSoup is an external module, so you’ll have to install it. If you haven’t installed it already, you can do so with pip: pip install beautifulsoup4

The following code will load the page into BeautifulSoup:

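from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(optionsPage, 'html.parser')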
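To try this without touching the network, here is a minimal sketch that feeds BeautifulSoup an in-memory file-like object in place of the urlopen response. The tiny HTML document and the variable names are invented for illustration and are not taken from the Yahoo page.

```python
import io

from bs4 import BeautifulSoup

# Stand-in for the object returned by urlopen: anything with a read() method works.
fake_page = io.BytesIO(
    b"<html><head><title>AAPL Options</title></head>"
    b"<body><p>Hello</p></body></html>"
)

# 'html.parser' is the parser that ships with Python's standard library.
soup = BeautifulSoup(fake_page, "html.parser")

page_title = soup.title.string       # text inside <title>
first_paragraph = soup.p.get_text()  # text inside the first <p>
print(page_title, first_paragraph)
```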

### Beautiful Soup Example: Searching

Now we can start extracting information from the page source (HTML). The options have distinctive names in the “symbol” column, something like AAPL130328C00350000. The symbols may be different by the time you read this, but whichever one you pick, we can use BeautifulSoup to search the document for that unique string.

Let’s search the soup variable for this particular option (you may have to substitute a different symbol; just pick one from the webpage):

>>> soup.findAll(text='AAPL130328C00350000')
[u'AAPL130328C00350000']

This result isn’t very useful yet. It’s just a unicode string (that’s what the ‘u’ means) of what we searched for. However, BeautifulSoup returns things in a tree format, so we can find the context in which this text occurs by asking for its parent node like so:

>>> soup.findAll( text='AAPL130328C00350000')[0].parent
<a href="/q?s=AAPL130328C00350000">AAPL130328C00350000</a>

We don’t see all the information from the table. Let’s try the next level higher.

>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent
<td><a href="/q?s=AAPL130328C00350000">AAPL130328C00350000</a></td>

And again.

>>> soup.findAll(text='AAPL130328C00350000')[0].parent.parent.parent
<tr><td nowrap="nowrap"><a href="/q/op?s=AAPL&amp;k=110.000000"><strong>110.00</strong></a></td><td><a href="/q?s=AAPL130328C00350000">AAPL130328C00350000</a></td><td align="right"><b>1.25</b></td><td align="right"><span id="yfs_c63_AAPL130328C00350000"> <b style="color:#000000;">0.00</b></span></td><td align="right">0.90</td><td align="right">1.05</td><td align="right">10</td><td align="right">10</td></tr>

Bingo. It’s still a little messy, but all of the data we need is there. If you ignore everything inside the angle brackets, you can see that this is just the data from one row.
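Chaining .parent works, but it depends on knowing how many levels to climb. As an alternative sketch, the find_parent method climbs the tree until it reaches a tag with the given name. The HTML below is a trimmed stand-in for the row shown above, not the live page, and the string= argument assumes a reasonably recent Beautiful Soup 4 release (older releases spell it text=).

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for one row of the options table (illustrative only).
SAMPLE_HTML = (
    '<table><tr>'
    '<td nowrap="nowrap"><strong>110.00</strong></td>'
    '<td><a href="/q?s=AAPL130328C00350000">AAPL130328C00350000</a></td>'
    '<td align="right"><b>1.25</b></td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the text node, then climb straight to the enclosing <tr>.
text_node = soup.find(string="AAPL130328C00350000")
row = text_node.find_parent("tr")

# Collect the visible text of each cell in that row.
cells = [td.get_text(strip=True) for td in row.find_all("td")]
print(cells)
```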

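To pull out every row at once, we can find one cell in each row and then read the text of everything in that cell’s row:

optionsTable = [
    [x.text for x in y.parent.contents]
    for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': ''})
]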


This code is a little dense, so let’s take it apart piece by piece. It is a list comprehension within a list comprehension. Let’s look at the outer loop first:

for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': ''})

This uses BeautifulSoup’s findAll function to get all of the HTML elements with a td tag, a class of yfnc_h and a nowrap attribute. We chose this combination because it matches exactly one cell in every table row.

If we had just gotten the td’s with the class yfnc_h, we would have gotten seven elements per table entry. Another thing to note is that we have to wrap the attributes in a dictionary because class is one of Python’s reserved words. From the table above it would return this:

<td nowrap="nowrap"><a href="/q/op?s=AAPL&amp;k=110.000000"><strong>110.00</strong></a></td>
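As a side note on the reserved word, Beautiful Soup 4 also accepts a class_ keyword (with a trailing underscore) as an alternative to the dictionary, and its current name for findAll is find_all. The sketch below compares the two spellings on a made-up row; the markup is an assumption modelled on the class names used in this tutorial, not a copy of the Yahoo page.

```python
from bs4 import BeautifulSoup

# Made-up row using the class name from the tutorial (illustrative only).
SAMPLE_HTML = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap"><strong>110.00</strong></td>'
    '<td class="yfnc_h">AAPL130328C00350000</td>'
    '<td class="yfnc_h" align="right"><b>1.25</b></td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_dict = soup.find_all("td", attrs={"class": "yfnc_h"})
by_keyword = soup.find_all("td", class_="yfnc_h")

# Passing True matches any tag that has the attribute, whatever its value.
first_cells = soup.find_all("td", class_="yfnc_h", nowrap=True)
print(len(by_dict), len(by_keyword), len(first_cells))
```

Either way, adding the nowrap condition narrows the match to the first cell of each row.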

We need to go one level up, to the row that contains this cell, and then get the text from all of that row’s child nodes. That’s what this code does:

[x.text for x in y.parent.contents]
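For readers who find the nested comprehension hard to follow, here is the same idea written as ordinary loops. This is a sketch, not a drop-in replacement: it runs on invented sample rows (the second symbol and all prices are made up), and it swaps .contents and .text for find_all('td') and get_text(strip=True) so that whitespace between tags is skipped.

```python
from bs4 import BeautifulSoup

# Invented sample rows; only the class name follows the tutorial.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="yfnc_h" nowrap="nowrap"><strong>110.00</strong></td>
    <td class="yfnc_h">AAPL130328C00350000</td>
    <td class="yfnc_h" align="right"><b>1.25</b></td>
  </tr>
  <tr>
    <td class="yfnc_h" nowrap="nowrap"><strong>115.00</strong></td>
    <td class="yfnc_h">AAPL130328C00355000</td>
    <td class="yfnc_h" align="right"><b>0.95</b></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

options_table = []
# Outer loop: one matching cell per row (the cell that carries nowrap).
for first_cell in soup.find_all("td", attrs={"class": "yfnc_h", "nowrap": True}):
    row = first_cell.parent  # climb from the cell to its <tr>
    # Inner loop: the visible text of every cell in that row.
    row_values = [cell.get_text(strip=True) for cell in row.find_all("td")]
    options_table.append(row_values)

print(options_table)
```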

This works, but you should be careful if this is code you plan to reuse frequently. If Yahoo changes the way they format their HTML, this could stop working. If you plan to use code like this in an automated way, it would be best to wrap it in a try/except block and validate the output.
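One possible shape for that safety net is sketched below. The function names extract_rows and safe_extract, the expected_columns parameter and the error messages are all assumptions made for illustration; the right checks depend on what your program does with the data.

```python
from bs4 import BeautifulSoup


def extract_rows(html, expected_columns):
    """Parse option rows and raise ValueError if the layout looks wrong."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for first_cell in soup.find_all("td", attrs={"class": "yfnc_h", "nowrap": True}):
        cells = [td.get_text(strip=True) for td in first_cell.parent.find_all("td")]
        if len(cells) != expected_columns:
            raise ValueError(
                "expected %d columns, found %d" % (expected_columns, len(cells))
            )
        rows.append(cells)
    if not rows:
        raise ValueError("no option rows found; the page layout may have changed")
    return rows


def safe_extract(html, expected_columns):
    """Return the parsed rows, or an empty list if validation fails."""
    try:
        return extract_rows(html, expected_columns)
    except ValueError as error:
        print("Scrape failed:", error)
        return []
```

A caller can then treat an empty list as a signal to stop and investigate instead of processing bad data.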

This is only a simple Beautiful Soup example, but it gives you an idea of what you can do with HTML and XML parsing in Python. The Beautiful Soup documentation describes many more tools for searching and validating HTML documents.