What is BeautifulSoup?
BeautifulSoup is a third-party Python library, hosted at crummy.com.
The library is designed for quick-turnaround projects like screen scraping.
What can it do?
Beautiful Soup parses anything you give it and does the tree traversal for you. You can use it to:
- Find all the links on a page.
- Find all the links whose URLs match "foo.com".
- Find the table heading that's got bold text, then give me that text.
- Find every "a" element that has an href attribute.
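The queries listed above can be sketched against a small invented HTML snippet. This is a minimal sketch only: the markup is made up for illustration, and it uses Python 3 syntax with the modern `find_all` name (the older `findAll` spelling used later in this article still works as an alias).

```python
import re
from bs4 import BeautifulSoup

# Invented markup, just to have something to query.
html = """
<table><tr><th><b>Price</b></th><th>Qty</th></tr></table>
<a href="http://foo.com/page">foo link</a>
<a href="http://bar.com/page">bar link</a>
<a name="anchor-only">no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Every link on the page.
all_links = soup.find_all("a")

# Links whose URLs match "foo.com".
foo_links = soup.find_all("a", href=re.compile("foo.com"))

# The table heading that's got bold text, then that text.
bold_heading = soup.find("th").find("b").get_text()

# Every "a" element that has an href attribute.
with_href = soup.find_all("a", href=True)

print(len(all_links), bold_heading, len(with_href))
```

Note how `href=re.compile(...)` and `href=True` filter on an attribute rather than on the tag name alone.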
What do I need?
You first need to install the BeautifulSoup module and then import it into your script.
You can install it with pip install beautifulsoup4 or easy_install beautifulsoup4.
It’s also available as the python-beautifulsoup4 package in recent versions of Debian and Ubuntu.
Beautiful Soup 4 works on both Python 2 (2.6+) and Python 3.
Before we start, we have to import two modules: BeautifulSoup and urllib2.
urllib2 is used to open the URL we want.
Since BeautifulSoup does not fetch the web page for you, you have to use the urllib2 module to do that.
# import the library used to query a website
import urllib2
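One caveat worth knowing: urllib2 exists only on Python 2; on Python 3 the same `urlopen` function lives in the standard-library module `urllib.request`. A version-agnostic import can be sketched like this:

```python
# urllib2 is Python 2 only; urllib.request is its Python 3 home.
try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3

# urlrequest.urlopen(...) now works under either version.
print(hasattr(urlrequest, "urlopen"))
```

The rest of this article sticks to the Python 2 spelling used by the original examples.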
Search and find all HTML tags
We will use the soup.findAll method to search the soup object for matching text and HTML tags within the page.
from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
links = soup.findAll("a")
links now holds every element on python.org with an "a" tag.
That is the tag that defines a hyperlink, which is used to link from one page to another.
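Often you want just the URLs rather than the whole tags. A short sketch of pulling out the href attribute with `.get()`, using inline markup invented for illustration in place of a fetched page (Python 3 syntax):

```python
from bs4 import BeautifulSoup

# Inline markup standing in for a downloaded page.
content = '<a href="/about/">About</a> <a href="/downloads/">Downloads</a>'
soup = BeautifulSoup(content, "html.parser")

# Pull just the URL out of each matched "a" tag.
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # -> ['/about/', '/downloads/']
```

`link.get("href")` returns None when the attribute is missing, which is safer than `link["href"]` on messy pages.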
Find all links on Reddit
Fetch the Reddit homepage's HTML using Python's built-in urllib2 module.
Once we have the actual HTML for the page, we create a BeautifulSoup object to take advantage of its simple API.
from bs4 import BeautifulSoup
import urllib2

pageFile = urllib2.urlopen("http://www.reddit.com")
pageHtml = pageFile.read()
pageFile.close()

soup = BeautifulSoup(pageHtml)
#sAll = soup.findAll("li")
sAll = soup.findAll("a")
for href in sAll:
    print href
Scraping the Huffington Post
Here is another example I saw on newthinktank.com.
from urllib import urlopen
from bs4 import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link>(.*)</link>')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2, 16)

soup2 = BeautifulSoup(webpage)
#print soup2.findAll("title")
titleSoup = soup2.findAll("title")
linkSoup = soup2.findAll("link")

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print " "
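The regex half of the script above can be tried without the network. A minimal sketch using only the standard-library re module, with an invented stand-in for the feed text (Python 3 syntax; the non-greedy `.*?` keeps each match inside one pair of tags when everything sits on a single line):

```python
import re

# A tiny invented stand-in for the RSS feed text.
feed = ("<title>First story</title><link>http://example.com/1</link>"
        "<title>Second story</title><link>http://example.com/2</link>")

# Grab everything between the title tags, and between the link tags.
titles = re.findall(re.compile('<title>(.*?)</title>'), feed)
links = re.findall(re.compile('<link>(.*?)</link>'), feed)

print(titles)  # -> ['First story', 'Second story']
print(links)   # -> ['http://example.com/1', 'http://example.com/2']
```

Regex is fine for quick jobs like this, but as the soup-based half of the script shows, findAll("title") is usually the more robust way to do the same thing.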