Using Feedparser in Python

Overview

In this post we will take a look at how to download and parse syndicated feeds with Python.

The Python module we will use for that is “Feedparser”.

The complete documentation can be found here.

What is RSS?

RSS stands for Rich Site Summary (often also expanded as Really Simple Syndication). It uses a family of standard web feed formats to publish frequently updated information such as blog entries, news headlines, audio, and video.

An RSS document (called a “feed”, “web feed”, or “channel”) includes full or summarized text plus metadata such as the publishing date and the author’s name. [source]

What is Feedparser?

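Feedparser is a Python library that parses feeds in all known formats, including Atom, RSS, and RDF. At the time of writing it runs on Python 2.4 all the way up to 3.3; newer releases target Python 3 only, so check the project documentation for the versions supported today. [source]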

RSS Elements

Before we install the feedparser module and start to code, let’s take a look at some of the available RSS elements.

The most commonly used elements in RSS feeds are “title”, “link”, “description”, “publication date”, and “entry ID”.

The less commonly used elements are “image”, “categories”, “enclosures”, and “cloud”.
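
To make these element names concrete, here is a minimal sketch of what an RSS 2.0 document can look like. The feed content, titles, and example.org URLs are invented for illustration, and the standard library's ElementTree is used only to list the tags, not as a replacement for feedparser.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; every value below is illustrative only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.org/</link>
    <description>A sample feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.org/first</link>
      <description>Summary of the first post</description>
      <pubDate>Mon, 14 Oct 2013 09:00:00 GMT</pubDate>
      <guid>http://example.org/first</guid>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)
channel = root.find("channel")

# Channel-level elements (everything except the items themselves).
channel_tags = [child.tag for child in channel if child.tag != "item"]
# Elements of the first item: "pubDate" is the publication date, "guid" the entry ID.
item_tags = [child.tag for child in channel.find("item")]

print(channel_tags)
print(item_tags)
```

The channel carries the feed-wide “title”, “link”, and “description”, while each item repeats those names for a single entry and adds its own date and ID.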

Install Feedparser

To install feedparser on your computer, open your terminal and install it using “pip” (a tool for installing and managing Python packages).

sudo pip install feedparser

To verify that feedparser is installed, you can run “pip list” and look for it in the output.

You can of course also enter the interactive mode and import the feedparser module there.

If the import returns to the prompt without an error, as below, you can be sure it’s installed.

>>> import feedparser
>>>
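
The same check can be done from a script instead of the interactive prompt. This is a small sketch, assuming Python 3.4 or newer (where importlib.util.find_spec is available); the helper name is_installed is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# Prints True or False depending on whether feedparser is available.
print("feedparser installed:", is_installed("feedparser"))
```

Unlike a plain import, this only looks the module up and does not load it, so it is safe to run even when the package is missing.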

Now that we have installed the feedparser module, we can go ahead and begin to work with it.

Getting the RSS feed

You can use any RSS feed that you want. Since I like to read Reddit, I will use that for my example.

Reddit is made up of many sub-reddits. The one I am particularly interested in for now is the “Python” sub-reddit.

The way to get the RSS feed is to look up the URL of that sub-reddit and add “.rss” to it.

The RSS feed that we need for the Python sub-reddit is:
http://www.reddit.com/r/python/.rss
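
If you want to follow more than one sub-reddit, the URL rule described above can be wrapped in a tiny function. This is only a sketch: the helper name subreddit_feed_url is hypothetical, and it assumes the URL pattern shown in this post.

```python
def subreddit_feed_url(subreddit):
    """Build the RSS URL for a sub-reddit by appending ".rss" to its URL."""
    # Tolerate stray whitespace and slashes around the name.
    name = subreddit.strip().strip("/")
    return "http://www.reddit.com/r/" + name + "/.rss"


print(subreddit_feed_url("python"))
```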

Using Feedparser

You start your program by importing the feedparser module.

import feedparser

Parse the feed. Pass in the URL of the RSS feed that you want.

d = feedparser.parse('http://www.reddit.com/r/python/.rss')

The channel elements are available in d.feed (remember the “RSS Elements” above).

The items are available in d.entries, which is a list.

You access items in the list in the same order in which they appear in the original feed, so the first item is available in d.entries[0].

Print the title of the feed

print(d['feed']['title'])

>>> Python

Resolves relative links

print(d['feed']['link'])

>>> http://www.reddit.com/r/Python/

Parse escaped HTML

print(d.feed.subtitle)

>>> news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python

See number of entries

print(len(d['entries']))

>>> 25

Each entry in the feed is a dictionary. Use [0] to access the first entry and print its title.

print(d['entries'][0]['title'])

>>> Functional Python made easy with a new library: Funcy

Print the link of the first entry

print(d.entries[0]['link'])

>>> http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_library/

Use a for loop to print all posts and their links.

for post in d.entries:
    print(post.title + ": " + post.link + "\n")

>>>
Functional Python made easy with a new library: Funcy: http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_library/

Python Packages Open Sourced: http://www.reddit.com/r/Python/comments/1od7nn/python_packages_open_sourced/

PyEDA 0.15.0 Released: http://www.reddit.com/r/Python/comments/1oet5m/pyeda_0150_released/

PyMongo 2.6.3 Released: http://www.reddit.com/r/Python/comments/1ocryg/pymongo_263_released/
...
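
Not every feed fills in every field, so a loop like the one above can fail on an entry that has no title or link. The sketch below is one way to guard against that. The helper name format_entries is made up for this example, and plain dictionaries stand in for the feed entries; since entries behave like dictionaries, d.entries can be passed in the same way.

```python
def format_entries(entries):
    """Return one "title: link" line per entry, tolerating missing fields."""
    lines = []
    for entry in entries:
        # .get() returns the fallback instead of raising when a key is absent.
        title = entry.get("title", "(no title)")
        link = entry.get("link", "(no link)")
        lines.append(title + ": " + link)
    return lines


# Stand-in data for illustration; the second entry deliberately has no link.
sample_entries = [
    {"title": "PyEDA 0.15.0 Released",
     "link": "http://www.reddit.com/r/Python/comments/1oet5m/pyeda_0150_released/"},
    {"title": "An entry without a link"},
]

for line in format_entries(sample_entries):
    print(line)
```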

Reports the feed type and version

print(d.version)

>>> rss20

Full access to all HTTP headers

print(d.headers)

>>>
{'content-length': '5393', 'content-encoding': 'gzip', 'vary': 'accept-encoding', 'server': "'; DROP TABLE servertypes; --", 'connection': 'close', 'date': 'Mon, 14 Oct 2013 09:13:34 GMT', 'content-type': 'text/xml; charset=UTF-8'}

Just get the content-type from the header

print(d.headers.get('content-type'))

>>> text/xml; charset=UTF-8
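
The content-type value packs two pieces of information into one string: the media type and the character set. If you need them separately, one option is the standard library's email.message.Message class, which understands this header syntax. This is a sketch on the side rather than a feedparser feature, and the helper name split_content_type is made up for this example.

```python
from email.message import Message


def split_content_type(header_value):
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # The charset is returned in lower case, or None if the header has none.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = split_content_type("text/xml; charset=UTF-8")
print(media_type)
print(charset)
```

With the header value from above, this gives the media type text/xml and the charset utf-8.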

Using feedparser is an easy and fun way to parse RSS feeds.

Sources

http://www.slideshare.net/LindseySmith1/feedparser
http://code.google.com/p/feedparser/