This HTML Parser tutorial teaches the basic techniques and code for parsing text files formatted in HyperText Markup Language (HTML) and XHTML using Python. Those searching for a tutorial on web scraping and parsing HTML in Python with Beautiful Soup can also refer to this page. From this tutorial, both beginners and experienced Python programmers can learn how HTML Parser works and how to scrape HTML content, with examples.


html.parser — Simple HTML and XHTML parser

This module defines a class HTMLParser that serves as the basis for parsing text files formatted in HTML and XHTML.

Syntax: class html.parser.HTMLParser(*, convert_charrefs=True)

This creates a parser instance that is able to parse invalid markup.

If convert_charrefs is True (the default), all character references (except the ones in script/style elements) are automatically converted to the corresponding Unicode characters.
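For instance, with convert_charrefs left at its default of True, references like &gt; and &amp; reach handle_data already converted. Here is a minimal sketch (RefParser is just an illustrative name):

```python
from html.parser import HTMLParser

class RefParser(HTMLParser):
    # Collect every chunk of text the parser hands to handle_data
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

p = RefParser()                       # convert_charrefs=True by default
p.feed('<p>3 &gt; 2 &amp; 1</p>')
p.close()
print(''.join(p.chunks))              # 3 > 2 & 1
```

The character references arrive in handle_data already replaced by the characters they name.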

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The programmer or user needs to subclass HTMLParser and override its methods to implement the desired behavior.

Prerequisites for Web Scraping and Parsing HTML in Python with Beautiful Soup

To gain proper knowledge about the HTML Parser in Python, you should get a good grip on the following required concepts:

  1. Python 3
  2. Basic HTML
  3. urllib.request (not mandatory but recommended)
  4. Basic OOP concepts
  5. Python data structures – Lists, Tuples


Why parse HTML?

Python is one of the languages that is extensively used to scrape data from web pages. This is a very easy way to gather information. For instance, it can be very helpful for quickly extracting all the links in a web page and checking for their validity. This is only one example of many potential uses… so read on!

The next question is: where is this information extracted from? To answer this, let’s use an example. Go to the website NYTimes and right-click on the page. Select View page source or simply press the keys Ctrl + u on your keyboard. A new page opens containing a number of links, HTML tags, and content. This is the source from which the HTML Parser scrapes content for NYTimes!

What is HTML Parser in Python?

HTML Parser, as the name suggests, simply parses a web page’s HTML/XHTML content and provides the information we are looking for. This is a class that is defined with various methods that can be overridden to suit our requirements. Note that to use HTML Parser, the web page must first be fetched. For this reason, HTML Parser is often used together with urllib.request (known as urllib2 in Python 2).

To use the HTML parser, you have to import this module:

from html.parser import HTMLParser


Methods in HTML Parser

  1. HTMLParser.feed(data) – It is through this method that the HTML Parser reads data. It accepts its input as a string. Data is processed as soon as it forms complete elements; incomplete data is buffered until more data is fed or close() is called. Only after data is fed using this method can the other methods of the HTML Parser be called.
  2. HTMLParser.close() – This method is called to mark the end of the input feed to the HTML Parser.
  3. HTMLParser.reset() – This method resets the instance and all unprocessed data is lost.
  4. HTMLParser.handle_starttag(tag, attrs) – This method deals with start tags only, like <title>. The tag argument is the name of the start tag, whereas attrs holds the attributes inside the start tag. For example, for the tag <Meta name="PT"> the method call would be handle_starttag('meta', [('name', 'PT')]). Note that the tag name is converted to lowercase and the attributes are converted to (name, value) tuples collected in a list. For example, for the tag <meta name="application-name" content="The New York Times" /> the method call would be handle_starttag('meta', [('name', 'application-name'), ('content', 'The New York Times')]).
  5. HTMLParser.handle_endtag(tag) – This method is pretty similar to the above method, except that this deals only with end tags like </body>. Since there will be no content inside an end tag, this method takes only one argument, which is the tag itself. For example, the method call for </body> will be: handle_endtag('body'). Similar to the handle_starttag(tag, attrs) method, this also converts tag names to lowercase.
  6. HTMLParser.handle_startendtag(tag, attrs) – As the name suggests, this method deals with start-end (self-closing) tags like <a href="http://nytimes.com" />. The arguments tag and attrs are the same as in the HTMLParser.handle_starttag(tag, attrs) method.
  7. HTMLParser.handle_data(data) – This method deals with the data/content between tags, like the text inside <p> … </p>. This is particularly helpful when you want to look for specific words or expressions. This method combined with regular expressions can work wonders.
  8. HTMLParser.handle_comment(data) – As the name suggests, this method deals with comments like <!--ny times-->, for which the method call would be HTMLParser.handle_comment('ny times').

Whew! That’s a lot to process, but these are some of the main (and most useful) methods of HTML Parser. If your head is swirling don’t worry, let’s look at an example to make things a little more clear.
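To see these handler methods in action before the full application below, here is a minimal sketch of a subclass (DemoParser is an illustrative name) that records every event it sees as a tuple:

```python
from html.parser import HTMLParser

class DemoParser(HTMLParser):
    # Record each event as a tuple so we can inspect the order later
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

parser = DemoParser()
parser.feed('<Meta name="PT"><title>NY Times</title><!--ny times-->')
parser.close()
for event in parser.events:
    print(event)
```

Running this prints the events in document order, with the tag names lowercased and the attributes turned into (name, value) tuples, exactly as described above.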

How does HTML Parser work?

Now that you are equipped with theoretical knowledge, let’s test things out practically. The example below fetches a web page with urllib.request. In Python 3, the old urllib2 module was merged into urllib.request, which ships with the standard library, so there is nothing extra to install (pip install urllib2 will not work).

Example HTML Parser Application

from html.parser import HTMLParser
import urllib.request as urllib2

class MyHTMLParser(HTMLParser):

   #Initializing lists in __init__ so every parser instance gets its own
   def __init__(self):
       super().__init__()
       self.lsStartTags = list()
       self.lsEndTags = list()
       self.lsStartEndTags = list()
       self.lsComments = list()

   #HTML Parser Methods
   def handle_starttag(self, startTag, attrs):
       self.lsStartTags.append(startTag)

   def handle_endtag(self, endTag):
       self.lsEndTags.append(endTag)

   def handle_startendtag(self,startendTag, attrs):
       self.lsStartEndTags.append(startendTag)

   def handle_comment(self,data):
       self.lsComments.append(data)

#creating an object of the overridden class
parser = MyHTMLParser()

#Opening NYTimes site using urllib2
html_page = urllib2.urlopen("https://www.nytimes.com/")

#Feeding the content
parser.feed(html_page.read().decode('utf-8'))

#printing the extracted values
print("Start tags", parser.lsStartTags)
#print("End tags", parser.lsEndTags)
#print("Start End tags", parser.lsStartEndTags)
#print("Comments", parser.lsComments)

Alternatively, if you don’t want to install urllib2, you can directly feed a string of HTML tags to the parser like so:

parser = MyHTMLParser()
parser.feed('<html><body><title>Test</title></body></html>')

Print one list at a time to avoid overwhelming your console, as you are dealing with a lot of data!

NOTE: In case you get the error: IDLE cannot start the process, start your Python IDLE in administrator mode. This should solve the problem.
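Because feed() buffers incomplete input, you can also hand the parser HTML in pieces; a handler only fires once its tag is complete. A small sketch (TagCollector is an illustrative name):

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    # Keep a running list of every start tag seen so far
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
parser.feed('<html><bo')   # '<bo' is incomplete, so it is buffered
print(parser.tags)         # ['html']
parser.feed('dy>')         # completes <body>, the handler fires now
print(parser.tags)         # ['html', 'body']
parser.close()
```

This is what makes feed() convenient for streaming data, for example when reading a large page in chunks.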

Parsing and navigating HTML with BeautifulSoup

Before we write more Python code to parse the content we want, let’s first take a glance at the HTML rendered by the browser. Every web page is different, and sometimes getting the right data out of them requires a bit of creativity, pattern recognition, and experimentation.

Our aim is to download a batch of MIDI files from this page, but it contains plenty of duplicate tracks as well as remixes. Whenever you write code to parse a web page, it’s generally helpful to use the developer tools available in most modern browsers.

Now, let’s use the find_all method to go through all of the page’s links, but filter them with regular expressions so that we only keep links to MIDI files whose text contains no parentheses, which lets us eliminate the duplicates and remixes.

To understand the process, create a file named nes_midi_scraper.py and add the following code to it:

import re

import requests
from bs4 import BeautifulSoup


vgm_url = 'https://www.vgmusic.com/music/console/nintendo/nes/'
html_text = requests.get(vgm_url).text
soup = BeautifulSoup(html_text, 'html.parser')


if __name__ == '__main__':
    attrs = {
        'href': re.compile(r'\.mid$')
    }

    tracks = soup.find_all('a', attrs=attrs, string=re.compile(r'^((?!\().)*$'))

    count = 0
    for track in tracks:
        print(track)
        count += 1
    print(len(tracks))

The example above filters through all of the MIDI files we want on the page, prints the link tag corresponding to each, and then prints how many files matched.
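The regular expression r'^((?!\().)*$' used as the string filter matches only text that contains no opening parenthesis, which is how titles like "Overworld Theme (remix)" get dropped. A quick standalone check with made-up track titles:

```python
import re

# Matches any string that contains no '(' character:
# every character must pass the negative lookahead (?!\()
no_parens = re.compile(r'^((?!\().)*$')

titles = ['Overworld Theme', 'Overworld Theme (remix)', 'Ending']
kept = [t for t in titles if no_parens.match(t)]
print(kept)   # ['Overworld Theme', 'Ending']
```

The same pattern is what BeautifulSoup applies to each link’s text when you pass it as the string argument to find_all.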

Run the code in your terminal with the command python nes_midi_scraper.py.

Exceptions

HTMLParser.HTMLParseError – This exception is raised when the HTML Parser encounters malformed data. It carries information in three attributes: msg tells you the reason for the error, lineno gives the line number where the error occurred, and offset gives the exact character offset where the offending construct starts. Note that this exception was deprecated in Python 3.3 and removed in Python 3.5; the modern parser recovers from errors instead of raising.
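Since Python 3.5, html.parser no longer raises on malformed markup; the parser simply recovers and keeps going. A small sketch (the Collector class is just an illustration) confirms that a mismatched end tag produces no exception:

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    # Gather every piece of text the parser hands back
    def __init__(self):
        super().__init__()
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

p = Collector()
p.feed('<p>a</b>b')   # </b> was never opened, yet no exception is raised
p.close()
print(p.data)         # ['a', 'b']
```

If you need strict validation rather than tolerant parsing, a dedicated validator is a better fit than html.parser.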

Conclusion

And that’s the end of this article on HTML Parser. Make sure to try out additional examples on your own to develop your understanding! Should you need an out-of-the-box email parser or a PDF table parsing solution, our sister sites have those for you while you get your Python mojo in order. Do read about BeautifulSoup, another amazing Python module that helps with HTML scraping; however, to use it you will have to install it first (pip install beautifulsoup4). Keep learning and happy Python programming!
