How to use the BeautifulSoup In Python

Method 1: Using the Python lxml library

In this approach, we will use the BeautifulSoup module to parse the raw HTML document using html.parser and modify the parsed document and write it to an XML file. Provide the path to open the HTML file and read the HTML file and Parse it using BeautifulSoup’s html.parser, returning an object of the parsed document.

BeautifulSoup(inp, ‘html.parser’)

To remove the DocType HTML, we need to first get the string representation of the document using soup.prettify() and then split the document by lines using splitlines(), returning a list of lines.

soup.prettify().splitlines()

Code:

Python3

# Import the required library 
from bs4 import BeautifulSoup 
  
# Main Function 
if __name__ == '__main__': 
  
    # Provide the path of the html file 
    file = "input.html"
  
    # Open the html file and Parse it  
    # using Beautiful soup's html.parser. 
    with open(file, 'r', encoding='utf-8') as inp: 
        soup = BeautifulSoup(inp, 'html.parser') 
      
    # Split the document by lines and join the lines 
    # from index 1 to remove the doctype Html as it is  
    # present in index 0 from the parsed document. 
    lines = soup.prettify().splitlines() 
    content = "\n".join(lines[1:]) 
  
    # Open a output.xml file and write the modified content. 
    with open("output.xml", 'w', encoding='utf-8') as out: 
        out.write(content)

Output:

Parsing and converting HTML documents to XML format using Python

In this article, we are going to see how to parse and convert HTML documents to XML format using Python.

It can be done in these ways:

Using Ixml module.
Using Beautifulsoup module.

Tags:

#Python-XML #Python #python

Method 1: Using the Python lxml library

How to use the BeautifulSoup In Python

Python3

Parsing and converting HTML documents to XML format using Python

Similar Reads

Contact Us