Remove HTML Tags From String in Python

While collecting data, we often need to process text that contains HTML tags. In this article, we will discuss different ways to remove HTML tags from a string in Python.

Remove HTML Tags From String in Python Using Regular Expressions

Regular expressions are a common tool for processing text data, and we can use them to remove HTML tags from a string in Python. For this, we can use the sub() function defined in the re module.

The sub() function takes the pattern to be replaced as its first argument, the replacement string as its second argument, and the original string as its third argument.

After execution, it returns a new string in which every non-overlapping match of the pattern in the original string is replaced with the replacement string. The original string is not modified.
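As a quick illustration of the argument order described above, here is a minimal sketch that uses re.sub() on a date string. The variable names and the sample string are made up for this example.

```python
import re

# re.sub(pattern, replacement, string) returns a new string;
# the original string is left unchanged.
original = "2021-03-15"
modified = re.sub(r"-", "/", original)

print(modified)  # 2021/03/15
print(original)  # 2021-03-15
```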

To remove HTML tags from a string in Python using the sub() function, we will first define a pattern that matches any HTML tag. For this, we will create a pattern that matches an opening angle bracket <, all the characters inside the tag, and the closing angle bracket >. The pattern is as follows.
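The exact pattern is not reproduced here, so the sketch below assumes a simple one: an opening <, any run of characters other than >, and a closing >. The name TAG_PATTERN and the sample string are illustrative only.

```python
import re

# Assumed pattern (not necessarily the one this article originally used):
# "<", then any run of characters other than ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

sample = "<p>Hello <b>world</b></p>"

# findall() lists every substring that the pattern treats as a tag.
print(TAG_PATTERN.findall(sample))  # ['<p>', '<b>', '</b>', '</p>']
```

A pattern this simple is likely to misfire on input where > appears inside an attribute value or where plain text contains both < and >, so it is best suited to simple, well-formed markup.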

After creating the pattern, we will replace each substring that matches it with an empty string "" using the sub() function. In this way, we can remove the HTML tags from a given string in Python.

Following is the source code to remove HTML tags from a string in Python using the sub() function.
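A minimal sketch of this approach, assuming the simple tag pattern <[^>]*> (the article's own pattern may differ). The function name remove_html_tags and the sample HTML are hypothetical.

```python
import re

# Assumed tag pattern: "<", any characters except ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def remove_html_tags(text: str) -> str:
    """Return text with every substring matching TAG_PATTERN removed."""
    return TAG_PATTERN.sub("", text)


html_string = (
    "<html><body><h1>Welcome</h1>"
    "<p>This is a <b>sample</b> string.</p></body></html>"
)
print(remove_html_tags(html_string))  # WelcomeThis is a sample string.
```

Note that the heading and the paragraph run together in the result, because the tags are replaced with an empty string rather than with whitespace.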

Remove HTML Tags From String in Python Using the lxml Module

Instead of using regular expressions, we can also use the lxml module to remove HTML tags from a string in Python. For this, we will first parse the original string using the fromstring() function defined in lxml.html.

The fromstring() function takes the original string as input and returns the root element of the parsed document, not a parser. We can then extract the text by calling the text_content() method on this element, leaving behind the HTML tags. The text_content() method can return an object of the lxml.etree._ElementUnicodeResult data type, which is a subclass of str. Therefore, we convert the output to a plain string using the str() function.

You can observe this in the following example.
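The steps above can be sketched as follows, assuming the third-party lxml package is installed. The function name remove_tags_with_lxml is hypothetical, and the import is placed inside the function only so that the dependency is needed when the function is actually called.

```python
def remove_tags_with_lxml(html_string: str) -> str:
    """Parse html_string with lxml and return only its text content."""
    # Third-party dependency (pip install lxml), imported lazily here.
    import lxml.html

    # fromstring() returns the root element of the parsed markup.
    root = lxml.html.fromstring(html_string)

    # text_content() collects the text of the element and its descendants;
    # str() converts the result to a plain built-in string.
    return str(root.text_content())
```

For example, calling remove_tags_with_lxml("<div><h1>Welcome</h1><p>This is a <b>sample</b> string.</p></div>") should give "WelcomeThis is a sample string.".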

Remove HTML Tags From String in Python Using the BeautifulSoup Module

Like the lxml module, the BeautifulSoup module also provides us with various tools to process text data. To remove HTML tags from a string using the BeautifulSoup module, we can use the BeautifulSoup() constructor and the get_text() method.

In this approach, we will first parse the string that contains HTML tags using the BeautifulSoup() constructor. It takes the original string as its first argument and the name of the parser to use, such as "html.parser", as its optional second argument. After execution, it returns a BeautifulSoup object that represents the parsed document. We can invoke the get_text() method on this object to get the output string.

The following program demonstrates how to remove HTML tags from a string in Python using the BeautifulSoup module.
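A minimal sketch of this approach, assuming the third-party beautifulsoup4 package is installed and using Python's built-in "html.parser" as the parser. The function name remove_tags_with_bs4 is hypothetical, and the import sits inside the function only so that the dependency is needed when the function is actually called.

```python
def remove_tags_with_bs4(html_string: str) -> str:
    """Parse html_string with BeautifulSoup and return only its text."""
    # Third-party dependency (pip install beautifulsoup4), imported lazily.
    from bs4 import BeautifulSoup

    # Naming the parser explicitly avoids the warning that BeautifulSoup
    # emits when it has to pick a parser on its own.
    soup = BeautifulSoup(html_string, "html.parser")

    # get_text() joins the text of all nodes in the parsed document.
    return soup.get_text()
```

For example, calling remove_tags_with_bs4("<div><h1>Welcome</h1><p>This is a <b>sample</b> string.</p></div>") should give "WelcomeThis is a sample string.".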

Conclusion

In this article, we have discussed different ways to remove HTML tags from a string in Python. The approaches that use the lxml and BeautifulSoup modules parse the HTML string and extract its text, whereas the approach that uses regular expressions only deletes the substrings that look like tags. For simple, well-formed input the outputs are the same, but they can differ on other input: for example, the parsers decode character references such as &amp; while the regular expression leaves them unchanged. You can choose an approach according to your input and your convenience.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
