Remove HTML Tags From String in Python

While collecting data, we often need to process text that contains HTML tags. In this article, we will discuss different ways to remove HTML tags from a string in Python.

Remove HTML Tags From String in Python Using Regular Expressions

Regular expressions are a common way to process text data, and we can use them to remove HTML tags from a string in Python. For this, we can use the sub() function defined in the re module.

The sub() function takes the pattern to be replaced as its first input argument, the replacement string as its second input argument, and the original string as its third input argument.

After execution, it returns a new string in which every substring of the original string that matches the pattern has been replaced with the replacement string. The original string is not modified.
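As a quick illustration of the argument order, here is a minimal sketch that has nothing to do with HTML yet. The sample text and the pattern `\d+` are chosen only for this example.

```python
import re

# re.sub(pattern, replacement, string) returns a new string;
# the original string is left unchanged.
original = "room 101, floor 7"
masked = re.sub(r"\d+", "#", original)

print(masked)    # room #, floor #
print(original)  # room 101, floor 7
```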

To remove HTML tags from a string in Python using the sub() function, we first define a pattern that matches any HTML tag: an opening angle bracket <, the characters inside the tag, and a closing angle bracket >.
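The original pattern is not reproduced here. One pattern that fits the description, assumed for this sketch, is `<[^>]*>`; the name TAG_PATTERN and the sample string are also just for illustration.

```python
import re

# Assumed pattern: "<", then any run of characters other than ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

sample = "<p>Hello <b>World</b></p>"

# findall() lists every substring that the pattern treats as a tag.
print(TAG_PATTERN.findall(sample))  # ['<p>', '<b>', '</b>', '</p>']
```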

After creating the pattern, we replace each substring that matches it with an empty string "" using the sub() function. In this way, we can remove the HTML tags from any given string in Python.

The following source code removes HTML tags from a string in Python using the sub() function.
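A minimal sketch of this approach, assuming the pattern `<[^>]*>`. The function name remove_html_tags and the sample string are hypothetical and used only for this example.

```python
import re


def remove_html_tags(text):
    """Return text with every <...> tag replaced by an empty string."""
    # Assumed pattern: "<", any characters except ">", then ">".
    return re.sub(r"<[^>]*>", "", text)


html_string = "<h1>Title</h1><p>Hello, <b>World</b>!</p>"
clean_text = remove_html_tags(html_string)
print(clean_text)  # TitleHello, World!
```

Note that a regular expression does not understand HTML. It leaves character entities such as &amp; as they are, keeps the contents of script and style elements, and a > character inside an attribute value ends the match early.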


Remove HTML Tags From String in Python Using the lxml Module

Instead of using regular expressions, we can also use the lxml module to remove HTML tags from a string in Python. For this, we first parse the original string using the fromstring() function defined in the lxml.html module.

The fromstring() function takes the original string as input and returns the root element of the parsed document, not a parser. We can then extract the text using the text_content() method of that element, leaving behind the HTML tags. The text_content() method returns an object of the lxml.etree._ElementUnicodeResult data type, which is a subclass of str. To get a plain string, we can convert the output using the str() function.

You can observe this in the following example.
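A minimal sketch of the lxml approach, assuming the third-party lxml package is installed. The sample string is made up for this example and is wrapped in a single div so that fromstring() returns that element.

```python
import lxml.html  # third-party package: pip install lxml

html_string = "<div><h1>Title</h1><p>Hello, <b>World</b>!</p></div>"

# Parse the string; the result is the root element of the parsed fragment.
root = lxml.html.fromstring(html_string)

# text_content() joins the text of the element and all of its descendants.
# str() turns the returned str subclass into a plain str.
text = str(root.text_content())
print(text)  # TitleHello, World!
```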


Remove HTML Tags From String in Python Using the BeautifulSoup Module

Like the lxml module, the BeautifulSoup module also provides various tools for processing HTML. To remove HTML tags from a string using the BeautifulSoup module, we can use the BeautifulSoup() constructor and the get_text() method.

In this approach, we first parse the string that contains HTML tags using the BeautifulSoup() constructor. It takes the original string as its first input argument and the name of the parser to use as its second input argument, which is optional. If the parser is omitted, BeautifulSoup picks one of the installed parsers and issues a warning, so it is better to name one explicitly. After execution, the constructor returns a BeautifulSoup object that represents the parsed document. We can invoke the get_text() method on this object to get the output string.

The following program demonstrates how to remove HTML tags from a string in Python using the BeautifulSoup module.
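A minimal sketch of the BeautifulSoup approach, assuming the third-party beautifulsoup4 package is installed. It uses the "html.parser" parser, which relies only on the Python standard library; the sample string is made up for this example.

```python
from bs4 import BeautifulSoup  # third-party package: pip install beautifulsoup4

html_string = "<div><h1>Title</h1><p>Hello, <b>World</b>!</p></div>"

# Parse the string with the parser that ships with Python.
soup = BeautifulSoup(html_string, "html.parser")

# get_text() returns the text of the document with all tags left out.
text = soup.get_text()
print(text)  # TitleHello, World!
```

The get_text() method also accepts a separator argument and a strip argument, which help when the text of neighboring elements should not run together.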


Conclusion

In this article, we have discussed different ways to remove HTML tags from a string in Python. The approaches with the lxml and BeautifulSoup modules parse the HTML string and extract its text, while the approach using regular expressions only deletes the tags. For simple input the outputs are the same, but the parsers also decode character entities such as &amp;, which the regular expression leaves untouched. You can use any of the approaches according to your input and your convenience.
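The difference between deleting tags and parsing the document shows up on input that contains a character entity. Here is a minimal sketch, assuming the pattern `<[^>]*>` for the regular expression and the third-party beautifulsoup4 package for the parser; the sample string is made up for this example.

```python
import re

from bs4 import BeautifulSoup  # third-party package: pip install beautifulsoup4

html_string = "<p>Tom &amp; Jerry</p>"

# The regular expression only deletes the tags.
regex_text = re.sub(r"<[^>]*>", "", html_string)

# The parser also decodes the &amp; entity.
parsed_text = BeautifulSoup(html_string, "html.parser").get_text()

print(regex_text)   # Tom &amp; Jerry
print(parsed_text)  # Tom & Jerry
```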

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
