Remove HTML Tags From String in Python

While collecting data, we often need to process text that contains HTML tags. In this article, we will discuss different ways to remove HTML tags from a string in Python.

Remove HTML Tags From String in Python Using Regular Expressions

Regular expressions are a convenient way to process text data. We can also use them to remove HTML tags from a string in Python. For this, we can use the sub() function defined in the re module.

The sub() function takes the pattern of the substring that needs to be replaced as its first argument, the replacement string as its second argument, and the original string as its third argument.

After execution, it returns a new string in which every substring that matches the pattern has been replaced with the replacement string. The original string is not modified.
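To make the argument order concrete, here is a minimal sketch of a sub() call. The sentence and the words being swapped are made up for illustration and have nothing to do with HTML yet.

```python
import re

# Illustrative input; the sentence and the words are made up.
text = "The cat sat near another cat."

# re.sub(pattern, replacement, string) returns a new string in which
# every match of the pattern has been replaced.
result = re.sub(r"cat", "dog", text)

print(result)  # The dog sat near another dog.
print(text)    # The cat sat near another cat.
```

Because strings are immutable in Python, the input is left unchanged and the result has to be assigned to a variable.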

To remove HTML tags from a string in Python using the sub() function, we will first define a pattern that represents all the HTML tags. For this, we will create a pattern that matches an opening angle bracket <, the characters inside the tag, and the closing angle bracket >.
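One pattern that fits this description is <[^>]+>, which reads as "a <, then one or more characters that are not >, then a >". This particular pattern is an assumption made for illustration; a non-greedy variant such as <.*?> is also commonly used. The sketch below uses findall() to show which substrings of a made-up sample string the pattern matches.

```python
import re

# Assumed pattern: "<", one or more characters other than ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

# Made-up sample string for illustration.
sample = "<p>Hello <b>World</b></p>"

# findall() returns every non-overlapping match, in order.
matches = TAG_PATTERN.findall(sample)

print(matches)  # ['<p>', '<b>', '</b>', '</p>']
```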

After creating the pattern, we will substitute each substring that matches the pattern with an empty string "" using the sub() function. In this way, we can remove the HTML tags from a given string in Python.

Following is the source code to remove HTML tags from a string in Python using the sub() function.
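A minimal sketch of such a program is given here. It assumes the pattern <[^>]+> for matching tags, and both the helper name remove_html_tags and the sample string are made up for illustration.

```python
import re


def remove_html_tags(text):
    """Return text with every substring of the form <...> removed."""
    # Assumed pattern: "<", one or more characters other than ">", then ">".
    return re.sub(r"<[^>]+>", "", text)


# Made-up sample string for illustration.
html_string = "<div><h1>Title</h1><p>This is a <b>sample</b> string.</p></div>"

clean_text = remove_html_tags(html_string)

print(clean_text)  # TitleThis is a sample string.
```

Note that the tags are replaced with nothing, so the heading text and the paragraph text end up joined without a space between them.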


Remove HTML Tags From String in Python Using the lxml Module

Instead of using regular expressions, we can also use the lxml module to remove HTML tags from a string in Python. For this, we will first parse the original string using the fromstring() function defined in the lxml.html module.

The fromstring() function takes the original string as input and returns the root element of the parsed HTML. We can then extract the text using the text_content() method of that element, leaving behind the HTML tags. The text_content() method can return an object of the lxml.etree._ElementUnicodeResult data type, which is a subclass of str. Therefore, we can convert the output to a plain string using the str() function.

You can observe this in the following example.
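A minimal sketch of this approach is given here. It assumes that the third-party lxml package is installed, and the sample string is made up for illustration.

```python
from lxml import html

# Made-up sample string for illustration.
html_string = "<div><h1>Title</h1><p>This is a <b>sample</b> string.</p></div>"

# fromstring() parses the markup and returns the root element.
root = html.fromstring(html_string)

# text_content() collects the text of the element and all its descendants.
# str() converts the result to a plain Python string.
clean_text = str(root.text_content())

print(clean_text)  # TitleThis is a sample string.
```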


Remove HTML Tags From String in Python Using the BeautifulSoup Module

Like the lxml module, the BeautifulSoup module also provides us with various functions to process HTML data. To remove HTML tags from a string using the BeautifulSoup module, we can use the BeautifulSoup() constructor and the get_text() method.

In this approach, we will first parse the string that contains HTML tags using the BeautifulSoup() constructor. The constructor takes the original string as its first input argument and the name of the parser to use, such as "html.parser", as its second input argument. The second argument is optional, but BeautifulSoup issues a warning when it is left out. After execution, the constructor returns a BeautifulSoup object that represents the parsed document. We can invoke the get_text() method on this object to get the output string.

The following program demonstrates how to remove HTML tags from a string in Python using the BeautifulSoup module.
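A minimal sketch of such a program is given here. It assumes that the third-party beautifulsoup4 package is installed and uses "html.parser", the parser bundled with Python, so that no additional parser library is needed. The sample string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample string for illustration.
html_string = "<div><h1>Title</h1><p>This is a <b>sample</b> string.</p></div>"

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html_string, "html.parser")

# get_text() returns the text of the document without the tags.
clean_text = soup.get_text()

print(clean_text)  # TitleThis is a sample string.
```

The get_text() method also accepts a separator argument, for example soup.get_text(" "), which can be useful when the text of adjacent elements should not run together.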


Conclusion

In this article, we have discussed different ways to remove HTML tags from a string in Python. The approaches with the lxml module and the BeautifulSoup module parse the HTML string and then extract its text, whereas the approach using regular expressions only deletes the substrings that look like tags. For simple, well-formed input the outputs are the same, but they can differ in other cases: for example, the parsers decode character references such as &amp;, while the regular expression leaves them untouched. You can choose an approach according to your input and your convenience.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!
