Remove URLs from Text in Python

In this post, we will see how to remove URLs from text in Python.

Introduction

Python can read and process text data, and its libraries support many operations on such text. One common cleaning step is removing URLs, which is what this tutorial covers.

A URL is the address of a resource on the internet. Each URL is unique to its resource, but all URLs follow the same general structure. The URLs differ from one text to the next, and a given text may contain more than one, so we first identify each URL by its format and then remove it.

For this, we can use regular expressions (regex). A regular expression is a pattern that describes a set of strings, so it can be used to locate matching substrings inside a larger string. Since URLs share the same structure, we can write a regex pattern that identifies them in a string.

We have to use the re module to work with regular expressions in Python.
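As a small illustration of matching a URL by its structure, here is a minimal sketch. The pattern, the sample sentence, and the names URL_PATTERN, sample, and found are assumptions chosen for this example; the pattern only covers http and https links and is not a complete URL grammar.

```python
import re

# Hypothetical pattern: "http" or "https", then "://", then a run of
# non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

# Assumed sample sentence with a placeholder URL.
sample = "Docs live at https://www.example.com/docs today."

# search() returns a match object for the first match, or None.
match = URL_PATTERN.search(sample)
found = match.group(0) if match else None
print(found)  # https://www.example.com/docs
```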

Ways to remove URLs from Text in Python

This tutorial demonstrates several functions from the re module, plus one approach based on urllib.parse, that can be used to remove URLs from text in Python.

Using the re.sub() function to remove URLs from Text in Python

The re.sub() function provides the most straightforward approach to remove URLs from text in Python.

This function replaces every substring that matches a regex pattern with a replacement string. If the replacement is an empty string, the matched substrings are removed.

Many different patterns can be used with this function to remove URLs. Our example demonstrates three possible regex patterns that identify the URL in the sample text.

See the code below.
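A minimal sketch of this approach follows. The sample sentence, the placeholder URL, and the three patterns are assumptions picked for illustration. Each pattern ends with \s* so that the whitespace after the URL is removed together with it.

```python
import re

# Assumed sample text; the URL is a placeholder.
text = "This is a text with a URL https://www.example.com/ to remove."

# Three illustrative patterns. The trailing \s* also removes the space
# that follows the URL, so no double space is left behind.
patterns = [
    r"https?://\S+\s*",            # http:// or https:// plus non-whitespace
    r"http\S+\s*",                 # looser: any word starting with "http"
    r"(?:https?://|www\.)\S+\s*",  # also matches links that start with www.
]

# re.sub() replaces every match with the empty string.
results = [re.sub(pattern, "", text) for pattern in patterns]
for result in results:
    print(result)
```

Note that \S+ stops only at whitespace, so punctuation placed directly after a URL (for example a full stop at the end of a sentence) would be removed along with it.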

Output:

This is a text with a URL to remove.
This is a text with a URL to remove.
This is a text with a URL to remove.

In the above example, we used three patterns to detect and remove URLs from text in Python. All three work for our sample text, so you can choose whichever pattern suits the URLs in your data. We will use only one pattern in the following examples.

Using the re.findall() function to remove URLs from Text in Python

The re.findall() function finds all non-overlapping matches of a regex pattern in a given string and returns them as a list of strings (when the pattern contains no capture groups).

We can use this function to find the URLs in a given string and then remove them with the string replace() method, replacing each URL that was found with an empty string.

See the code below.
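A minimal sketch of this approach follows, assuming the same kind of sample sentence with a placeholder URL and a simple http/https pattern. The final whitespace clean-up step is an addition for tidy output and not part of the technique itself.

```python
import re

# Assumed sample text; the URL is a placeholder.
text = "This is a text with a URL https://www.example.com/ to remove."

# The pattern has no capture groups, so findall() returns the full matches.
urls = re.findall(r"https?://\S+", text)

# Replace every URL that was found with an empty string.
cleaned = text
for url in urls:
    cleaned = cleaned.replace(url, "")

# Collapse the double space left where the URL used to be.
cleaned = " ".join(cleaned.split())
print(cleaned)
```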

Output:

This is a text with a URL to remove.

Using the re.search() function to remove URLs from Text in Python

We can also use the re.match() and re.search() functions to find a substring based on a regex pattern. However, both functions return only the first match. So, if a string contains more than one URL, only the first one will be found and the rest will be left in the text.

Another downside of the re.match() function is that it only matches at the very beginning of the string, so it finds a URL only when the text starts with one. The re.search() function scans the whole string, so if we have a string with only one URL, we can use re.search().

The matched substring is returned in a match object.

See the code below.
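A minimal sketch of this approach follows, assuming a sample sentence with a single placeholder URL and a simple http/https pattern. It also calls re.match() with the same pattern to show that it finds nothing when the URL is not at the start of the string.

```python
import re

# Assumed sample text containing exactly one placeholder URL.
text = "This is a text with a URL https://www.example.com/ to remove."
pattern = r"https?://\S+"

# re.match() only tries the start of the string, so it finds nothing here.
at_start = re.match(pattern, text)

# re.search() scans the whole string and returns the first match.
match = re.search(pattern, text)

if match:
    # group(0) is the matched substring; remove that one occurrence.
    cleaned = text.replace(match.group(0), "", 1)
else:
    cleaned = text

# Collapse the double space left where the URL used to be.
cleaned = " ".join(cleaned.split())
print(cleaned)
```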

Output:

This is a text with a URL to remove.

Using the urllib.parse.urlparse() function to remove URLs from Text in Python

In Python, we can send requests to a given address using modules like urllib, requests, and more. With the urlparse() function from the urllib.parse module, we can parse URLs and break them into components.

The urlparse() function parses a URL string and returns a result object. We can use the scheme attribute of this object (for example, http or https) to check whether a string has the structure of a URL or not.

To remove URLs from text in Python with this method, we will first break the text into a list of strings. This can be achieved using the split() method, which splits a string on whitespace by default.

We will then use the scheme attribute to check whether each string in the list looks like a URL. If a string has a non-empty scheme, we treat it as a URL and leave it out. Finally, we will combine the remaining elements of the list using the join() method.

See this logic implemented below.
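A minimal sketch of this logic follows, assuming a sample sentence with a placeholder URL. The helper name looks_like_url is hypothetical and introduced only for this example.

```python
from urllib.parse import urlparse

# Assumed sample text; the URL is a placeholder.
text = "This is a text with a URL https://www.example.com/ to remove."


def looks_like_url(word):
    """Return True if the word parses with a non-empty scheme."""
    try:
        return bool(urlparse(word).scheme)
    except ValueError:
        # urlparse() can raise ValueError on malformed input.
        return False


# Split on whitespace, drop the words that look like URLs, and rejoin.
words = text.split()
kept = [word for word in words if not looks_like_url(word)]
cleaned = " ".join(kept)
print(cleaned)
```

A bare scheme check is permissive: an ordinary word that ends in a colon can also parse with a scheme. A stricter variant could require the scheme to be http or https.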

Output:

This is a text with a URL to remove.

This is the only method that does not use any regex.

Conclusion

To conclude, we discussed several methods to remove URLs from text in Python. Most of the methods use regular expressions to detect each URL and replace it with an empty string. The final method uses the urlparse() function from the urllib.parse module and does not rely on regex.
