In this post, we will see how to remove URLs from text in Python.
Introduction
In Python, we can read and process text data. We can perform various operations on such texts using different libraries. In this tutorial, we will learn how to remove URLs from text in Python.
A URL is the address of a resource on the internet. Each URL identifies a particular resource, but all URLs follow the same general structure. The URLs in a text are not known in advance, and a given text may contain several of them, so we first need to identify each URL by its format and then remove it.
For this, we can use regular expressions (regex). A regex is a pattern that describes a set of strings and can be used to find matching substrings within a string. Since URLs share the same structure, we can write a regex pattern that identifies them in a string.
We have to use the re module to work with regular expressions in Python.
Ways to remove URLs from Text in Python
This tutorial will demonstrate different functions from the re module, along with one approach that does not use regex, to remove URLs from text in Python.
Using the re.sub() function to remove URLs from Text in Python
The re.sub() function provides the most straightforward approach to remove URLs from text in Python.
This function substitutes a given substring with another substring in any provided string. It uses a regex pattern to find the substring and then replaces every match with the provided substring.
To remove URLs from text in Python we can use this function with many patterns. We will demonstrate different possible regex patterns that can identify the URLs in our example.
See the code below.
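    import re

    t = "This is a text with a URL https://www.java2blog.com/ to remove."

    # Raw strings keep the backslash in \S from being read as a string escape
    s1 = re.sub(r'http://\S+|https://\S+', '', t)  # either scheme, spelled out
    s2 = re.sub(r'http[s]?://\S+', '', t)          # optional "s" after http
    s3 = re.sub(r"http\S+", "", t)                 # anything starting with http
    print(s1)
    print(s2)
    print(s3)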
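Output:
This is a text with a URL  to remove.
This is a text with a URL  to remove.
This is a text with a URL  to remove.
Each call removes only the URL itself, so the spaces on both sides of it remain and the result contains two consecutive spaces.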
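In the above example, we used three patterns to detect and remove URLs from text in Python. All three work for our example, but they are not equivalent: the first two require the http:// or https:// prefix, while the third matches any run of non-whitespace characters that starts with http. Choose the pattern that fits your data. We will use only one pattern in the following examples.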
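The difference between the patterns, and the leftover spaces, can be seen in a minimal sketch. The helper name remove_urls and the sample sentence are made up for illustration and are not part of the re module; collapsing whitespace is one possible cleanup, not the only one.

```python
import re

# Stricter pattern: requires "://" after http or https.
URL_PATTERN = re.compile(r"https?://\S+")

def remove_urls(text):
    """Remove http(s) URLs and collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub("", text)
    # Replace every run of whitespace with one space, then trim the ends.
    return re.sub(r"\s+", " ", without_urls).strip()

sample = "Restart httpd, then open https://www.java2blog.com/ again."
print(remove_urls(sample))

# The looser pattern also deletes the ordinary word "httpd,"
print(re.sub(r"http\S+", "", sample))
```

Note that \S+ runs up to the next whitespace character, so punctuation that directly follows a URL is removed along with it.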
Using the re.findall() function to remove URLs from Text in Python
The re.findall() function finds all non-overlapping matches of a regex pattern in a given string and returns them as a list.
We can use this function to find the URLs in a given string and then remove them using the string's replace() method. With replace(), we replace each occurrence of a given URL with an empty string.
See the code below.
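    import re

    t = "This is a text with a URL https://www.java2blog.com/ to remove."

    # Collect every URL in the text
    lst = re.findall(r'http://\S+|https://\S+', t)
    # Start from the original text and strip one URL per iteration,
    # so that earlier removals are not lost
    s1 = t
    for i in lst:
        s1 = s1.replace(i, '')
    print(s1)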
Output:
This is a text with a URL  to remove.
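When a text holds more than one URL, each replacement has to build on the result of the previous one. A minimal sketch, assuming a made-up sentence in which example.com and example.org are placeholder addresses:

```python
import re

text = "Docs at https://example.com/docs and mirror at http://example.org/m here."

# findall() returns every match as a list of strings.
urls = re.findall(r"https?://\S+", text)
print(urls)

cleaned = text
for url in urls:
    # Reassign cleaned each time so earlier removals are kept.
    cleaned = cleaned.replace(url, "")
print(cleaned)
```

Because replace() works on plain substrings, a URL that is a prefix of another URL in the same text can leave a fragment of the longer one behind; re.sub() with the same pattern avoids this.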
Using the re.search() function to remove URLs from Text in Python
We can also use the re.match() and re.search() functions to find a substring based on a regex pattern. However, both functions return only the first match. So, if a string contains more than one distinct URL, this approach will leave the others in place.
Another limitation of the re.match() function is that it only matches at the very beginning of the string, so it finds a URL only if the text starts with one. For a string that contains a single URL anywhere in it, we can use the re.search() function, which scans the whole string.
The match is returned as a match object, and its group(0) method gives the matched substring. If nothing matches, re.search() returns None.
See the code below.
    import re

    t = "This is a text with a URL https://www.java2blog.com/ to remove."

    m = re.search(r'http://\S+|https://\S+', t)
    if m:  # re.search() returns None when there is no URL
        i = m.group(0)  # the matched URL
        s1 = t.replace(i, '')
    else:
        s1 = t
    print(s1)
Output:
This is a text with a URL  to remove.
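How re.match() and re.search() differ on the same text, and how a missing match can be handled, is sketched below. The helper name remove_first_url is hypothetical, and slicing by match position is just one way to cut the URL out.

```python
import re

pattern = r"https?://\S+"
text = "This is a text with a URL https://www.java2blog.com/ to remove."

# re.match() only tries the pattern at position 0, so it finds nothing here.
print(re.match(pattern, text))

def remove_first_url(s):
    """Remove the first URL found by re.search(), if any."""
    m = re.search(pattern, s)
    if m is None:
        return s  # no URL: return the text unchanged
    # Cut the matched span out using its start and end positions.
    return s[:m.start()] + s[m.end():]

print(remove_first_url(text))
print(remove_first_url("No links here."))
```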
Using the urllib.parse.urlparse() function to remove URLs from Text in Python
In Python, we can send requests to a given address using modules like urllib, requests, and more. With the urlparse() function from the urllib.parse module, we can parse URLs and break them into components.
The urlparse() function parses a URL string and returns a result object. We can use the scheme attribute of this object to check whether a string looks like a URL: it holds the scheme (such as https), or an empty string if the string has none.
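As a quick illustration of these components, the sketch below parses the URL from our example and, for comparison, an ordinary word from the same sentence:

```python
from urllib.parse import urlparse

parts = urlparse("https://www.java2blog.com/")
print(parts.scheme)   # https
print(parts.netloc)   # www.java2blog.com
print(parts.path)     # /

# A plain word has no scheme, so the attribute is an empty string.
print(repr(urlparse("remove.").scheme))
```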
To remove URLs from text in Python with this method, we will first break the text into a list of strings. This can be achieved using the split() method, which splits a string into a list of substrings on a separator (whitespace by default).
We will then use the scheme attribute to check whether each string in the list looks like a URL. If the scheme is non-empty, we will skip that string. Finally, we will combine the remaining elements of the list using the join() method.
See this logic implemented below.
    from urllib.parse import urlparse

    t = "This is a text with a URL https://www.java2blog.com/ to remove."

    # Keep only the words for which urlparse() finds no scheme
    lst = [word for word in t.split() if not urlparse(word).scheme]
    s = ' '.join(lst)
    print(s)
Output:
This is a text with a URL to remove.
This is the only method that does not use any regex.
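A non-empty scheme alone is a loose test: a token such as mailto:someone@example.com also has a scheme and would be dropped. One possible tightening, sketched here with a hypothetical helper is_web_url and a made-up sentence, is to accept only the http and https schemes and to require a network location as well.

```python
from urllib.parse import urlparse

def is_web_url(token):
    """Return True for tokens that parse as http(s) URLs with a host part."""
    try:
        parts = urlparse(token)
    except ValueError:
        # urlparse() can raise ValueError on malformed input such as "http://[".
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)

text = "Mail mailto:someone@example.com or visit https://www.java2blog.com/ today."
kept = [word for word in text.split() if not is_web_url(word)]
print(" ".join(kept))
```

Splitting and rejoining on whitespace also means this approach never leaves a double space where the URL used to be.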
Conclusion
To conclude, we discussed several methods to remove URLs from text in Python. Most of the methods use regular expressions to find each URL in a string and replace it with an empty string. The final method uses the urlparse() function from the urllib.parse module, which does not rely on regex.