Remove Non-alphanumeric Characters in Python

1. Overview

Non-alphanumeric characters are those that are not letters or numbers, such as punctuation marks, symbols, and spaces. Sometimes we want to remove these characters from a string in Python, for example, to clean user input or to extract data.

2. Introduction to Problem

Let’s say we have a string:

As we can see, the string text contains digits, letters, and various punctuation marks and symbols.

Our goal is to remove the non-alphanumeric characters from the string and keep only the letters and digits in the result:
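The sample string from the original listing is not reproduced here, so the snippet below uses a hypothetical stand-in value for text and spells out the result we are aiming for. Both values are assumptions chosen only for illustration.

```python
# Hypothetical sample input; the exact string is an assumption for illustration.
text = "Hello, World! 123 #python @2024"

# The result we want: only the letters and digits, in their original order.
expected = "HelloWorld123python2024"

print(text)
print(expected)
```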

In this article, we will explore some ways to remove non-alphanumeric characters from a string in Python using regular expressions and built-in string methods.

3. Using Regular Expression

We can use the re.sub() function from the re module to remove non-alphanumeric characters from a string in Python.
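A minimal sketch of this approach, assuming a hypothetical sample value for text (the original listing is not available), might look like this:

```python
import re

# Hypothetical sample input; the exact string is an assumption for illustration.
text = "Hello, World! 123 #python @2024"

# Replace every non-word character (\W) with an empty string.
cleaned = re.sub(r"\W", "", text)

print(cleaned)  # HelloWorld123python2024
```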

Here, we used re.sub(), a function of the re module we imported earlier.

The re.sub() function accepts the following parameters:

  • pattern – The regular expression we want to match. It can also be a compiled Pattern object.
  • repl – The replacement string.
  • string – The input string.
  • count – The maximum number of matches to replace. The default, 0, replaces all matches.
  • flags – One or more regex flags that modify the pattern's standard behaviour.
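To illustrate the two optional parameters, here is a small sketch with made-up inputs. The strings are hypothetical; count and flags are passed as keyword arguments.

```python
import re

# count limits how many matches are replaced (hypothetical input).
limited = re.sub(r"\W", "", "a-b-c-d", count=2)
print(limited)  # abc-d

# By default, \W treats Unicode letters such as "é" (U+00E9) as word characters.
unicode_aware = re.sub(r"\W", "", "caf\u00e9!")
print(unicode_aware)  # café

# With re.ASCII, only [a-zA-Z0-9_] count as word characters, so "é" is removed too.
ascii_only = re.sub(r"\W", "", "caf\u00e9!", flags=re.ASCII)
print(ascii_only)  # caf
```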

We used only the first three parameters to replace all occurrences of non-alphanumeric characters.

Which regular expression and which replacement did we use? The pattern is \W, which matches non-word characters such as $, &, and spaces. Each match is replaced with an empty string, and string is the input text.

\w (lowercase w) matches word characters: letters, digits, and the underscore. For ASCII text this is equivalent to [a-zA-Z0-9_]. \W (uppercase W) matches everything else. Two consequences follow. Underscores are kept, because they count as word characters. And for str patterns in Python 3, \w also matches Unicode letters and digits unless the re.ASCII flag is set.
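The difference between \W and a strict alphanumeric pattern can be seen in a short comparison. The input string and the two alternative patterns below are illustrative assumptions, not taken from the original article.

```python
import re

# Hypothetical input containing an underscore and a non-ASCII letter (é, U+00E9).
sample = "snake_case & caf\u00e9 42"

# \W keeps the underscore and the Unicode letter.
print(re.sub(r"\W", "", sample))            # snake_casecafé42

# [\W_] additionally removes the underscore.
print(re.sub(r"[\W_]", "", sample))         # snakecasecafé42

# An explicit ASCII class removes everything except a-z, A-Z and 0-9.
print(re.sub(r"[^a-zA-Z0-9]", "", sample))  # snakecasecaf42
```

Which pattern fits depends on whether underscores and non-ASCII letters should survive the cleanup.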

4. Using isalnum() Method

To remove non-alphanumeric characters in Python:

  • Use the built-in filter() function with str.isalnum to keep only the alphanumeric characters of the string.
  • Use the join() method to join the characters returned by the previous step into a new string.
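The two steps above can be sketched as follows, again assuming a hypothetical sample value for text:

```python
# Hypothetical sample input; the exact string is an assumption for illustration.
text = "Hello, World! 123 #python @2024"

# filter() keeps the characters for which str.isalnum returns True;
# "".join() glues them back together without a separator.
cleaned = "".join(filter(str.isalnum, text))

print(cleaned)  # HelloWorld123python2024
```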

Here, we used the same input string. Let’s walk through the code step by step. First, we called the filter() function, which took two arguments: the str.isalnum function and the string, which is iterable.

The filter() function took one character at a time from the string, called str.isalnum on it, and kept the character only if the call returned True.

What does str.isalnum do? It returns True if every character in the string is alphanumeric and the string is not empty. Because filter() passes one character at a time, it effectively tests whether each individual character is a letter or a digit.
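A few quick checks, using made-up inputs, show how str.isalnum behaves on whole strings and on single characters:

```python
# Hypothetical inputs chosen to show the edge cases of str.isalnum.
checks = {value: value.isalnum() for value in ("abc123", "abc 123", "", "_", "7")}

for value, result in checks.items():
    print(repr(value), result)
# 'abc123' True
# 'abc 123' False   (the space is not alphanumeric)
# '' False          (an empty string is never alphanumeric)
# '_' False
# '7' True
```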

The filter() function returned an iterator over the matching characters, which was then passed to the .join() method as an argument. The .join() method joined all the characters without any separator because we called it on an empty string.

The following solution is an alternative to the code above. It uses a for loop to visit each character of the string and check whether it is alphanumeric.
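A loop-based version might look like this. The variable names and the sample string are assumptions for illustration.

```python
# Hypothetical sample input; the exact string is an assumption for illustration.
text = "Hello, World! 123 #python @2024"

kept = []
for ch in text:
    # Keep the character only if it is a letter or a digit.
    if ch.isalnum():
        kept.append(ch)

cleaned = "".join(kept)
print(cleaned)  # HelloWorld123python2024
```

Collecting characters in a list and joining once at the end avoids building many intermediate strings.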

5. Using str.translate() Method

To remove non-alphanumeric characters in Python:

  • Use the str.maketrans() method to build a mapping table.
  • Use the str.translate() method to apply the mapping table created in the previous step to the string.
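The two steps above can be sketched as follows. The original listing is not available, so the sample string and the set of characters to delete (ASCII punctuation plus whitespace) are assumptions.

```python
import string

# Hypothetical sample input; the exact string is an assumption for illustration.
my_string = "Hello, World! 123 #python @2024"

# Assumption: the characters to delete are ASCII punctuation and whitespace.
table = str.maketrans("", "", string.punctuation + string.whitespace)

cleaned = my_string.translate(table)
print(cleaned)  # HelloWorld123python2024
```

Unlike the previous approaches, this one removes only the characters listed explicitly, so any other symbol not in the delete set is left in place.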

First, we imported the string module, which provides constants and helpers for working with standard Python strings. Next, we used str.maketrans() to build a mapping table (also called a translation table), which was then passed to the my_string.translate() method as an argument.

The str.maketrans() function accepts up to three arguments: a string of characters to replace, a string of the same length with their replacements, and a string of characters to delete. In this case, we pass empty strings for the first two arguments and, for the third, a string containing the characters we want to remove (for example, string.punctuation plus whitespace). This creates a translation table that maps each of those characters to None, which means they will be deleted from the input string. Note that the third argument must list the unwanted characters, not the alphanumeric ones; passing the alphanumeric characters would delete exactly the characters we want to keep. Then, we use the str.translate() method to apply the translation table to the input string.
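To make the mapping concrete, this short sketch prints a small translation table. The characters chosen are arbitrary examples.

```python
# Build a table that deletes "!" and "?" (arbitrary example characters).
table = str.maketrans("", "", "!?")

# Keys are Unicode code points; None marks a character for deletion.
print(table)  # {33: None, 63: None}

print("Really?! Yes!".translate(table))  # Really Yes
```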

6. Performance

Let’s compare the performance of the methods above. We will create a separate function for each method and run it 1000 times on a large string.
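A benchmark along these lines could be set up with the timeit module. This is a sketch under assumptions: the function names, the input string, and the delete set used for translate() are hypothetical, and no timings are shown because they depend on the machine and Python version.

```python
import re
import string
import timeit

# Hypothetical "big" input: a short sample repeated many times.
big_string = "Hello, World! 123 #python @2024 " * 1000

# Assumption: translate() deletes ASCII punctuation and whitespace.
DELETE_TABLE = str.maketrans("", "", string.punctuation + string.whitespace)


def using_regex(s):
    return re.sub(r"\W", "", s)


def using_isalnum(s):
    return "".join(filter(str.isalnum, s))


def using_translate(s):
    return s.translate(DELETE_TABLE)


def run_benchmark(number=1000):
    """Return the total seconds each function needs for `number` calls."""
    results = {}
    for func in (using_regex, using_isalnum, using_translate):
        results[func.__name__] = timeit.timeit(lambda: func(big_string), number=number)
    return results


if __name__ == "__main__":
    for name, seconds in run_benchmark().items():
        print(f"{name}: {seconds:.3f} s")
```

The three functions agree on this input, but they are not equivalent in general: the regex version keeps underscores, and the translate() version removes only the characters in its delete set.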

In our runs, the translate() method was the fastest, followed by the isalnum() method, and the re.sub() method was the slowest. However, the difference was not very significant, and timings vary with the machine and the input, so the choice of method may depend on other factors, such as readability, compatibility, or special cases.

7. Conclusion

In this article, we explored three different methods to remove non-alphanumeric characters in Python: using regular expressions, using the isalnum() method, and using the str.translate() method. We discussed each method and showed how to apply it.
