1. Overview
Non-alphanumeric characters are characters that are neither letters nor digits, such as punctuation marks, symbols, and whitespace. We often need to remove these characters from a string in Python, for example to clean user input or to extract data.
2. Introduction to Problem
Let’s say we have a string:
    text = "456-abc78!@#$%9AB,C"
As we can see, the string text contains digits, letters, and various punctuation marks and symbols.
Our goal is to remove the non-alphanumeric characters from the string and keep only the letters and digits in the result:
    456abc789ABC
In this article, we will explore several ways to remove non-alphanumeric characters from a string in Python, using regular expressions and built-in string methods.
3. Using Regular Expression
Use the re.sub() function with a regular expression to remove non-alphanumeric characters from a string in Python.
    import re

    text = "456abc789ABC-{}[]:*(!@#$%-=\"<>?/~,^&*().`"
    extracted_string = re.sub(r'\W+', '', text)
    print(extracted_string)
    456abc789ABC
Here, we used re.sub(), a function of the re module we imported earlier.
The re.sub() function takes the following parameters:
- pattern – The regular expression we want to match. It can also be a Pattern object.
- repl – The replacement string.
- string – The input string.
- count – The maximum number of matches to replace with re.sub().
- flags – One or more regex flags that modify the pattern's standard behaviour.
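As a brief illustration of the optional parameters, the sketch below passes count and flags as keyword arguments and also uses a precompiled Pattern object. The sample strings and variable names are made up for this example and do not come from the article.

```python
import re

# count limits how many matches are replaced (here: only the first two hyphens).
limited = re.sub(r"-", "", "a-b-c-d", count=2)
print(limited)  # abc-d

# flags changes how the pattern is interpreted. With re.ASCII, \W treats
# accented letters as non-word characters, so they are removed as well.
unicode_aware = re.sub(r"\W", "", "naïve café!")
ascii_only = re.sub(r"\W", "", "naïve café!", flags=re.ASCII)
print(unicode_aware)  # naïvecafé
print(ascii_only)     # navecaf

# pattern may also be a compiled Pattern object, which is handy for reuse.
non_word = re.compile(r"\W+")
compiled_result = re.sub(non_word, "", "456-abc!")
print(compiled_result)  # 456abc
```

Passing count and flags by keyword keeps the call readable and avoids mixing up their positions.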
In our example, we passed only the first three parameters to re.sub() to replace all occurrences of non-alphanumeric characters.
Which regular expression did we use, and what was the replacement? We used \W+ as the pattern, which matches runs of non-word characters such as $ and &. Every match was replaced with an empty string, and text was the input string.
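\w (lowercase w) matches word characters, that is, letters, digits, and the underscore ([a-zA-Z0-9_] for ASCII input), while \W (uppercase W) matches non-word characters, which means anything other than [a-zA-Z0-9_]. Note that the underscore counts as a word character, so \W does not remove it; the example string contains no underscore, which is why its output is purely alphanumeric. For str patterns in Python 3, \w also matches Unicode letters and digits unless the re.ASCII flag is set.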
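The difference between \W and a strictly alphanumeric filter can be seen in a small sketch. The input string and variable names here are illustrative assumptions, not part of the original example.

```python
import re

text = "user_name-42!"

# \W leaves the underscore in place because "_" is a word character.
with_underscore = re.sub(r"\W+", "", text)
print(with_underscore)  # user_name42

# An explicit negated character class removes everything except
# ASCII letters and digits, including the underscore.
strictly_alnum = re.sub(r"[^a-zA-Z0-9]+", "", text)
print(strictly_alnum)  # username42
```

If underscores must also go, the negated class [^a-zA-Z0-9] is one straightforward option.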
4. Using isalnum() Method
To remove non-alphanumeric characters in Python:
- Use the filter() function with str.isalnum to keep only the alphanumeric characters of the string.
- Use the join() method to join the characters kept in the previous step.
    text = "456abc789ABC-{}[]:*(!@#$%-=\"<>?/~,^&*().`"
    extracted_string = ''.join(filter(str.isalnum, text))
    print(extracted_string)
    456abc789ABC
Here, we used the same input string. Let's go through the code step by step. First, we called the filter() function with two arguments: the str.isalnum function and the string text, which is an iterable of characters.
The filter() function took one character at a time from the string, applied the str.isalnum function that we passed as the first argument, and kept or dropped the character based on the result. If str.isalnum returns True, filter() keeps the current character; otherwise, it skips it.
What does str.isalnum do? It returns True if the string is not empty and every character in it is alphanumeric. Because filter() calls it on one character at a time, it tells us whether that character is a letter or a digit.
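To make the behaviour of str.isalnum concrete, here is a minimal sketch that checks a few single characters. It also shows that isalnum() accepts non-ASCII letters; keep_ascii_alnum is a hypothetical helper name introduced only for this illustration.

```python
samples = ["a", "7", "é", "_", " ", "!"]
checks = [ch.isalnum() for ch in samples]
for ch, is_alnum in zip(samples, checks):
    print(repr(ch), is_alnum)

# isalnum() is Unicode-aware, so "é" counts as alphanumeric.
# If only ASCII letters and digits should survive, combine it with isascii()
# (available since Python 3.7).
def keep_ascii_alnum(s: str) -> str:
    return "".join(c for c in s if c.isascii() and c.isalnum())

print("".join(filter(str.isalnum, "café_99!")))  # café99
print(keep_ascii_alnum("café_99!"))              # caf99
```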
The filter() function returned an iterator over the kept characters, which was then passed to the join() method as an argument. The join() method joined all the characters without any separator because we called it on an empty string.
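The two steps can also be run separately to see the intermediate values. This is only a sketch; it reuses the string from the problem statement, and the variable names are chosen for illustration.

```python
text = "456-abc78!@#$%9AB,C"

# Step 1: filter() returns a lazy iterator; nothing is evaluated yet.
kept = filter(str.isalnum, text)

# Materialising the iterator shows the characters that passed the test.
kept_chars = list(kept)
print(kept_chars)

# Step 2: join the kept characters with an empty separator.
result = "".join(kept_chars)
print(result)  # 456abc789ABC

# The iterator is now exhausted, so consuming it again yields nothing.
leftover = list(kept)
print(leftover)  # []
```

In practice the one-line form is enough, because join() consumes the iterator directly.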
The following solution is an alternative to the above code: it uses a generator expression with a for clause to go through all the characters of the string and keep those that are alphanumeric.
    string = "456abc789ABC-{}[]:*(!@#$%-=\"<>?/~,^&*().`"
    extracted_string = ''.join(c for c in string if c.isalnum())
    print(extracted_string)
    456abc789ABC
5. Using str.translate() Method
To remove non-alphanumeric characters in Python:
- Use the str.maketrans() method to get a mapping table.
- Use the str.translate() method to delete the characters listed in the mapping table we created in the previous step.
    import string

    my_string = "456abc789ABC-{}[]:*(!@#$%-=\"<>?/~,^&*().`"
    mapping_table = str.maketrans('', '', string.punctuation)
    extracted_string = my_string.translate(mapping_table)
    print(extracted_string)
    456abc789ABC
First, we imported the string module, which provides constants such as string.punctuation. Next, we used str.maketrans() to build a mapping table (also referred to as a translation table), which was then passed to the my_string.translate() method as an argument.
The str.maketrans() function takes up to three arguments: a string of characters to map, a string of characters to map to, and a string of characters to delete. In this case, we use empty strings for the first two arguments and string.punctuation, which contains all ASCII punctuation characters, for the third argument. This creates a translation table that maps every punctuation character to None, which means those characters will be deleted from the input string. Then, we use the str.translate() method to apply the translation table to the input string. Note that this approach removes only the characters listed in string.punctuation; whitespace and non-ASCII symbols are left in place.
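The sketch below looks inside a small translation table and then extends the deletion set with string.whitespace so that spaces, tabs, and newlines are removed too. The sample inputs and variable names are assumptions made for illustration.

```python
import string

# A translation table is a dict that maps code points to replacements;
# characters to delete are mapped to None.
small_table = str.maketrans("", "", "!?")
print(small_table)

# Adding string.whitespace to the deletion set also strips spaces,
# tabs, and newlines.
delete_chars = string.punctuation + string.whitespace
clean_table = str.maketrans("", "", delete_chars)

cleaned = "456 abc\t789-ABC!\n".translate(clean_table)
print(cleaned)  # 456abc789ABC

# Characters outside the deletion set, such as curly quotes, are kept.
untouched = "“hi”".translate(clean_table)
print(untouched)  # “hi”
```

Because the table lists characters to delete rather than characters to keep, any symbol that is not in the deletion set passes through unchanged.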
6. Performance
Let's compare the performance of the methods above. We will create a separate function for each method and run it 1000 times on a test string.
    import re
    import string
    import timeit

    def remove_non_alnum_re(s):
        return re.sub(r'\W+', '', s)

    def remove_non_alnum_isalnum(s):
        return ''.join(c for c in s if c.isalnum())

    def remove_non_alnum_translate(s):
        table = str.maketrans('', '', string.punctuation)
        return s.translate(table).replace(' ', '')

    # Raw string so that the backslash in "\/" is kept literally
    big_string = r'This is a big string with some non-alphanumeric characters!@#$%^&*()_+{}[]|\/<>?`~'

    # Run each function 1000 times and print the total time in seconds
    print('re.sub:', timeit.timeit(lambda: remove_non_alnum_re(big_string), number=1000))
    print('isalnum:', timeit.timeit(lambda: remove_non_alnum_isalnum(big_string), number=1000))
    print('translate:', timeit.timeit(lambda: remove_non_alnum_translate(big_string), number=1000))
    re.sub: 0.002696799999999999
    isalnum: 0.001636900000000001
    translate: 0.0010567000000000007
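In this run, the translate() method was the fastest, followed by the isalnum() method, and the re.sub() method was the slowest. The values are total seconds for 1000 calls, so each call takes only a few microseconds on this short string and the differences are small. Results will vary with the machine, the Python version, and the input. Note also that the three functions are not strictly equivalent: \W+ keeps the underscore, and the translate() version removes only ASCII punctuation and spaces. The choice of method may therefore depend on other factors, such as readability, compatibility, or special cases.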
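For a somewhat fairer comparison, one option is to make the three functions produce identical output, build the regex and the translation table once, and time a longer input. The following is a sketch under those assumptions; the function names, the repetition factor, and the run counts are arbitrary choices, and no timings are claimed here.

```python
import re
import string
import timeit

# Built once, outside the timed functions.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")
DELETE_TABLE = str.maketrans("", "", string.punctuation + string.whitespace)

def with_regex(s):
    return NON_ALNUM.sub("", s)

def with_isalnum(s):
    return "".join(c for c in s if c.isalnum())

def with_translate(s):
    return s.translate(DELETE_TABLE)

# A longer ASCII-only input: the article's test string repeated 100 times.
sample = r"This is a big string with some non-alphanumeric characters!@#$%^&*()_+{}[]|\/<>?`~" * 100

# On this ASCII input all three functions should agree.
assert with_regex(sample) == with_isalnum(sample) == with_translate(sample)

runs = 100
for name, func in [("regex", with_regex),
                   ("isalnum", with_isalnum),
                   ("translate", with_translate)]:
    # Take the best of three repeats to reduce noise from other processes.
    best = min(timeit.repeat(lambda: func(sample), number=runs, repeat=3))
    print(f"{name}: {best / runs * 1e6:.1f} microseconds per call")
```

Taking the minimum of several repeats follows the usual timeit advice, since higher values mostly reflect interference from other processes.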
7. Conclusion
In this article, we explored three different methods to remove non-alphanumeric characters from a string in Python: using regular expressions, using the isalnum() method, and using the str.translate() method. We discussed each method and showed examples of how to apply it.