Table of Contents
1. Introduction
One of the common tasks that pandas users face is to remove all non-numeric characters from a column or a dataframe. Non-numeric characters are any characters that are not digits, such as letters, punctuation, symbols, or whitespace. Removing non-numeric characters can help to clean the data and prepare it for further processing or analysis. In this article, we will show you how to remove all non-numeric characters in pandas using various methods and functions.
2. Using Pandas Series.str.extract() Method
To remove all non-numeric characters from one column in pandas:
- After creating the data frame, use the
Series.str.extract()
method to extract numeric parts.
1 2 3 4 5 6 7 8 9 10 11 12 |
import pandas as pd df = pd.DataFrame({ 'A': ['Blue78', 'Red', 'Yellow45'], 'B': ['Triangle', 'Cylinder34', 'Circle'], 'C': ['Smalles235', 'Small523', 'Big657'] }) df['C'] = df['C'].str.extract(pat='(\d+)', expand=False) print(df) |
1 2 3 4 5 6 |
A B C 0 Blue78 Triangle 235 1 Red Cylinder34 523 2 Yellow45 Circle 657 |
In this code, we used the Series.str.extract()
method to extract numeric parts from one column in pandas, which indirectly means we removed all non-numeric characters from that column.
Here, the Series.str
was used to retrieve values of the Series and then apply the .extract()
method to it. Here, the series is a column of the data frame.
The .extract()
method retrieved capture groups in the regular expression pattern as columns in the data frame. For every string in the series, it extracted groups from the initial match of the regex pattern.
The first parameter of the .extract()
method is pat
, a regex pattern. It denoted the capturing groups; in our example, the capturing groups were numbers only. How? Let’s understand the regular expression below:
(
– Begin the capture group.\d
– It is the shorthand character class, used to match numbers only. It is the same as we use regex as[0-9]
.+
– It allows us to match one or multiple of the specified expression.)
– Stop a capture group.
The second parameter, expand
, was set to False
to get series as output and updated the original values by assigning them to df['C']
. If we set the value for this parameter to True
, we will get a data frame with one column per capture group as an output.
We have successfully removed all non-numeric characters from column C
of the data frame, but each value in column C
is still of string data type. In case we want to cast the type of the values to int
, then we can do it using astype()
as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
import pandas as pd df = pd.DataFrame({ 'A': ['Blue78', 'Red', 'Yellow45'], 'B': ['Triangle', 'Cylinder34', 'Circle'], 'C': ['Smalles235', 'Small523', 'Big657'] }) df['C'] = df['C'].str.extract(pat='(\d+)', expand=False) df['C'] = df['C'].astype(int) print(df) |
The astype()
is used to cast pandas’ object to the specified data type. This convenient method allows us to cast a particular data frame column from one data type to another. For example, the above code fence used astype()
to cast column C
datatype from str
to int
data type.
3. Using Series.str.replace() Method
In the previous approach, we extracted the numbers and assigned them back to the respective column of a data frame. Here, we will replace all the non-numeric characters rather than extracting all the numbers.
So, to remove all non-numeric characters from one column in pandas:
- After creating a data frame, use the
Series.str.replace()
method to replace non-numeric characters.
1 2 3 4 5 6 7 8 9 10 11 12 |
import pandas as pd df = pd.DataFrame({ 'A': ['Blue78', 'Red', 'Yellow45'], 'B': ['Triangle', 'Cylinder34', 'Circle'], 'C': ['Smalles235', 'Small523', 'Big657'] }) df['C'] = df['C'].str.replace('(\D+)', '') print(df) |
1 2 3 4 5 6 |
A B C 0 Blue78 Triangle 235 1 Red Cylinder34 523 2 Yellow45 Circle 657 |
We have already learned about Series.str
while using the Series.str.extract()
method, so let’s move ahead with the .replace()
method. It replaced every occurrence of the specified regex/pattern in series.
The .replace()
method took two parameters. First is the regular expression whose every occurrence was replaced with the second parameter’s value.
This regular expression is the same as we used in Series.str.extract()
except for one difference. Here, we used \D
, which is used to match all the characters that are not decimal digits. So it is the opposite of \d
and the same as we use regex as [^0-9]
.
Note that we can also write regex as \D+
instead of (\D+)
. Additionally, this code can be used with raw string as follows:
1 2 3 4 5 6 7 8 9 10 11 12 |
import pandas as pd df = pd.DataFrame({ 'A': ['Blue78', 'Red', 'Yellow45'], 'B': ['Triangle', 'Cylinder34', 'Circle'], 'C': ['Smalles235', 'Small523', 'Big657'] }) df['C'] = df['C'].str.replace(r'\D+', '') print(df) |
1 2 3 4 5 6 |
A B C 0 Blue78 Triangle 235 1 Red Cylinder34 523 2 Yellow45 Circle 657 |
The only difference was the r
prefix in the specified pattern. Here, r
means the string was to be treated as the raw string. Now, what is the raw string? It means it would ignore all escape codes. For instance, '\n'
will be treated as a newline character while r'\n'
as characters \
followed by n
.
4. Using re.sub() with apply() Method
To remove all non-numeric characters from one column in pandas:
- Define a function named
remove_non_numberics()
, which takes a string type value as an argument. - Inside the
remove_non_numberics()
function, use there.sub()
method to remove all non-numeric characters and return the final results. - Use the
.apply()
method to apply theremove_non_numberics()
function to each value of the series (a column of a data frame) in pandas.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
import pandas as pd import re df = pd.DataFrame({ 'A': ['Blue78', 'Red', 'Yellow45'], 'B': ['Triangle', 'Cylinder34', 'Circle'], 'C': ['Smalles235', 'Small523', 'Big657'] }) def remove_non_numberics(s): return re.sub('[^0-9]+', '', s) df['C'] = df['C'].apply(remove_non_numberics) print(df) |
1 2 3 4 5 6 |
A B C 0 Blue78 Triangle 235 1 Red Cylinder34 523 2 Yellow45 Circle 657 |
For this section, we created a function and named it remove_non_numberics()
; it took a string type value as an argument which is further passed to the re.sub()
method.
The re.sub()
method took three arguments. First was the pattern that needed to be replaced, second was the pattern/string with which it would be replaced, and third was the string on which re.sub()
was executed.
Finally, the remove_non_numberics()
function returned the string output of the re.sub()
method, where a replace string replaced all the matching occurrences of a given pattern.
Note that we executed remove_non_numberics()
for one string value till this point. To apply this function for every value of a column C
in a data frame, we used the.apply()
method by passing remove_non_numberics
as a parameter.
5. Using lambda Function
To remove all non-numeric characters from one column in pandas:
- Use the
lambda
function to have all the values of a data frame column as arguments. - Use a list comprehension to iterate over each argument character by character and use the
if
statement with theisdigit()
method to check whether the character is a digit. - If the character is a digit, join them using the
.join()
method. - Once we are done with the
lambda
function, use the.map()
method to map new values with the old values of columnC
in the data frame.
1 2 3 4 5 6 7 8 9 10 11 12 |
import pandas as pd df = pd.DataFrame({ 'A': ['Blue78', 'Red', 'Yellow45'], 'B': ['Triangle', 'Cylinder34', 'Circle'], 'C': ['Smalles235', 'Small523', 'Big657'] }) df['C'] = df['C'].map(lambda i: ''.join([x for x in i if x.isdigit()])) print(df) |
1 2 3 4 5 6 |
A B C 0 Blue78 Triangle 235 1 Red Cylinder34 523 2 Yellow45 Circle 657 |
This code used the lambda
function, also known as an anonymous function; it took all the values of column C
as arguments.
For each argument, we used list comprehension to iterate over the argument character by character. In addition, we used the if
condition with the isdigit()
method to assess if the current character is a digit.
If the character was a digit, we passed it to the .join()
method, which joined all the numbers. Finally, we used the .map()
method to map column C
‘s original values with new values, which are now numbers only.
NOTE: We have learned how to eliminate all non-numeric characters from a column in pandas. What if we want to remove non-numeric characters from the whole data frame?
1 2 3 4 5 6 7 8 9 10 11 12 |
import pandas as pd df = pd.DataFrame({ 'A': ['Blue78', 'Red', 'Yellow45'], 'B': ['Triangle', 'Cylinder34', 'Circle'], 'C': ['Smalles235', 'Small523', 'Big657'] }) df = df.replace(regex=[r'\D+'], value="") print(df) |
1 2 3 4 5 6 |
A B C 0 78 235 1 34 523 2 45 657 |
To remove from columns A
and B
only, replace the following:
df = df.replace(regex=[r'\D+'], value="")
With
df[['A', 'B']] = df[['A', 'B']].replace(regex=[r'\D+'], value="")
Remember, we have already learned all the expressions and functions used in the above code snippet.