In real-life scenarios, we deal with massive datasets with many rows and columns. At times, we may want to split a large DataFrame into smaller DataFrames.
We will discuss different methods to split a DataFrame in Python using the pandas library.
Using iloc to split DataFrame in Python
Slicing is a method of extracting a smaller number of elements from a larger structure. We can use iloc to slice a DataFrame into smaller DataFrames. iloc is an indexer rather than a function, so it takes square brackets: df.iloc[rows, columns]. It selects elements by the integer position of rows and columns, not by their labels. Using it, we can split a DataFrame based on rows or columns.
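To make the position-based behaviour concrete, here is a minimal sketch. The frame and its index labels (10, 20, 30) are made up for illustration and are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical frame whose index labels do not match the row positions
df = pd.DataFrame({'Name': ['Jay', 'Jennifer', 'Preity'],
                   'Age': [18, 17, 19]},
                  index=[10, 20, 30])

# Rows at positions 0 and 1, which carry the labels 10 and 20
first_two = df.iloc[:2]

# Last column by position, returned as a Series
last_col = df.iloc[:, -1]

print(first_two)
print(last_col)
```

The slice boundaries always count positions from zero, whatever labels the index holds.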
By Rows
We can select the required range of rows from the DataFrame using iloc.
See the following example.
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

df1 = df.iloc[2:, :]  # rows from position 2 to the end, all columns
df2 = df.iloc[:2, :]  # first two rows, all columns
print(df1)
print(df2)
Output:
     Name Gender  Age
2  Preity      F   19
3    Neil      M   17
       Name Gender  Age
0       Jay      M   18
1  Jennifer      F   17
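In the above example, we split the DataFrame based on rows: df1 holds the rows from position 2 onward and df2 holds the first two rows. It is useful to know the total number of rows and columns in the DataFrame (available from df.shape) when using this method.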
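When the DataFrame has to be cut into several pieces rather than two, the row count can drive the slice boundaries. The following is a minimal sketch under that assumption; split_rows is a hypothetical helper written for this article, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])


def split_rows(frame, chunk_size):
    """Return consecutive row slices holding at most chunk_size rows each."""
    return [frame.iloc[start:start + chunk_size]
            for start in range(0, len(frame), chunk_size)]


# With 4 rows and a chunk size of 3, the last chunk holds the single leftover row
chunks = split_rows(df, 3)
for chunk in chunks:
    print(chunk)
```

Each chunk keeps its original index labels, so the pieces can be put back together later with pd.concat().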
By Columns
We can split the DataFrame by columns in the same way, by specifying the column range in the second position of iloc.
For example,
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

df1 = df.iloc[:, :2]  # all rows, first two columns
df2 = df.iloc[:, 2:]  # all rows, columns from position 2 to the end
print(df1)
print(df2)
Output:
       Name Gender
0       Jay      M
1  Jennifer      F
2    Preity      F
3      Neil      M
   Age
0   18
1   17
2   19
3   17
Using the sample() function to split DataFrame in Python
The sample() function returns a random sample of items from a DataFrame. By default it samples rows; passing axis=1 samples columns instead. The size of the sample can be given as a fraction of the total with the frac parameter, or as an absolute count with n.
For example,
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

# Randomly pick 75% of the rows (3 of the 4 here)
df1 = df.sample(frac=0.75)
print(df1)
Output:
     Name Gender  Age
0     Jay      M   18
2  Preity      F   19
3    Neil      M   17
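The ratio specified in the above example is 0.75, so three of the four rows are returned. Because the selection is random, the rows and their order can differ between runs. We can use other parameters like random_state (to make the result reproducible) and weights (to change the selection probability of each row) to have some control over the final result.
This method is commonly used when dividing a DataFrame into train and test datasets in machine learning.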
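A train and test split needs both parts, while sample() returns only one of them. One possible way to get the remainder, sketched here on the assumption that the index labels are unique, is to drop the sampled rows from the original DataFrame. The names train and test and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

# random_state fixes the seed so the same rows are picked on every run
train = df.sample(frac=0.75, random_state=42)

# The rows that were not sampled form the second part
# (this relies on the index labels being unique)
test = df.drop(train.index)

print(train)
print(test)
```

Together the two parts contain every row of the original DataFrame exactly once.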
Using the groupby() function to split DataFrame in Python
The groupby() function is used to split the DataFrame based on the values in one or more columns. We can first group the DataFrame and then extract a specific group using the get_group() function.
This method works best when we want to split a DataFrame based on some column that has categorical values.
For example,
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

# Group the rows by the values in the Gender column
gr = df.groupby('Gender')
print(gr.get_group('F'))
Output:
       Name Gender  Age
1  Jennifer      F   17
2    Preity      F   19
In the above example, we grouped the DataFrame using the Gender column and extracted the rows where the value for this column is F.
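To obtain every group at once instead of calling get_group() for each value, we can iterate over the grouped object, which yields pairs of group key and sub-DataFrame. The sketch below collects them in a dictionary; the name groups is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

# Iterating over a groupby object yields (key, sub-DataFrame) pairs
groups = {key: group for key, group in df.groupby('Gender')}

for key, group in groups.items():
    print(key)
    print(group)
```

Each value in the dictionary is a separate DataFrame that holds the rows of one category.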
Using column labels to split DataFrame in Python
We can pass a list of column labels inside square brackets to extract those columns from the DataFrame. To select columns by integer position instead, use iloc.
For example,
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

# Keep only the listed columns
df1 = df[['Name', 'Gender']]
print(df1)
Output:
       Name Gender
0       Jay      M
1  Jennifer      F
2    Preity      F
3      Neil      M
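Selecting columns this way gives one part of the split. If the remaining columns are needed as a second DataFrame, one option is drop() with the same list of labels, as in this minimal sketch. The variable name selected is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame([['Jay', 'M', 18], ['Jennifer', 'F', 17],
                   ['Preity', 'F', 19], ['Neil', 'M', 17]],
                  columns=['Name', 'Gender', 'Age'])

selected = ['Name', 'Gender']

df1 = df[selected]               # the listed columns
df2 = df.drop(columns=selected)  # every other column

print(df1)
print(df2)
```

Neither call modifies df, so the original DataFrame stays available after the split.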
That’s all about how to split a DataFrame in pandas.