Using pd.concat()
Method
To merge multiple CSV files in Python:
- Create a list containing all CSV files we will merge.
- Create an empty dataframe.
- Use the
for
loop to iterate over the files. - In each iterate, use
read_csv()
to read CSV files and concatenate using thepd.concat()
method. - Use
to_csv()
to write merged data into a new CSV file.
1 2 3 4 5 6 7 8 9 |
import pandas as pd files = ['file1.csv', 'file2.csv'] df = pd.DataFrame() for file in files: data = pd.read_csv(file) df = pd.concat([df, data], axis=0) df.to_csv('merged_files.csv', index=False) |
Let’s first see content of file1.csv and file2.csv to understand output of above code.
1 2 3 4 5 6 |
ID,FirstName,LastName 1,Sara,Ali 2,John,Johnas 3,Nayya,Daniel |
1 2 3 4 5 |
ID,FirstName,LastName 1,Sara,Ahmed 5,Daniel,Christopher |
1 2 3 4 5 6 7 8 |
ID,FirstName,LastName 1,Sara,Ali 2,John,Johnas 3,Nayya,Daniel 1,Sara,Ahmed 5,Daniel,Christopher |
We created a list having all CSV files; having a list is useful when we are required to merge multiple CSV files. We created an empty DataFrame to contain the data returned by the read_csv()
method, which took a file’s name as an argument in every iteration of the for
loop.
Also, we used the .concat()
method in each iteration of the for
loop to concatenate the data
into the DataFrame df
. Finally, we used the to_csv()
file to write the new DataFrame to merged.files.csv
.
NOTE: In the previous pandas
versions, we could use the df.append()
method instead of pd.concat()
, but that is deprecated since version 1.4.0. So, we can use the pd.concat()
or the df.merge()
method.
You may have observed in the above-provided solutions that we either read the CSV file one by one or create a list with all CSV files that we read using the for
loop. If you are looking for a solution to set up your working directory, you can choose one of the following methods.
Setting Up the Working Directory
We can set up the working directory by using the following three ways.
1 2 3 4 5 6 |
import os import pandas as pd file_path = "C:\Downloads\Merge_Files\csv_files" file_list = os.listdir(file_path) |
Here, we gave the directory’s path into a variable, which lists every file. For example, in file_path
, we passed the CSV files path, and in the file_list
, we listed all the files in a directory by providing the file path parameter.
1 2 3 4 5 6 |
import os import pandas as pd cwd = os.path.abspath('') file_list = os.listdir(cwd) |
We can use the code above to read files from the same directory as your .ipynb
file; an IPYNB
file is a notebook document created by Jupyter Notebook, an interactive computational environment that helps scientists manipulate and analyze data using Python.
1 2 3 4 5 6 |
import os import glob import pandas as pd csv_files = glob.glob('*.{}'.format('csv')) |
The glob()
function can only list the CSV files in the working directory given above. The glob
module in Python finds files and pathnames that match a given pattern. ‘glob`’s pattern rules follow standard Unix path expansion rules.
Additionally, it is projected that benchmarks show it to be quicker than previous ways to match pathnames in directories. With glob, we used wildcard (*
) apart from the exact string search to make path retrieval more simple and convenient.
Using pd.merge()
Method
To merge multiple CSV files in Python:
- Use
pd.read_csv()
to read all CSV files one by one. - Use the
pd.merge()
method to merge all the CSV files.
1 2 3 4 5 6 7 |
import pandas as pd df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv') merged_df = pd.merge(df1, df2, on='ID', how='inner') merged_df.to_csv('merged_files.csv', index=False) |
1 2 3 4 |
ID,FirstName_x,LastName_x,FirstName_y,LastName_y 1,Sara,Ali,Sara,Ahmed |
After importing the pandas
library, we used its alias pd
to avoid keystrokes. Then, we used the .read_csv()
method to read the specified CSV file (file1.csv
) and saved it into a data frame called df1
. Similarly, we read file2.csv
, but this time we saved it to another DataFrame df2
. Next, we used the .merge()
method of the pandas
library to merge both DataFrames based on the ID
column.
The how
parameter is used to specify the type of merge to be performed. In this case, the how='inner'
denotes that only the rows with matching values in the ID
column will be included in the merged DataFrame.
Next, we used the merged_df.to_csv('merged_files.csv', index=False)
method to write the merged DataFrame to a new CSV file named merged_files.csv
and the index=False
option used to prevent the DataFrame index from being saved in the CSV file.
It’s worth noting that in the case of different column names in both data frames, you can use left_on
and right_on
to specify which columns to merge on.
That’s all about how to merge multiple CSV files in Python.