The Python Oracle

Import multiple CSV files into pandas and concatenate into one DataFrame

Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn

--

Music by Eric Matyas
https://www.soundimage.org
Track title: Fantascape Looping

--

Chapters
00:00 Question
00:46 Accepted answer (Score 813)
03:31 Answer 2 (Score 364)
03:51 Answer 3 (Score 114)
04:02 Answer 4 (Score 85)
05:26 Thank you

--

Full question
https://stackoverflow.com/questions/2090...

Accepted answer links:
[pandas: IO tools]: https://pandas.pydata.org/docs/user_guid...
[Sid]: https://stackoverflow.com/users/3888455/...
[pathlib]: https://docs.python.org/3/library/pathli...
[np.repeat]: https://numpy.org/doc/stable/reference/g...
[answer]: https://stackoverflow.com/a/65951915/775...
[.assign]: https://pandas.pydata.org/docs/reference...
[C8H10N4O2]: https://stackoverflow.com/users/2573061/...

Answer 2 links:
[darindaCoder's answer]: https://stackoverflow.com/a/21232849/388...

Answer 4 links:
[Python's map (function, iterable)]: https://docs.python.org/3.5/library/func...
[read_csv()]: https://pandas.pydata.org/pandas-docs/st...
[concat()]: http://pandas.pydata.org/pandas-docs/sta...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #pandas #csv #dataframe #concatenation

#avk47



ACCEPTED ANSWER

Score 893


See pandas: IO tools for all of the available .read_ methods.

Try the following code if all of the CSV files have the same columns.

header=0 is passed so that the first row of each file is used as the column names.

import pandas as pd
import glob
import os

path = r'C:\DRO\DCL_rawdata_files'  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

Or, with attribution to a comment from Sid:

all_files = glob.glob(os.path.join(path, "*.csv"))

df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

  • It's often necessary to identify each sample of data, which can be accomplished by adding a new column to the dataframe.
  • pathlib from the standard library will be used for this example. It treats paths as objects with methods, instead of strings to be sliced.

Imports and Setup

from pathlib import Path
import pandas as pd
import numpy as np

path = r'C:\DRO\DCL_rawdata_files'  # or unix / linux / mac path

# Get the files from the path provided in the OP
files = Path(path).glob('*.csv')  # use .rglob to include subdirectories; this is a generator, so re-create it before running another option

Option 1:

  • Add a new column with the file name
dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a pathlib attribute that gives the filename without the extension
    data['file'] = f.stem
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 2:

  • Add a new column with a generic name using enumerate
dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 3:

  • Create the dataframes with a list comprehension, and then use np.repeat to add a new column.
    • [f'S{i}' for i in range(len(dfs))] creates a list of strings to name each dataframe.
    • [len(df) for df in dfs] creates a list of lengths
  • Attribution for this option goes to this plotting answer.
# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]

# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)

# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])

Option 4:

  • One-liners using .assign to create the new column, with attribution to a comment from C8H10N4O2:
df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)

or

df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)



ANSWER 2

Score 369


An alternative to darindaCoder's answer:

import glob
import os

import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'                  # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))  # os.path.join makes the pattern OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one



ANSWER 3

Score 121


import glob
import os

import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))



ANSWER 4

Score 97


Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on extra imports. You can do this in two lines using only what pandas and Python (all versions) already have built in.

For a few files - one-liner

df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv', 'd3.csv']))

For many files

import os

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

For No Headers

If you want to customize how pd.read_csv is called (e.g., to read files with no headers), define a separate function and pass that to map:

def f(i):
    return pd.read_csv(i, header=None)

df = pd.concat(map(f, filepaths))
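
As a standard-library alternative to a named wrapper function, functools.partial can pre-bind the keyword argument; sketched here with in-memory buffers rather than real files:

```python
import io
from functools import partial

import pandas as pd

# In-memory stand-ins for headerless CSV files on disk
buffers = [io.StringIO("1,2\n3,4\n"), io.StringIO("5,6\n")]

# partial pre-binds header=None, equivalent to the wrapper f above
read_headerless = partial(pd.read_csv, header=None)
df = pd.concat(map(read_headerless, buffers), ignore_index=True)
```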

This pandas line, which builds df, relies on three things:

  1. Python's map(function, iterable) applies the function (pd.read_csv) to each element of the iterable (every CSV path in filepaths).
  2. pandas' read_csv() reads in each CSV file as normal.
  3. pandas' concat() combines all the resulting frames into the single df variable.
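
Those three steps can be exercised end to end; the sketch below writes two throwaway CSVs to a temporary directory first so it is self-contained (the filenames d1.csv and d2.csv are invented for the demo):

```python
import os
import shutil
import tempfile

import pandas as pd

# Write two tiny CSVs to a temporary directory so the demo is self-contained
tmp = tempfile.mkdtemp()
for name, body in [("d1.csv", "a,b\n1,2\n"), ("d2.csv", "a,b\n3,4\n")]:
    with open(os.path.join(tmp, name), "w") as fh:
        fh.write(body)

# Same listdir + map + concat pattern as above (sorted for a stable row order)
filepaths = [os.path.join(tmp, f) for f in sorted(os.listdir(tmp)) if f.endswith(".csv")]
df = pd.concat(map(pd.read_csv, filepaths), ignore_index=True)

shutil.rmtree(tmp)  # clean up the throwaway files
```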