The Python Oracle

Python recursive folder read

Full question
https://stackoverflow.com/questions/2212...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #scripting #fileio



ACCEPTED ANSWER

Score 470


Make sure you understand the three return values of os.walk:

for root, subdirs, files in os.walk(rootdir):

has the following meaning:

  • root: The directory currently being walked
  • subdirs: Names of the directories directly inside root
  • files: Names of the non-directory entries directly inside root (not those inside subdirs)

And please use os.path.join instead of concatenating with a slash! Your problem is filePath = rootdir + '/' + file: you must join with the folder currently being walked, not the topmost folder. So that must be filePath = os.path.join(root, file). BTW, "file" was a builtin in Python 2, so you don't normally use it as a variable name.
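
To see why the topmost folder is the wrong prefix, here is a minimal sketch that builds a throwaway tree in a temporary directory and compares the two ways of building the path. The helper build_paths and the names sub and nested.txt are made up for illustration; they are not part of the original question.

```python
import os
import tempfile

def build_paths(rootdir):
    """Return (wrong, right) path lists for every file under rootdir."""
    wrong, right = [], []
    for root, subdirs, files in os.walk(rootdir):
        for filename in files:
            # Bug from the question: always prefixes the topmost folder.
            wrong.append(rootdir + '/' + filename)
            # Fix: prefix the folder that is currently being walked.
            right.append(os.path.join(root, filename))
    return wrong, right

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <tmp>/sub/nested.txt
    os.mkdir(os.path.join(tmp, 'sub'))
    with open(os.path.join(tmp, 'sub', 'nested.txt'), 'w') as f:
        f.write('hello')

    wrong, right = build_paths(tmp)
    wrong_exists = [os.path.exists(p) for p in wrong]
    right_exists = [os.path.exists(p) for p in right]

print(wrong_exists)  # [False]: <tmp>/nested.txt does not exist
print(right_exists)  # [True]:  <tmp>/sub/nested.txt does
```

The concatenated path only happens to work for files that sit directly in the topmost folder, which is why the bug is easy to miss on a flat directory.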

Another problem is your loops, which should look like this, for example:

import os
import sys

walk_dir = sys.argv[1]

print('walk_dir = ' + walk_dir)

# If your current working directory may change during script execution, it's recommended to
# immediately convert program arguments to an absolute path. Then the variable root below will
# be an absolute path as well. Example:
# walk_dir = os.path.abspath(walk_dir)
print('walk_dir (absolute) = ' + os.path.abspath(walk_dir))

for root, subdirs, files in os.walk(walk_dir):
    print('--\nroot = ' + root)
    list_file_path = os.path.join(root, 'my-directory-list.txt')
    print('list_file_path = ' + list_file_path)

    with open(list_file_path, 'wb') as list_file:
        for subdir in subdirs:
            print('\t- subdirectory ' + subdir)

        for filename in files:
            file_path = os.path.join(root, filename)

            print('\t- file %s (full path: %s)' % (filename, file_path))

            with open(file_path, 'rb') as f:
                f_content = f.read()
                list_file.write(('The file %s contains:\n' % filename).encode('utf-8'))
                list_file.write(f_content)
                list_file.write(b'\n')

If you didn't know, the with statement for files is a shorthand:

with open('filename', 'rb') as f:
    dosomething()

# is effectively the same as

f = open('filename', 'rb')
try:
    dosomething()
finally:
    f.close()
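
As a quick check of that equivalence, the sketch below raises an exception inside a with block and then inspects the file object. The temporary file and the variable names handle and closed_after_error are assumptions made for the demonstration.

```python
import os
import tempfile

# Create an empty temporary file to open.
fd, path = tempfile.mkstemp()
os.close(fd)

handle = None
try:
    with open(path, 'rb') as f:
        handle = f  # keep a reference so the object can be inspected afterwards
        raise RuntimeError('simulated failure inside the block')
except RuntimeError:
    pass

# The with statement closed the file even though the block raised.
closed_after_error = handle.closed
print(closed_after_error)  # True

os.remove(path)
```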



ANSWER 2

Score 281


If you are using Python 3.5 or above, you can get this done in 1 line.

import glob

# Example value (an assumption); root_dir needs a trailing slash (i.e. /root/dir/)
root_dir = './'
for filename in glob.iglob(root_dir + '**/*.txt', recursive=True):
    print(filename)

As mentioned in the documentation

If recursive is true, the pattern '**' will match any files and zero or more directories and subdirectories.

If you want every file (directories are returned as well), you can use

import glob

for filename in glob.iglob(root_dir + '**/**', recursive=True):
    print(filename)
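
The following sketch tries both patterns on a small temporary tree. It is a variation rather than the answer's exact code: it builds the pattern with os.path.join (so no trailing slash is needed), escapes the base directory with glob.escape, and filters the catch-all result with os.path.isfile. The file names and the helper rel are invented for the example.

```python
import glob
import os
import tempfile

def rel(paths, base):
    """Return sorted paths relative to base, using '/' as the separator."""
    return sorted(os.path.relpath(p, base).replace(os.sep, '/') for p in paths)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: a.txt, sub/b.txt, sub/c.log
    os.mkdir(os.path.join(tmp, 'sub'))
    for name in ('a.txt', os.path.join('sub', 'b.txt'), os.path.join('sub', 'c.log')):
        open(os.path.join(tmp, name), 'w').close()

    base = glob.escape(tmp)  # protect any special characters in the directory name

    # '**' matches zero or more directories, so a.txt at the top level is included.
    txt_files = rel(glob.iglob(os.path.join(base, '**', '*.txt'), recursive=True), tmp)

    # A trailing '**' yields directories too, so keep regular files only.
    everything = glob.iglob(os.path.join(base, '**'), recursive=True)
    all_files = rel((p for p in everything if os.path.isfile(p)), tmp)

print(txt_files)  # ['a.txt', 'sub/b.txt']
print(all_files)  # ['a.txt', 'sub/b.txt', 'sub/c.log']
```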



ANSWER 3

Score 45


I agree with Dave Webb: os.walk yields one item for each directory in the tree. In fact, you don't have to care about the subfolders list at all.

Code like this should work:

import os
import sys

rootdir = sys.argv[1]

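out_name = 'python-outfile.txt'

for folder, subs, files in os.walk(rootdir):
    # Concatenate every file in this folder into one output file in the same folder
    with open(os.path.join(folder, out_name), 'w') as dest:
        for filename in files:
            if filename == out_name:
                continue  # skip an output file left over from an earlier run
            with open(os.path.join(folder, filename), 'r') as src:
                dest.write(src.read())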



ANSWER 4

Score 41


TL;DR: These are rough equivalents of find -type f, going over all files in all folders below and including the current one (the glob version also returns directories):

folder = '.'

import os
for currentpath, folders, files in os.walk(folder):
    for file in files:
        print(os.path.join(currentpath, file))
## or:
import glob
for pathstr in glob.iglob(glob.escape(folder) + '/**/*', recursive=True):
    print(pathstr)

Comparing the two methods:

  • os.walk was about 3× faster in my test
  • os.walk used slightly more memory in my test because the files list held 82k entries, whereas glob.iglob returns an iterator and streams the results. The piecemeal handling of each result (more calls and less buffering) likely explains the speed difference
    • If you forget the i in glob.iglob(), it will return a list rather than an iterator and potentially use a lot more memory
  • os.walk will not silently give incomplete results or unexpectedly interpret a name as a matching pattern
  • glob doesn't show empty directories
  • glob needs to escape directory and file names using glob.escape(name) because they can contain special characters
  • glob excludes directories and files starting with a dot (e.g., ~/.bashrc or ~/.vim) and include_hidden does not solve that (it includes hidden folders only; you need to specify a second pattern for dotfiles)
  • glob doesn't tell you what is a file and what a directory
  • glob walks into symlinks and may lead to you enumerating a lot of files in completely different places (which may be what you want; in that case, os.walk has followlinks=True as an option)
  • os.walk lets you choose which paths to walk down by modifying the folders list in place while it's running (this only works with the default topdown=True), though personally this feels a bit messy and I'm not sure I would recommend it
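
The last point can be sketched as follows. This is only an illustration under assumed names: walk_pruned, skip_names, and the directory layout are hypothetical. It also shows in passing that os.walk reports dotfiles.

```python
import os
import tempfile

def walk_pruned(top, skip_names):
    """Yield file paths under top without descending into directories in skip_names."""
    for currentpath, folders, files in os.walk(top):  # topdown=True is the default
        # Slice assignment mutates the list that os.walk holds on to;
        # rebinding the name (folders = [...]) would have no effect.
        folders[:] = [d for d in folders if d not in skip_names]
        for name in files:
            yield os.path.join(currentpath, name)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: keep/a.txt, skipme/b.txt, .hidden.txt
    for d in ('keep', 'skipme'):
        os.mkdir(os.path.join(tmp, d))
    for name in (os.path.join('keep', 'a.txt'),
                 os.path.join('skipme', 'b.txt'),
                 '.hidden.txt'):
        open(os.path.join(tmp, name), 'w').close()

    found = sorted(os.path.relpath(p, tmp).replace(os.sep, '/')
                   for p in walk_pruned(tmp, {'skipme'}))

print(found)  # ['.hidden.txt', 'keep/a.txt']
```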

Other answers already mentioned os.walk(), but it could be explained better. It's quite simple! Let's walk through this tree:

docs/
└── doc1.odt
pics/
todo.txt

With this code:

for currentpath, folders, files in os.walk('.'):
    print(currentpath)

The currentpath is the current folder it is looking at. This will output:

.
./docs
./pics

So it loops three times, because there are three folders: the current one, docs, and pics. In every iteration, it fills the variables folders and files with the folders and files directly inside currentpath. Let's show them:

for currentpath, folders, files in os.walk('.'):
    print(currentpath, folders, files)

This shows us:

# currentpath  folders           files
.              ['pics', 'docs']  ['todo.txt']
./pics         []                []
./docs         []                ['doc1.odt']

So in the first line, we see that we are in folder ., that it contains two folders, namely pics and docs, and that there is one file, namely todo.txt. You don't have to do anything to recurse into those folders, because as you see, it recurses automatically and gives you the files in each subfolder, and in any subfolders of those (though we don't have any in the example).
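
If you want to reproduce this yourself, the sketch below recreates the example tree in a temporary directory and records what os.walk reports. The order in which os.walk lists entries is not guaranteed, so the results are sorted here; the dictionary rows is just a convenience for this demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the example tree: docs/doc1.odt, pics/, todo.txt
    os.mkdir(os.path.join(tmp, 'docs'))
    os.mkdir(os.path.join(tmp, 'pics'))
    open(os.path.join(tmp, 'docs', 'doc1.odt'), 'w').close()
    open(os.path.join(tmp, 'todo.txt'), 'w').close()

    rows = {}
    for currentpath, folders, files in os.walk(tmp):
        # Store each folder relative to the top so the output is stable.
        key = os.path.relpath(currentpath, tmp)
        rows[key] = (sorted(folders), sorted(files))

for key in sorted(rows):
    print(key, *rows[key])
# . ['docs', 'pics'] ['todo.txt']
# docs [] ['doc1.odt']
# pics [] []
```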

If you just want to loop through all files, the equivalent of find -type f, you can do this:

for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))

This outputs:

./todo.txt
./docs/doc1.odt