When you use a scripting language like Python, one thing you will find yourself doing over and over again is walking a directory tree and processing the files in it. While there are many ways to do this, Python's standard library offers a function, os.walk, that makes this process a breeze.
Basic Python Directory Traversal
Here’s a really simple example that walks a directory tree, printing out the name of each directory and the files contained:
    # Import the os module, for the os.walk function
    import os

    # Set the directory you want to start from
    rootDir = '.'
    for dirName, subdirList, fileList in os.walk(rootDir):
        print('Found directory: %s' % dirName)
        for fname in fileList:
            print('\t%s' % fname)
os.walk takes care of the details, and on every pass of the loop, it gives us three things:
dirName: The path of the next directory it found.
subdirList: A list of the names of the sub-directories in the current directory.
fileList: A list of the names of the files in the current directory.
Let’s say we have a directory tree that looks like this:
    +--- test.py
    |
    +--- [subdir1]
    |     |
    |     +--- file1a.txt
    |     +--- file1b.png
    |
    +--- [subdir2]
    |
    +--- file2a.jpeg
    +--- file2b.html
The code above will produce the following output:
    Found directory: .
        file2a.jpeg
        file2b.html
        test.py
    Found directory: ./subdir1
        file1a.txt
        file1b.png
    Found directory: ./subdir2
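A common next step is to turn these names into full paths. Since fileList holds bare file names and dirName is the path of the directory that contains them, os.path.join can combine the two. Here is a minimal sketch; the helper name list_file_paths is made up for illustration and is not part of the os module:

```python
import os


def list_file_paths(rootDir):
    """Return the path of every file under rootDir, sorted for stable output."""
    paths = []
    for dirName, subdirList, fileList in os.walk(rootDir):
        for fname in fileList:
            # fileList holds bare names, so join each one to its directory
            paths.append(os.path.join(dirName, fname))
    return sorted(paths)
```

Called as list_file_paths('.') on the example tree, this would return paths such as ./test.py and ./subdir1/file1a.txt.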
Changing the Way the Directory Tree is Traversed
By default, Python walks the directory tree top-down: a directory is passed to you for processing, and only then does Python descend into its sub-directories. We can see this behaviour in the output above; the parent directory (.) was printed first, then its two sub-directories.
Sometimes we want to traverse the directory tree bottom-up, so that the directories at the very bottom of the tree are processed first and we then work our way up. We can tell os.walk to do this via the topdown parameter:
    import os

    rootDir = '.'
    for dirName, subdirList, fileList in os.walk(rootDir, topdown=False):
        print('Found directory: %s' % dirName)
        for fname in fileList:
            print('\t%s' % fname)
Which gives us this output:
    Found directory: ./subdir1
        file1a.txt
        file1b.png
    Found directory: ./subdir2
    Found directory: .
        file2a.jpeg
        file2b.html
        test.py
Now we get the files in the sub-directories first, then we work our way up the directory tree.
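Bottom-up order is handy whenever a directory's result depends on its sub-directories. As a rough sketch, the hypothetical helper below (count_files_per_directory is an illustrative name, not something os provides) totals the files in each directory and everything beneath it, relying on the fact that sub-directories are visited before their parent:

```python
import os


def count_files_per_directory(rootDir):
    """Map each directory path to the number of files in it and below it."""
    totals = {}
    for dirName, subdirList, fileList in os.walk(rootDir, topdown=False):
        count = len(fileList)
        for subdir in subdirList:
            # Bottom-up order means each sub-directory was visited earlier,
            # so its total is normally already stored. Entries that were not
            # walked (for example symlinked directories) count as 0.
            count += totals.get(os.path.join(dirName, subdir), 0)
        totals[dirName] = count
    return totals
```

With a top-down walk the sub-directory totals would not exist yet when the parent is processed, which is why topdown=False matters here.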
Selectively Recursing Into Sub-Directories
The examples so far have simply walked the entire directory tree, but os.walk allows us to selectively skip parts of the tree.
For each directory os.walk gives us, it also provides a list of sub-directories (in subdirList). If we modify this list, we can control which sub-directories os.walk will descend into. Let’s tweak our example above so that we skip the first sub-directory.
    import os

    rootDir = '.'
    for dirName, subdirList, fileList in os.walk(rootDir):
        print('Found directory: %s' % dirName)
        for fname in fileList:
            print('\t%s' % fname)
        # Remove the first entry in the list of sub-directories
        # if there are any sub-directories present
        if len(subdirList) > 0:
            del subdirList[0]
This gives us the following output:
    Found directory: .
        file2a.jpeg
        file2b.html
        test.py
    Found directory: ./subdir2
We can see that the first sub-directory (subdir1) was indeed skipped.
This only works when the directory is being traversed top-down. In a bottom-up traversal, sub-directories are processed before their parent directory, so modifying subdirList would be pointless: by that time, the sub-directories have already been processed!
It’s also important to modify subdirList in-place, so that the code calling us will see the changes. If we did something like this:
    subdirList = subdirList[1:]
… we would create a new list of sub-directories, one that the calling code wouldn’t know about.
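When several sub-directories need to be filtered out at once, slice assignment is one way to keep the change in-place. The sketch below assumes a hypothetical helper, walk_skipping, and a caller-supplied set of directory names to skip; neither is part of os.walk itself:

```python
import os


def walk_skipping(rootDir, skipNames):
    """Return the directories visited, never descending into skipNames."""
    visited = []
    for dirName, subdirList, fileList in os.walk(rootDir):
        visited.append(dirName)
        # Slice assignment replaces the contents of the same list object,
        # so os.walk sees the pruned list. A plain "subdirList = [...]"
        # would only rebind the local name and change nothing.
        subdirList[:] = [d for d in subdirList if d not in skipNames]
    return visited
```

For example, walk_skipping('.', {'subdir1'}) would visit the starting directory and subdir2 in the example tree, but not subdir1 or anything inside it.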
For a more comprehensive tutorial on Python’s os.walk function, check out the recipe Recursive File and Directory Manipulation in Python. Or, to look at traversing directories another way (using recursion), check out the recipe Recursive Directory Traversal in Python: Make a list of your movies!.