Traversing directories recursively
How to get the home directory in Python?
import os
home = os.path.expanduser("~")
This will ensure it works on all platforms. Or we can do:
from os.path import expanduser
home = expanduser("~")
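expanduser() is often combined with os.path.join() to build paths under the home directory; a minimal sketch (the TEST name is just a placeholder):

import os

home = os.path.expanduser("~")           # e.g. /home/user
test_dir = os.path.join(home, "TEST")    # e.g. /home/user/TEST
print(test_dir)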
What is the difference between the os.path.basename() and os.path.dirname() functions?
Both functions rely on os.path.split(path), which splits the pathname path into a pair, (head, tail) (see the os.path.split() sketch after the examples below).
- path = "/foo/bar/item"
- The os.path.dirname(path) function returns the head of the path.
>>> os.path.dirname(path)
'/foo/bar'
- The os.path.basename(path) function returns the tail of the path.
>>> os.path.basename(path)
'item'
Note that if we have a slash ('/') at the end of the path, we get a different result:
- path = "/foo/bar/item/"
- The os.path.dirname(path) function returns the head of the path.
>>> os.path.dirname(path)
'/foo/bar/item'
- The os.path.basename(path) function returns the tail of the path.
>>> os.path.basename(path)
''
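Since both functions are defined in terms of os.path.split(), the same behavior can be seen by calling it directly; a quick sketch:

>>> import os.path
>>> os.path.split("/foo/bar/item")
('/foo/bar', 'item')
>>> os.path.split("/foo/bar/item/")
('/foo/bar/item', '')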
os.walk(top, topdown=True, onerror=None, followlinks=False)
os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up.
For each directory in the tree rooted at directory top, it yields a 3-tuple:
(dirpath, dirnames, filenames)
dirpath is a string, the path to the directory. dirnames is a list of the names of the subdirectories in dirpath (excluding '.' and '..'). filenames is a list of the names of the non-directory files in dirpath.
Note that the names in the lists contain no path components. To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name).
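The signature above also takes a topdown flag. With topdown=False, os.walk() yields each subdirectory before its parent (bottom-up), which is handy, for example, when removing a tree from the leaves up. A minimal sketch, reusing the ./TEST tree from the preview below:

import os

# bottom-up walk: subdirectories are reported before their parents
for dirpath, d_names, f_names in os.walk("./TEST", topdown=False):
    print(dirpath)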
As a preview, walking a small ./TEST tree gives the following output:
import os

path = "./TEST"
for root, d_names, f_names in os.walk(path):
    print root, d_names, f_names
Output:
./TEST ['D1', 'D2'] ['root.txt']
./TEST/D1 [] ['d1_b.txt', 'd1_a.txt']
./TEST/D2 [] ['d2_a.txt']
We can make a full path for each file:
import os

path = "./TEST"
fname = []
for root, d_names, f_names in os.walk(path):
    for f in f_names:
        fname.append(os.path.join(root, f))

print("fname = %s" % fname)
The output:
fname = ['./TEST/root.txt', './TEST/D1/d1_b.txt', './TEST/D1/d1_a.txt', './TEST/D2/d2_a.txt']
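Another thing worth knowing: when walking top-down, os.walk() looks at d_names again after each yield, so removing entries from the list in place prunes the traversal. A minimal sketch, assuming the same ./TEST layout, that skips the D2 subdirectory:

import os

path = "./TEST"
for root, d_names, f_names in os.walk(path):
    # prune in place: os.walk() will not descend into removed names (topdown only)
    d_names[:] = [d for d in d_names if d != "D2"]
    print root, d_names, f_names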
As another example, we're going to use the tree below (a root directory with subdirectories A, B, and C, where A contains AA and B contains BB):
dirpath:
import os

for dirpath, dirs, files in os.walk("."):
    print dirpath
Output:
.
./A
./A/AA
./C
./B
./B/BB
dirs:
import os

for dirpath, dirs, files in os.walk("."):
    print dirs
Output:
['A', 'C', 'B']
['AA']
[]
[]
['BB']
[]
files:
import os

for dirpath, dirs, files in os.walk("."):
    print files
Output:
['f1']
['f3', 'f2']
['f4', 'f5']
['f9']
['f6']
['f7', 'f8']
How to list files in directories recursively?
Here is the code:
import os

for dirpath, dirs, files in os.walk("."):
    path = dirpath.split('/')
    print '|', len(path)*'---', '[', os.path.basename(dirpath), ']'
    for f in files:
        print '|', len(path)*'---', f
Suppose we are now in the TREE directory; then the output should look like this:
| --- [ . ]
| --- f1
| --- f0.py
| ------ [ A ]
| ------ f3
| ------ f2
| --------- [ AA ]
| --------- f4
| --------- f5
| ------ [ C ]
| ------ f9
| ------ [ B ]
| ------ f6
| --------- [ BB ]
| --------- f7
| --------- f8
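Note that dirpath.split('/') assumes a POSIX path separator; a slightly more portable variant (just a sketch) computes the depth with os.sep instead:

import os

for dirpath, dirs, files in os.walk("."):
    depth = len(dirpath.split(os.sep))   # works with the platform's separator
    print '|', depth*'---', '[', os.path.basename(dirpath), ']'
    for f in files:
        print '|', depth*'---', f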
How to traverse a directory tree?
One of the answers may be to use os.walk() to recursively traverse directories.
So, in this section, we want to print all file contents recursively using os.walk():
import os

for dirpath, dirs, files in os.walk("./TREE/"):
    for filename in files:
        fname = os.path.join(dirpath, filename)
        with open(fname) as myfile:
            print(myfile.read())
The key here is to use os.path.join() when we read the files. Note that the names in the lists contain no path components. To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, filename).
The output from the code:
inside f1
...
inside f8
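The onerror parameter from the signature above can be used to report directories that cannot be listed instead of silently skipping them; os.walk() calls the function with the OSError instance. A minimal sketch (the helper name walk_error is arbitrary):

import os

def walk_error(err):
    # err is the OSError raised while listing a directory
    print("cannot list %s: %s" % (err.filename, err.strerror))

for dirpath, dirs, files in os.walk("./TREE/", onerror=walk_error):
    for filename in files:
        print(os.path.join(dirpath, filename))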
Here is another example. It reads in CMakeLists.txt files and generates a csv file listing the targets.
import os

home = os.path.expanduser("~")
root_dir = os.path.join(home, "TEST/TF/tf")
cmake_path = os.path.join(root_dir, "CMakeLists.txt")

# make target list
def make_target_list():
    target_tag = "set(TARGET_NAME"
    target_list = []
    for dirpath, dirs, files in os.walk(root_dir):
        if "CMakeLists.txt" in files:
            for f in files:
                cmake_path = os.path.join(dirpath, f)
                with open(cmake_path, 'rb') as lines:
                    for line in lines:
                        if target_tag in line:
                            target = line[line.find(target_tag)+len(target_tag)+1:line.find(")")]
                            target_list.append(target.strip('"'))
    return target_list

# writing csv
def write_csv(t):
    import csv
    with open('tf.csv', 'wb') as f:
        w = csv.writer(f, delimiter=' ')
        w.writerow(t)

if __name__ == '__main__':
    target = make_target_list()
    print target
    write_csv(target)
Output:
['assignment-client', 'ice-server', 'plugins', 'octree', 'embedded-webserver', 'audio', 'script-engine', 'entities-renderer', 'render-utils', 'model', 'render', 'animation', 'gpu', 'input-plugins', 'networking', 'fbx', 'ui', 'shared', 'avatars', 'display-plugins', 'entities', 'environment', 'audio-client', 'auto-updater', 'physics', 'gpu-test', 'render-utils-test', 'ui-test', 'shaders-test', 'entities-test', 'interface', 'gvr-interface', 'scribe', 'vhacd-util', 'mtc', 'domain-server']
The csv file looks like this:
$ cat tf.csv
assignment-client ice-server plugins octree embedded-webserver audio script-engine entities-renderer render-utils model render animation gpu input-plugins networking fbx ui shared avatars display-plugins entities environment audio-client auto-updater physics gpu-test render-utils-test ui-test shaders-test entities-test interface gvr-interface scribe vhacd-util mtc domain-server
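As a side note, writerow(t) puts every target on a single space-delimited line; if one target per line is preferred, writerows() can be used instead. A minimal sketch (write_csv_per_line is just a hypothetical variant of write_csv above):

import csv

def write_csv_per_line(targets):
    # one row per target instead of a single space-delimited row
    with open('tf.csv', 'wb') as f:
        w = csv.writer(f)
        w.writerows([[t] for t in targets])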
The next example is almost the same as the previous ones.
I need to find files which have more than one unit of Google Ads (I am supposed to have only one unit of that type per page).
So, during the recursive traversal, I have to include only *.php files, not *.png, *.txt, etc. Also, I have to count the occurrences of the ad unit in each file. If a file has more than one unit, the code prints two things: the full path and the count.
Here is the code:
import os

for dirpath, dirs, files in os.walk("."):
    for filename in files:
        fname = os.path.join(dirpath, filename)
        if fname.endswith('.php'):
            with open(fname) as myfile:
                line = myfile.read()
                c = line.count('bogo_sq')
                if c > 1:
                    print fname, c
Output sample:
../AngularJS/AngularJS_More_Directives_HTML_DOM.php 2
../DevOps/AWS/aws_S3_Simple_Storage_Service.php 2
../Android/android22Threads.php 2
../python/pytut.php 2
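An alternative way to keep only *.php files is fnmatch.filter(), which applies a shell-style pattern to the filename list. A minimal sketch of the same count:

import os
import fnmatch

for dirpath, dirs, files in os.walk("."):
    for filename in fnmatch.filter(files, '*.php'):
        fname = os.path.join(dirpath, filename)
        with open(fname) as myfile:
            c = myfile.read().count('bogo_sq')
            if c > 1:
                print fname, c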
Recently, I switched from http to https (SSL). But I found out my CSS files were not loading properly, and that broke my pages. It turned out I was using absolute http URLs, for example, "http://apis.google.com/js/plusone.js". So, I had to go through all the files and replace them with protocol-relative URLs such as "//apis.google.com/js/plusone.js".
Here is the script I used for the task:
import os
import sys
import fileinput

path = "../public_html"

# construct full file path recursively
file_list = []
for root, d_names, f_names in os.walk(path):
    for f in f_names:
        file_list.append(os.path.join(root, f))

print("file_list : %s" % file_list)

search = ['http://netdna', 'http://fonts', 'http://maxcdn', 'http://getbootstrap',
          'http://php', 'http://tinymce', 'http://wiki.moxiecode', 'http://www.google',
          'http://www.apple', 'http://activex.microsoft', 'http://download.macromedia',
          'http://cdn.mathjax', 'http://code.jquery', 'http://w.sharethis', 'http://s7.addthis',
          'http://ajax.googleapis', 'http://d3js.org', 'http://apis.google',
          'http://pagead2.googlesyndication']

replace = ['//netdna', '//fonts', '//maxcdn', '//getbootstrap',
           '//php', '//tinymce', '//wiki.moxiecode', '//www.google',
           '//www.apple', '//activex.microsoft', '//download.macromedia',
           '//cdn.mathjax', '//code.jquery', '//w.sharethis', '//s7.addthis',
           '//ajax.googleapis', '//d3js.org', '//apis.google',
           '//pagead2.googlesyndication']

if len(search) != len(replace):
    sys.exit("Error: search does not match with replace")

# Loop through files and replace "search" with "replace"
for filename in file_list:
    # exclude img/video files
    if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bak', '.mp4', '.ogv',
                                  '.mp3', '.ismv', '.ts', '.ismc', '.m3u8', '.mov', '.avi',
                                  '.dv', '.wmv', '.ogg', '.mkv', '.webm', '.idx', '.tar',
                                  '.gz', '.zip', '.pdf', '.svg', '.bz')):
        continue
    print(filename)

    # Read in the file
    with open(filename, 'r') as file:
        filedata = file.read()

    # Replace the target string
    for i in range(len(search)):
        filedata = filedata.replace(search[i], replace[i])

    # Write the file out again
    with open(filename, 'w') as file:
        file.write(filedata)
I replaced quite a few API calls with protocol-relative URLs like "//apis.google", just dropping the "http:". I also excluded some files such as media or compressed ones. I did not use in-place replacing since my Python version was 2.7 (for reference, a fileinput-based sketch is shown after the output below).
The output (just listing the processed files):
...
../public_html/dj/static/admin/js/jquery.init.js
../public_html/dj/static/admin/js/collapse.js
../public_html/dj/static/admin/js/prepopulate.js
...
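For reference, here is what an in-place variant with fileinput could look like; just a sketch, assuming the same file_list, search, and replace lists from the script above:

import sys
import fileinput

# inplace=1 redirects stdout into the file currently being processed,
# so whatever is written out becomes the new file content
for line in fileinput.input(file_list, inplace=1):
    for i in range(len(search)):
        line = line.replace(search[i], replace[i])
    sys.stdout.write(line)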