Efficientlly Walking File Directories in Rust (Part 2)

This post is meant to follow on from Efficientlly Walking File Directories in Rust and correct some innaccuracies in Speeding Up File Modification Time Lookup on Windows.

The overall goal of the 2 mentioned posts was to find an efficient way to get the file modification times for all the files in a directory tree.

Only Look Once

It took me a bit to realize, but the logic I was using was actually walking the directory structure twice:

  1. jwalk was used as a threaded way to get the names and types of each file in each directory.
  2. NtQueryDirectoryFile was used to get the modification times of the files that were provided by jwalk.

File I/O is generally slower than most other CPU processing. So in general only walking a directory once should show a decent improvement.

Accessing the files in each directory twice was happening becuase a direct call to get a file’s modified time via std::fs::modified() showed a significant time cost. I double checked that jwalk itself didn’t happen to propagate the modified time. One can see that the jwalk::DirEntry::from_entry() method copies some fields from the std::fs::DirEntry struct, but it happens to omit the metadata, which contains the modified time.

With the realization that the use of jwalk was causing me to go to the file system twice I decided to re-visit the topic of Efficientlly Walking File Directories in Rust

Directory Walking (with examples)

With a bit more knowledge in rust and with my desire to better back up my understanding with metrics I created, https://github.com/speedyleion/rust_directory_walking. This provides a simple executable that times the walking of directories using different approaches.

   $ directory_walking.exe llvm-project
   The jwalk time 163.6281ms, and the files 107593
   The ntquery_walk_dir time 185.8282ms, and the files 108404
   The walk_dir_path_is_dir time 638.7077ms, and the files 108450
   The walk_dir time 631.5173ms, and the files 108450
   The walk_dir_threaded time 147.7341ms, and the files 108451

This was a run against the llvm-project using commit 0f9f0a40.

And a nice table form:

method Time (ms) directory entries
jwalk 163.6281 107593
ntquery_walk_dir 185.8282 108404
walk_dir_path_is_dir 638.7077 108450
walk_dir 631.5173 108450
walk_dir_threaded 147.7341 108451

The timings fluctuate a little bit and these are just a snapshot of one run. The directory entry differences are based on how naivly the different implementations may or may not ignore dot files.

ntquery_walk_dir

The results of this version contradicts many of my claims made in Speeding Up File Modification Time Lookup on Windows.

I double checked the implementation of std::fs::read_dir() in the hopes that maybe it was using NtQueryDirectoryFile on the backend and was just more efficient than my attempt. However looking in the rust source code it looks to be using FindNextFileW.

I did some more research to see if perhaps I was completly misunderstanding NtQueryDirectoryFile and what it could do:

In the end it seems that some people have seen aperformance benefit. I’m attributing the slowdown due to something in my attempt. Perhaps using inefficient conversion logic for the file name.

walk_dir_path_is_dir

It doesn’t really show it in these results. In fact sometimes it will report faster than walk_dir. However on some windows systems I’ve noticed significant performance penalties when walking directories and using std::path::Path::is_dir(). On one system in particular, I saw timings for win-git-status going from 1.2 seconds, using std::path::Path::is_dir(), down to 500ms by using is_dir from std::fs::FileType. Going from memory I believe this was ~80k files. This was on a system with a spinning disk and a bit of security software. The timings reported above are on an SSD with a bit less security.

In general it’s probably better to access the file_type() of directory entries provided by read_dir.

walk_dir_threaded

I was a bit surprised that just a basic rayon::scope around read_dir outperforms jwalk. In jwalk’s defense, it is a generic threaded directory walking implementation. And jwalk clearly outperforms the single threaded walk_dir appraoch. Using rayon::scope around read_dir, one is more likely to write a specific implementation that isn’t as portable or reusable.

From the documentation on fs::DirEntry::metadata():

On Windows this function is cheap to call (no extra system calls needed), but on Unix platforms this function is the equivalent of calling symlink_metadata on the path.

This means that there is no extra cost, on windows, to getting the file modification times when using read_dir directly.

Summary

Based on the timings, my naive implementation of walking directories twice to compare modification times resulted in 349.4563ms (163.6281 + 185.8282) just walking through the files. Updating to use walk_dir_threaded only costs 147.7341ms or an improvement of ~55%.

This improvment was clearly seen when running win-git-status against the llvm-project at commit 0f9f0a40. The time went from ~770ms to ~570ms