Linux: How To Quickly Remove Millions of Files

I tend to get myself involved with projects that have a lot of small files, on the order of a few million.  These files are often only useful to me for a short period of time, such as when I’m importing a million XML files into a SQL database for processing later.  I’ve also worked on projects where millions of spam emails pile up in queues to be filtered by humans, which they obviously never will be, and simply need to be removed.

The annoying part is how much time and effort it takes to remove all of these files while consuming as few system resources as possible.

In this entry I’m going to show you a couple of ways you can set yourself up to quickly remove files that you know (or are pretty sure) will be temporary.

Traditionally, if you had a big directory structure to remove, you would simply use rm -rf to remove the files:

rm -rf /home/mybigpath

Of course, this will take forever and chew up all sorts of valuable system resources. We want our CPUs being used for business logic, not recursive directory searching and unlink operations!
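
To get a feel for that cost on your own hardware, one rough approach is to time rm -rf against a throwaway tree. This is only a sketch: the file count (10,000) and the mktemp scratch directory are assumptions chosen to keep the run short, and real numbers depend heavily on the file system and the disk underneath it.

```shell
#!/bin/sh
# Rough timing sketch: create many small files, then time their removal.
set -eu

count=10000                 # assumption: small enough to finish quickly
workdir=$(mktemp -d)        # throwaway directory, not a real data path

# Create the files; each one holds a few bytes.
i=1
while [ "$i" -le "$count" ]; do
    echo hello > "$workdir/$i"
    i=$((i + 1))
done

created=$(ls "$workdir" | wc -l | tr -d ' ')
echo "created $created files"

# rm has to unlink every one of these files individually.
start=$(date +%s)
rm -rf "$workdir"
end=$(date +%s)

echo "rm -rf took about $((end - start)) second(s)"
```

Raising count toward the millions should make the removal time grow along with it, which is the cost the rest of this entry tries to avoid.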

So how do we avoid doing an unlink of all of those files? Well, the key here is to set ourselves up knowing that we’re going to remove these files in the first place. Basically, we’re going to create another file system specifically for these millions of files, and when we’re done with them, we’ll simply remove or reformat the file system we were using. There are several ways we can do this; to name a few:

  1. We can dedicate whole partitions of hard drives (logical or physical) to this temporary location.
  2. We can use something like LVM (or LVM2) to manage these temporary chunks of disk space.
  3. We can use loop-device file mounting to create truly temporary file stores that require one unlink operation to blow away!

I personally like #3 for things I know are going to be deleted, and #2 for things that I might want to keep around, where I need to be safe.

I’m going to first show you option #3 (loop-device file), as this is the easiest to make use of.  The basic steps for this:

  1. Create an “empty” file on your real file system.
  2. Format that file with a file system, as if it were a block device.
  3. Mount the file using the “-o loop” option.
  4. Create millions/billions/trillions of files within the newly mounted file system, and do whatever processing needs to be done on them.
  5. When you’re done with the files, unmount the file and simply delete the large fake file system, which is a very simple and efficient operation for the OS to do.
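
The steps above can be sketched as a small script. This is a hedged outline rather than a finished tool: the function names (run, scratch_create, scratch_destroy), the IMG, MNT, SIZE_MB and DRY_RUN variables, and the choice of ext4 via mkfs.ext4 are all assumptions of this sketch. By default it only prints the commands; running them for real needs root. Note that ext4 fixes its inode count at format time, so a real run with millions of files would likely need a larger image and mke2fs's -N option.

```shell
#!/bin/sh
# Sketch of the five steps. With DRY_RUN=1 (the default here) commands are
# printed, not executed. Set DRY_RUN=0 and run as root to execute them.
set -eu

IMG=${IMG:-/mnt/fake.img}       # backing file (step 1)
MNT=${MNT:-/mnt/fakepoint}      # mount point (step 3)
SIZE_MB=${SIZE_MB:-200}         # size of the backing file in MB

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

scratch_create() {
    run dd if=/dev/zero of="$IMG" bs=1M count="$SIZE_MB"   # 1. "empty" file
    run mkfs.ext4 -F -q "$IMG"                              # 2. format it
    run mkdir -p "$MNT"
    run mount -o loop "$IMG" "$MNT"                         # 3. loop-mount it
}

scratch_destroy() {
    run umount "$MNT"                                       # 5. unmount, then
    run rm -f "$IMG"                                        #    a single unlink
}

scratch_create
echo "# 4. create and process files under $MNT here"
scratch_destroy
```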

Let’s dig into the details:

~# cd /mnt 
/mnt# mkdir fakepoint 
# This creates a 200MB file 
/mnt# dd if=/dev/zero of=fake.img bs=1M count=200 
200+0 records in 
200+0 records out 
209715200 bytes (210 MB) copied, 6.93631 seconds, 30.2 MB/s      

# We're using reiserfs in this example, but any file system should work 
/mnt# mkreiserfs -f fake.img 
/mnt# mount -o loop /mnt/fake.img /mnt/fakepoint 
/mnt# df -h /mnt/fakepoint/ 
Filesystem            Size  Used Avail Use% Mounted on 
/mnt/fake.img         200M   33M  168M  17% /mnt/fakepoint      

/mnt# cd /mnt/fakepoint/      

# create 1,000,000 dummy files 
/mnt/fakepoint# perl -e 'for(1..1_000_000) { open(F,">",$_); print F "hello"; close(F) }'

# do a bunch of processing on your files      

# what if we need to expand our file
# because we didn't allocate enough?
# (leave the mount point first, or umount will report it as busy)
/mnt/fakepoint# cd /mnt
/mnt# umount /mnt/fakepoint/

# seek 201 MB into the file (so we don't overwrite what we have)
# and add another 200 MB; seek=200 would also be safe, since the
# existing data ends at exactly 200 MB
/mnt# dd if=/dev/zero of=fake.img bs=1M count=200 seek=201 
200+0 records in 
200+0 records out 
209715200 bytes (210 MB) copied, 5.97021 seconds, 35.1 MB/s      

# resize the "file system" 
/mnt# resize_reiserfs fake.img       

# Re-mount 
/mnt# mount -o loop /mnt/fake.img /mnt/fakepoint 
/mnt# df -h /mnt/fakepoint/ 
Filesystem            Size  Used Avail Use% Mounted on 
/mnt/fake.img         401M  132M  270M  33% /mnt/fakepoint      

# all done with these files, we want to blow them away now 
# simple as this: 
/mnt# umount /mnt/fakepoint/ 
/mnt# rm fake.img

Notice that we can even re-size this “fake” file system.  This allows us nearly total flexibility in how we handle these millions of files.
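
The dd seek trick used for the resize can be tried safely on a tiny file first. The sketch below is an illustration under assumptions: the 4 KiB sizes, the demo.img name, and the mktemp scratch directory are made up for the demo, and the behavior described is that of GNU dd and GNU truncate. With seek, dd cuts the output file at the seek offset and writes from there, so bytes before the offset are left alone.

```shell
#!/bin/sh
# Demonstrates growing an image file without touching existing data.
set -eu

workdir=$(mktemp -d)
img="$workdir/demo.img"

# Start with a 4 KiB "image" that has recognisable data at the front.
dd if=/dev/zero of="$img" bs=1024 count=4 2>/dev/null
printf 'hello' | dd of="$img" conv=notrunc 2>/dev/null

# Append 4 KiB of zeros starting exactly at the old end of the file.
dd if=/dev/zero of="$img" bs=1024 count=4 seek=4 2>/dev/null
size_after_dd=$(wc -c < "$img" | tr -d ' ')
head_after_dd=$(head -c 5 "$img")

# Alternative: extend by another 4 KiB without writing any data blocks.
truncate -s +4096 "$img"
size_after_truncate=$(wc -c < "$img" | tr -d ' ')

echo "size after dd:       $size_after_dd bytes"
echo "first bytes after dd: $head_after_dd"
echo "size after truncate: $size_after_truncate bytes"

rm -rf "$workdir"
```

Either way, the file system inside the image still has to be grown afterwards with its own resize tool, as resize_reiserfs does in the session above.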

With this technique, you could even “transfer” these files from one Linux system to another by simply copying the file system-file (fake.img, in this case) from one computer to another.
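
A cautious way to do that copy is to unmount the image first and compare checksums on both ends. The sketch below makes two substitutions so that it runs anywhere, and both are assumptions: a local cp stands in for scp or rsync, and a tiny zero-filled file stands in for fake.img.

```shell
#!/bin/sh
# Copy an (unmounted) image file and verify the copy by checksum.
set -eu

workdir=$(mktemp -d)
src="$workdir/fake.img"
dst="$workdir/copy-of-fake.img"

# Stand-in for the real image; a real one should be unmounted before copying.
dd if=/dev/zero of="$src" bs=1024 count=64 2>/dev/null

sum_before=$(sha256sum "$src" | cut -d' ' -f1)

# Stand-in for the transfer to the other machine (scp, rsync, ...).
cp "$src" "$dst"

sum_after=$(sha256sum "$dst" | cut -d' ' -f1)

if [ "$sum_before" = "$sum_after" ]; then
    copy_ok=yes
else
    copy_ok=no
fi
echo "copy verified: $copy_ok"

rm -rf "$workdir"
```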

This works very well for simple projects and things that are temporary.  If you want a more permanent solution, you could use LVM2 to create a real logical volume that can be managed with enterprise-class tools.  Most of the concepts stay the same, except you would use LVM2 to do all the disk-size management (the “dd” steps).
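
For comparison, the LVM2 variant might look roughly like the sketch below. It is hedged heavily: the volume group name vg0, the logical volume name scratch, the mount point /mnt/scratch, the ext4 file system, and the helper function names are all hypothetical, and an existing volume group with free space is assumed. By default it only prints the commands; running them for real needs root.

```shell
#!/bin/sh
# LVM2 sketch. With DRY_RUN=1 (the default here) commands are printed,
# not executed. Set DRY_RUN=0 and run as root to execute them.
set -eu

VG=${VG:-vg0}               # hypothetical, pre-existing volume group
LV=${LV:-scratch}           # hypothetical logical volume name
MNT=${MNT:-/mnt/scratch}    # hypothetical mount point

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

lv_create() {
    run lvcreate -L 200M -n "$LV" "$VG"     # replaces the first dd step
    run mkfs.ext4 -q "/dev/$VG/$LV"
    run mkdir -p "$MNT"
    run mount "/dev/$VG/$LV" "$MNT"
}

lv_grow() {
    # -r asks lvextend to resize the file system along with the volume,
    # which replaces the second dd step and the separate resize command.
    run lvextend -r -L +200M "$VG/$LV"
}

lv_destroy() {
    run umount "$MNT"
    run lvremove -f "$VG/$LV"               # drops every file in one go
}

lv_create
lv_grow
lv_destroy
```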

On a side note, here is a nice tutorial on using loop-devices to create encrypted data blocks.  Very cool stuff.
