Linux: How To Quickly Remove Millions of Files
I tend to get myself involved with projects that have a lot of small files, on the order of a few million. These files are often only good to me for a short period of time, such as when I’m importing a million XML files into a SQL database for processing later. I’ve also worked on projects where millions of spam emails pile up in queues to be filtered by humans, which they obviously never will be, and simply need to be removed.
The annoying part is how much time and effort it takes to remove all of these files quickly while consuming as few system resources as possible.
In this entry I’m going to show you a couple of ways to set yourself up for quick removal of files that you know (or are pretty sure) will be temporary.
Traditionally, if you had a big directory structure to remove, you would simply use rm -rf to remove the files:
rm -rf /home/mybigpath
Of course, this will take forever and chew up all sorts of valuable system resources. We want our CPUs being used for business logic, not recursive directory searching and unlink operations!
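A rough way to see this cost on your own machine is to create a pile of small files and time their removal. This is only a sketch: the file count `N` and the scratch directory from `mktemp` are assumptions for illustration, and the numbers you get will depend heavily on your file system and disk.

```shell
#!/bin/sh
# Sketch: measure how long rm -rf takes on N small files.
# N and the scratch directory are illustrative assumptions, not values from the post.
N="${N:-2000}"
scratch="$(mktemp -d)"

# Create N tiny files.
i=1
while [ "$i" -le "$N" ]; do
    echo hello > "$scratch/$i"
    i=$((i + 1))
done
created="$(find "$scratch" -type f | wc -l | tr -d ' ')"

# rm -rf must look up and unlink every entry one by one.
start="$(date +%s)"
rm -rf "$scratch"
end="$(date +%s)"

echo "removed $created files in $((end - start)) second(s)"
```

Run it with something like `N=1000000 sh timing.sh` (the script name is hypothetical) to get closer to the scale discussed here.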
So how do we avoid doing an unlink of all of those files? Well, the key here is to plan ahead, knowing that we’re going to remove these files in the first place. Basically, we’re going to create another file system specifically for these millions of files, and when we’re done with them, we’ll simply remove or reformat that file system. There are several ways we can do this, just to name a few:
- We can dedicate whole partitions of hard drives (logical or physical) to this temporary location.
- We can use something like lvm (or lvm2) to manage these temporary chunks of disk space.
- We can use loop-device file mounting to create truly temporary file stores that require one unlink operation to blow away!
I personally like the third option (loop devices) for things I know are going to be deleted, and the second (lvm) for things that I might want to keep around, where I need to be safe.
I’m going to first show you option #3 (loop-device file), as this is the easiest to make use of. The basic steps for this:
- Create an “empty” file on your real file system.
- Format that file with a file system.
- Mount the file using the “-o loop” option.
- Create millions/billions/trillions of files within the newly mounted file system, and do whatever processing needs to be done on them.
- When you’re done with the files, unmount the file system and simply delete the large image file, which is a very simple and efficient operation for the OS to do.
Let’s dig into the details:
~# cd /mnt
/mnt# mkdir fakepoint
# This creates a 200MB file
/mnt# dd if=/dev/zero of=fake.img bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 6.93631 seconds, 30.2 MB/s
# We're using reiserfs in this example, but any file system should work
/mnt# mkreiserfs -f fake.img
/mnt# mount -o loop /mnt/fake.img /mnt/fakepoint
/mnt# df -h /mnt/fakepoint/
Filesystem Size Used Avail Use% Mounted on
/mnt/fake.img 200M 33M 168M 17% /mnt/fakepoint
/mnt# cd /mnt/fakepoint/
# create 1,000,000 dummy files
/mnt/fakepoint# perl -e \
'for(1..1_000_000) { open(F,">",$_) or die "open $_: $!"; print F "hello"; close(F) }'
# do a bunch of processing on your files
# what if we need to expand our file
# because we didn't allocate enough?
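# leave the mount point first, or umount will report the target as busy
/mnt/fakepoint# cd /mnt
/mnt# umount /mnt/fakepoint/
# seek past the existing 200 blocks (so we don't overwrite what we have)
# and add another 200MB; seek=200 would start exactly at the end of the
# file, seek=201 leaves a harmless 1MB gap, hence the 401M shown by df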
/mnt# dd if=/dev/zero of=fake.img bs=1M count=200 seek=201
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 5.97021 seconds, 35.1 MB/s
# resize the "file system"
/mnt# resize_reiserfs fake.img
# Re-mount
/mnt# mount -o loop /mnt/fake.img /mnt/fakepoint
/mnt# df -h /mnt/fakepoint/
Filesystem Size Used Avail Use% Mounted on
/mnt/fake.img 401M 132M 270M 33% /mnt/fakepoint
# all done with these files, we want to blow them away now
# simple as this:
/mnt# umount /mnt/fakepoint/
/mnt# rm fake.img
Notice that we can even re-size this “fake” file system. This allows us nearly total flexibility in how we handle these millions of files.
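The transcript notes that any file system should work, so here is a minimal sketch of the same create/grow/destroy cycle on ext4. Everything in it is an assumption rather than part of the original walkthrough: the 8G size, the `IMG` and `MNT` defaults, the helper names, and the use of `truncate` (which creates a sparse file instantly) in place of `dd`. One thing to watch: ext4 fixes its inode count when the file system is created, so the sketch passes `-i 4096` to ask for one inode per 4 KiB of space; check `df -i` after mounting to confirm there is room for the number of files you expect. By default the script only prints the commands; set `DRY_RUN=0` and run it as root to execute them.

```shell
#!/bin/sh
# Sketch of the loop-device lifecycle on ext4 (assumed names and sizes throughout).
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
IMG="${IMG:-/mnt/fake.img}"      # image file, as in the walkthrough
MNT="${MNT:-/mnt/fakepoint}"     # mount point, as in the walkthrough
SIZE="${SIZE:-8G}"               # hypothetical size; pick what you need

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_store() {
    run truncate -s "$SIZE" "$IMG"          # sparse file, no zero-filling
    run mkfs.ext4 -q -F -i 4096 "$IMG"      # -F: target is a regular file
    run mkdir -p "$MNT"
    run mount -o loop "$IMG" "$MNT"
}

grow_store() {
    run umount "$MNT"
    run truncate -s "+$SIZE" "$IMG"         # extend the image file
    run e2fsck -f -p "$IMG"                 # resize2fs wants a fresh check
    run resize2fs "$IMG"                    # grow the file system to fit
    run mount -o loop "$IMG" "$MNT"
}

destroy_store() {
    run umount "$MNT"
    run rm -f "$IMG"                        # one unlink removes everything
}

create_store
grow_store
destroy_store
```

The dry-run default is a deliberate choice: these commands need root and will happily destroy data, so it is worth reading what would run before running it.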
With this technique, you could even “transfer” these files from one Linux system to another by simply copying the file system-file (fake.img, in this case) from one computer to another.
This works very well for simple projects and things that are temporary. If you want a more permanent solution, you could use lvm2 to create a real logical disk partition that could be managed with enterprise-class tools. Most of the concepts stay the same, except you would use lvm2 to do all the disk-size management (the “dd” steps).
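As a rough illustration of the lvm2 route, the sketch below replaces the “dd” steps with `lvcreate`, `lvextend` and `lvremove`. The volume group `vg0`, the logical volume name `scratch`, the sizes, the choice of ext4 and the helper names are all hypothetical; substitute whatever exists on your system. As with any script that touches block devices, it defaults to printing the commands and only executes them when `DRY_RUN=0` is set and it is run as root.

```shell
#!/bin/sh
# Sketch of the same idea on lvm2 (volume group, names and sizes are assumptions).
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
VG="${VG:-vg0}"                  # hypothetical existing volume group
LV="${LV:-scratch}"              # hypothetical logical volume name
MNT="${MNT:-/mnt/fakepoint}"
SIZE="${SIZE:-8G}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_lv() {
    run lvcreate -L "$SIZE" -n "$LV" "$VG"      # carve out the volume
    run mkfs.ext4 -q -i 4096 "/dev/$VG/$LV"     # plenty of inodes for small files
    run mkdir -p "$MNT"
    run mount "/dev/$VG/$LV" "$MNT"
}

grow_lv() {
    # -r asks lvextend to resize the file system together with the volume
    run lvextend -r -L "+$SIZE" "$VG/$LV"
}

remove_lv() {
    run umount "$MNT"
    run lvremove -f "$VG/$LV"                   # drops the volume and every file in it
}

create_lv
grow_lv
remove_lv
```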
On a side note, here is a nice tutorial on using loop-devices to create encrypted data blocks. Very cool stuff.