I’m writing some scripts to help manage a mail scanner used at my
work. Being a mail scanner, it’s got huuuuUUUge quarantine
directories.
Now, I know I can do something along the lines of:
Dir.open("/foo").collect.length-2 #if you’re wondering, the -2 is to
ignore . and …
to get a count of what’s in a directory, but the problem there is that
it’s rather slow when you run it in a directory with a few thousand
files on a server under a severe (4.5 > average_load > 2) load.
After perusing the Dir, Find and Stat classes, I haven’t seen a better
way.
I thought that perhaps there was some sort of system call, at least in
Real OSes™ (Linux, *BSD, Unix, etc.), that would return the number of
files inside of a directory. Something that would hopefully return in
a 1/4th or 1/8th of a second, rather than in 4 or 8 (or 20…) seconds.
Any clues?
Thanks,
Kyle
> I’m writing some scripts to help manage a mail scanner used at my
> work. Being a mail scanner, it’s got huuuuUUUge quarantine
> directories.
> Now, I know I can do something along the lines of:
> Dir.open("/foo").collect.length - 2  # if you’re wondering, the -2 is to ignore . and ..
You could as well do
count = Dir.entries("/foo").size - 2
> Any clues?
The major time will be IO and that cannot be changed, I guess. You could
however do some form of caching: read the size and the last mod date of
each dir you are interested in and store that in a Hash (and write that
via Marshal to disk between invocations if your process terminates in
between). Then you need only check whether the mod date has changed, and
only read the directory if it has. The disadvantage is that you need one
more IO - albeit one that will pull just one block, so it might pay off.
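A minimal sketch of that caching idea (the cache file location and the
Hash layout here are invented for illustration):

require 'fileutils'

CACHE_FILE = '/var/tmp/dircounts.marshal'  # hypothetical location

# Load the cache written by a previous invocation, if any.
cache = File.file?(CACHE_FILE) ? Marshal.load(File.open(CACHE_FILE, 'rb') { |f| f.read }) : {}

# Re-read a directory only when its mod date has changed; a directory's
# mtime is updated whenever entries are added or removed.
def cached_count(cache, dir)
  mtime = File.mtime(dir)
  entry = cache[dir]
  unless entry && entry[:mtime] == mtime
    entry = { :mtime => mtime, :count => Dir.entries(dir).size - 2 }
    cache[dir] = entry
  end
  entry[:count]
end

puts cached_count(cache, '/foo')

# Persist the cache for the next run.
File.open(CACHE_FILE, 'wb') { |f| f.write(Marshal.dump(cache)) }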
Entries seems to be fairly identical to collect, and it does look
nicer…
but yeah, still slow.
The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of it. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it
forgot how to recover. Still, it’s often enough that we want to be
able to do it easily.
In cases other than a mail system, though, caching would be a very good
idea.
I’ll try and read more of the C stuff for handling files/directories
in Unix. I can hold out hope for a while.
> Entries seems to be fairly identical to collect, and it does look nicer…
> but yeah, still slow.
As I said: it’s the IO for crowded directories (see also Mike’s reply).
> The problem with caching is that we only keep quarantine directories
> around for 10 days, due to their size and the relative rarity of us
> needing to pull something out of it. One reason for writing this as a
> script is that we recover rarely enough that whoever is doing it
> forgot how to recover. Still, it’s often enough that we want to be
> able to do it easily.
> In cases other than a mail system, though, caching would be a very good idea.
I am not sure I understand why you think it is a bad idea. If you only
cache the number of files per directory, where is the issue? Or is this
script not invoked regularly? Probably I am missing a bit of your use
case.
> I’ll try and read more of the C stuff for handling files/directories
> in Unix. I can hold out hope for a while.
Won’t help. It’s really the size of the directory. Maybe you can give a
little more detail about your script and when it’s used, so we can come
up with better suggestions.
> Entries seems to be fairly identical to collect, and it does look
> nicer…
> but yeah, still slow.
> The problem with caching is that we only keep quarantine directories
> around for 10 days, due to their size and the relative rarity of us
> needing to pull something out of it. One reason for writing this as a
> script is that we recover rarely enough that whoever is doing it
> forgot how to recover. Still, it’s often enough that we want to be
> able to do it easily.
If there’s a large number of files in these directories that’s probably
the source of the slowness, not the method used to get the list of
entries.
Many filesystems (some less than others) don’t behave as well when you
get a “large” number of files in one directory. I think the rule of
thumb I’ve used for ext2 filesystems is that you’ll start to notice a
delay when you get a few hundred entries, and you’ll start to feel it
when you have thousands.
One way around this (short of installing or upgrading to an underlying
filesystem that handles these cases better, xfs for example) is to
split files out into a directory tree, based either on the filename
directly or on a hash made from the real filename. Say you take an MD5
hex string of the filename and make two levels from the first 4 hex
digits: 00/00, 00/01, …, ff/fe, ff/ff; 00/00 contains all files whose
hashed filename begins “0000…”, and so on. The downside of this is that
you either have to walk the entire tree to see the contents, or keep an
external index of the contents (which would eliminate your needing to do
what you’re trying to do and the justification for splitting things up,
but . . . :).
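For illustration, that two-level MD5 scheme could look like this (the
root path and the filename are made up):

require 'digest/md5'

# Map a filename to a bucket like <root>/ab/cd/<filename>, using the
# first 4 hex digits of the MD5 of the name: 00/00 through ff/ff.
def bucket_path(root, filename)
  hex = Digest::MD5.hexdigest(filename)
  File.join(root, hex[0, 2], hex[2, 2], filename)
end

puts bucket_path('/quarantine', 'message-12345.eml')
# => e.g. "/quarantine/3f/a2/message-12345.eml"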
> I thought that perhaps there was some sort of system call, at least in
> Real OSes™ (Linux, *BSD, Unix, etc.), that would return the number of
> files inside of a directory. Something that would hopefully return in
> a 1/4th or 1/8th of a second, rather than in 4 or 8 (or 20…) seconds.
> Any clues?
> Thanks,
> Kyle
On Windows there is such a call. With Ruby you have to take a bit of a
(known) detour to get there:
puts folder.Files.count
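The detour is presumably win32ole and the Scripting.FileSystemObject
(its GetFolder and GetFile methods come up later in the thread);
spelled out, something like:

require 'win32ole'

fso = WIN32OLE.new('Scripting.FileSystemObject')
folder = fso.GetFolder('C:\\foo')  # example path
puts folder.Files.Count            # Count is the COM collection property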
Did you verify that this is faster? I am skeptical because this call
does basically the same: it gets the list of files in the directory
and counts them. I would expect a speedup only if there was an API
function that would directly return the number of files.
>> puts folder.Files.count
> Did you verify that this is faster? I am skeptical because this call
> does basically the same: it gets the list of files in the directory
> and counts them. I would expect a speedup only if there was an API
> function that would directly return the number of files.
Robert,
The script itself won’t be run as routinely as the
directories are rotated. The directories have a daily rotation, so
there are only the most recent 10 days available at once, but the
script itself may only be invoked once or twice a month, at most.
I understand that the size of the directory itself is a problem, but I
was hoping that somehow there was a way to get a simple, more
efficient count. I know the b-tree based file systems are somewhat
new in Unix & Unix-like systems; I was just hoping there was some more
efficient way.
The script itself (as it stands now, albeit slower than I would have
liked) does the following:
With no arguments, lists the number of quarantined and spam messages
being held, for each day.
With a date, lists the file names of the quarantined messages, as well
as their recipients.
With a date and the file name of a quarantined message, warns the
user, asks them if they want to continue, then moves the message back
into the appropriate queue to be delivered.
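In outline, with every path below a stand-in rather than the real setup
(and the recipient listing left out, since it depends on the message
format), those three modes could hang off ARGV like this:

QUARANTINE = '/var/spool/quarantine'  # hypothetical: one subdirectory per day
QUEUE      = '/var/spool/mqueue'      # hypothetical delivery queue

date, name = ARGV

if date.nil?
  # No arguments: per-day message counts.
  Dir.entries(QUARANTINE).sort.each do |day|
    next if day == '.' || day == '..'
    count = Dir.entries(File.join(QUARANTINE, day)).size - 2
    puts "#{day}: #{count} messages"
  end
elsif name.nil?
  # Date only: list the quarantined messages for that day.
  Dir.entries(File.join(QUARANTINE, date)).sort.each do |f|
    puts f unless f == '.' || f == '..'
  end
else
  # Date and file name: confirm, then move the message back to the queue
  # (assumes the queue lives on the same filesystem as the quarantine).
  print "Really requeue #{name} from #{date}? [y/N] "
  if $stdin.gets.to_s.strip.downcase == 'y'
    File.rename(File.join(QUARANTINE, date, name), File.join(QUEUE, name))
  end
end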
Siep,
I’ve had a bit of experience with the Win32OLE objects in ruby
before (at my last job). You’re right, they are a detour, though
sometimes win32ole may feel more like a byway where your car breaks
down and the only place for you to stay that night is the Bates
Motel…
Mike,
I’ve been an advocate of using the right file system for the
job for ages now, but the sad matter is, this is running on a rather
old version of RedHat, which doesn’t support anything real other than
ext2 & 3. As for our possible upgrade paths to this box, it would
still be RedHat, or a clone (CentOS). From what I can see, they still
don’t support modern file systems by default. Admittedly I’m tempted
to add the support myself (it’s not hard), but then it’ll bring up the
“it’s a production system” argument here.
I’ll try and read more of the C stuff for handling files/directories
in Unix. I can hold out hope for a while.
Thanks,
Kyle
You may have already gotten here…
What kind of times does this give? (The first run will include the
initial compilation time.)
You can modify it to meet your needs (if you have questions, just post
back) – see man scandir: you can set up a filter function to return
counts for only specific file matches.
As is, it returns a count for all files, visible and hidden.
require 'rubygems'
require 'inline'

class DirCount
  inline do |builder|
    builder.include '<dirent.h>'
    builder.include '<stdio.h>'
    builder.include '<stdlib.h>'  # for free()
    builder.c "
      /* Count the entries in the current directory using scandir(3).
         The third argument is the filter function; passing 0 accepts
         every entry, including '.', '..' and hidden files. */
      int count() {
        struct dirent **namelist;
        int n;
        int count;
        count = n = scandir(\".\", &namelist, 0, 0);
        if (n < 0)
          perror(\"scandir\");
        else {
          while (n--) {
            /* printf(\"%s\\n\", namelist[n]->d_name); */
            free(namelist[n]);
          }
          free(namelist);
        }
        return (count);
      }"
  end
end

dc = DirCount.new
puts dc.count
-----------snip--------------------
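To answer the “what kind of times” question, a rough comparison can be
made with Benchmark from the standard library (assuming the DirCount
class above is loaded, and with '/foo' standing in for a real
quarantine directory):

require 'benchmark'

Dir.chdir('/foo') do
  dc = DirCount.new
  puts Benchmark.measure { dc.count }               # scandir via RubyInline
  puts Benchmark.measure { Dir.entries('.').size }  # plain Ruby, for comparison
end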
Oh, that’s quite an interesting way to work with files and folders
-Thufir
Yes, indeed, in fact it’s quite efficient according to the benchmark
tests above!
By the way, is there a link or documentation listing the
classes/methods that can be used, like GetFolder, GetFile, etc.? So far
I only know about these two methods…