Moving files matching Regex

marksbrain · November 25, 2007, 11:00am

Hi all,

A question relating to the code below.

What I’m trying to do is:
Files named CAL221107.xls, NCPH141202.xls or GOH333333.xls for example
in the current directory should be appended to an array (just the file
name, not the full path) and moved to ./sent/

It seems to be working, but is there a more Rubyish way of doing it?

require ‘find’
require ‘fileutils’

a=[]
fglob = Regexp.new(’(CAL|NCPH|GOH)\d{6}.xls’)

Find.find(’./’) do |path|
curr_file = File.basename(path)
if curr_file =~ /#{fglob}/
a << curr_file
new_file = “./sent/#{curr_file}”
FileUtils.mv(curr_file, new_file) if not File.exists?(new_file)
end
end

puts a

thanks,

marksbrain · November 25, 2007, 11:30am

Alle domenica 25 novembre 2007, Mark W. ha scritto:

a << curr_file
new_file = "./sent/#{curr_file}"
FileUtils.mv(curr_file, new_file) if not File.exists?(new_file)

end
end

puts a

thanks,

Not very different:

require ‘find’
require ‘fileutils’

a=[]
fglob = /(CAL|NCPH|GOH)\d{6}.xls/

Find.find(’.’) do |f|
name = File.basename f
if name=~ fglob
a << name
FileUtils.mv f, “sent/#{name}” unless File.exist? “sent/#{name}”
end
end

A couple of notes:

You can create a regexp using the // syntax, instead of using
Regexp.new
when you call String#=~ you don’t need to use /#{…}/, if you already
have
a regexp
I think you need to pass path, and not curr_file, as the first
argument of
FileUtils.mv, otherwise it won’t work if the file is in a subdirectory.

Stefano

marksbrain · November 25, 2007, 12:11pm

Hi Stefano,

On Sun, 25 Nov 2007 05:30:10 -0500
Stefano C. [email protected] wrote:

name = File.basename f
if name=~ fglob
a << name
FileUtils.mv f, “sent/#{name}” unless File.exist? “sent/#{name}”
end
end

A couple of notes:

You can create a regexp using the // syntax, instead of using
Regexp.new

that’s better, thanks

when you call String#=~ you don’t need to use /#{…}/, if you
already have a regexp

another good thing to know.

I think you need to pass path, and not curr_file, as the first
argument of FileUtils.mv, otherwise it won’t work if the file is in a
subdirectory.

There won’t be any sub directories except for /sent.
I’ve discovered from the cookbook I can use prune to ignore it.
Below’s where I’m up to.

Stefano

appreciate your help,
cheers,

–
Mark

require ‘find’
require ‘fileutils’

FILE_REGEX = /(CAL|NCPH|GOH)\d{6}.xls/
SENT_DIR = “./sent/”

def get_file_names
fn=[]
Find.find(‘./’) do |path|
Find.prune if File.basename(path) == ‘sent’
curr_file = File.basename(path)
if curr_file =~ FILE_REGEX
fn << curr_file
end
end
fn
end

def move_files
@a.each do |fn|
new_loc = SENT_DIR + fn
FileUtils.mv(fn, new_loc) if not File.exists?(new_loc)
end
end

@a = get_file_names
move_files
puts @a

marksbrain · November 25, 2007, 12:48pm

Mark W. wrote:

There won’t be any sub directories except for /sent.

I add this to the advice from Stefano:

if you just want to find the files in the current directory, then we
do not need to use Find (which, as Stefano pointed out, could find files
in subdirectories that you may not foresee now…); and we can collect
the files in one shot with grep.
for precaution, it is better to match the complete file name (just in
case there may be other files with that name + an extension, eg like
.saved, etc).
we could place the method that moves the files into class Array,
rather than leaving it in the toplevel. (Consider a param for the
destination dir, unless you are sure that it will be always a Constant).

class Array
def move_to_subdir
self.each do |fn|
new_loc = SENT_DIR + fn
FileUtils.mv(fn, new_loc) unless File.exists?(new_loc)
end
end
end

the main is now 4 lines:

fglob= %r{^ (CAL|NCPH|GOH) \d{6} .xls $}x

all_files = Dir.glob("*")
my_files = all_files.grep(fglob)

my_files.move_to_subdir

Regards,

Raul

marksbrain · November 25, 2007, 5:25pm

On 25.11.2007 12:48, Raul P. wrote:

we could place the method that moves the files into class Array,
rather than leaving it in the toplevel. (Consider a param for the
destination dir, unless you are sure that it will be always a Constant).

I would not do that because it is not really an Array operation. Having
a top level method (aka function) is better IMHO.

Kind regards

robert

marksbrain · November 26, 2007, 8:35am

2007/11/25, Raul P. [email protected]:

ask the reason, the answer is a vague "they do not really fit well
…
files.each do |curr_file|
FileUtils.mv_files_to_subdir(my_files, ‘sent’)
#-------

It is a small point (some may say ‘but what’s the difference with a
top-level method?’), but let me know your opinion, if you have a chance

Definitively the right way to go if there is a chance that the code is
reused and might create namespace collisions in a larger project. If
if is just used in a simple script I would stick with the top level
method defined inside the script because distributing the code would
be unnecessarily complex in that case. One can refactor later if
needed.

Kind regards

robert

marksbrain · November 25, 2007, 11:46pm

Robert K. wrote:

I would not do that because it is not really an Array operation.

You are right; it does not belong in class Array.

Having a top level method (aka function) is better IMHO.

Here I use the occasion to share a perplexity I have, when I look at
Ruby programs, and I sometimes see a myriad of top-level methods. If I
ask the reason, the answer is a vague “they do not really fit well
anywhere”, which leaves me a bit disconcerted (especially when the
script is part of a project that may grow, not something written to be
used once).

In the case discussed, I would place this method in a module (that would
just need to be required):

module MyUtils
def self.mv_files_to_subdir(files, subdir)
files.each do |curr_file|
…
end
end
end

or, as this is in fact a file operation, I would consider to reopen
FileUtils and add the method:

#–
module FileUtils
def self.mv_files_to_subdir(files, subdir)
files.each do |curr_file|
new_file = “./#{subdir}/#{curr_file}”
FileUtils.mv(curr_file, new_file) unless File.exists?(new_file)
end
end
end

fglob= %r{^ (CAL|NCPH|GOH) \d{6} .xls $}x
all_files = Dir.glob("*")
my_files = all_files.grep(fglob)

FileUtils.mv_files_to_subdir(my_files, ‘sent’)
#-------

It is a small point (some may say ‘but what’s the difference with a
top-level method?’), but let me know your opinion, if you have a chance

Regards,

Raul

marksbrain · November 26, 2007, 11:45am

On Sun, 25 Nov 2007 20:59:46 +1100
Mark W. [email protected] wrote:

thanks to all replies. Very informative.
I’ve benchmarked this and ‘get_file_names2’ is considerably quicker!
(if I’m benchmarking correctly?)

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

def get_file_names
fn=[]
Find.find(‘./’) do |path|
Find.prune if File.basename(path) == ‘sent’
curr_file = File.basename(path)
if curr_file =~ FILE_REGEX
fn << curr_file
end
end
fn
end

def get_file_names2
fglob= %r{^(CAL|NCPH|GOH)\d{6}.xls$}x
all_files = Dir.glob(“*”)
my_files = all_files.grep(fglob)
end

Benchmark.bm(5) do |timer|
timer.report(‘get_file_names’) {10000.times {get_file_names}}
timer.report(‘get_file_names2’) {10000.times {get_file_names2}}
end

Results:

             user       system     total     real

get_file_names 3.040000 0.700000 3.740000 (5.440394)
get_file_names2 0.570000 0.160000 0.730000 (0.997355)

cheers,

marksbrain · November 26, 2007, 1:36pm

2007/11/26, Mark W. [email protected]:

On Sun, 25 Nov 2007 20:59:46 +1100
Mark W. [email protected] wrote:

thanks to all replies. Very informative.
I’ve benchmarked this and ‘get_file_names2’ is considerably quicker!
(if I’m benchmarking correctly?)

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

In _2 you do a dir glob in one directory only while find recursively
descends a directory tree. Even though you prune in some cases it
comes as no surprise that this is slower.

end
Benchmark.bm(5) do |timer|

Kind regards

robert

marksbrain · November 26, 2007, 8:39pm

Mark W. wrote:

I’ve benchmarked this and ‘get_file_names2’ is considerably quicker!

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

Robert K. wrote:

In _2 you do a dir glob in one directory only while find recursively
descends a directory tree. Even though you prune in some cases it
comes as no surprise that this is slower.

Actually, I think that most of the gain in the final solution comes from
this:
a) in the original solution, the result array is built inserting
manually each file that matches the pattern in the array…
b) in the final solution (get_file_names2) not only Find is replaced by
Dir.glob, but the result array is built by grep.

This may sound surprising, but map (collect) and grep are highly
optimized C mechanisms; moreover they can do smart guesses on the size
of the result array (in particular, grep can inmediately conclude that
the size of the result array will be less than the array given to him).
In the manual solution instead, every item added to the array probably
requires some memory reallocation.

Let’s verify if this is true, benchmarking 3 variants:

get_file_names : Find + manual array build
get_file_names1: Dir.glob + manual array build
get_file_names2: Dir.glob + grep

uses Find + manual array build

def get_file_names
fn=[]
Find.find(’./’) do |path|
Find.prune if File.basename(path) == ‘sent’
curr_file = File.basename(path)
if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} .xls $}x
fn << curr_file
end
end
fn
end

uses Dir.glob + manual array build

def get_file_names1
all_files = Dir.glob("*")
my_files.each do |path|
curr_file = File.basename(path)
if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} .xls $}x
fn << curr_file
end
end
fn
end

uses Dir.glob + grep

def get_file_names2
all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} .xls $}x)
end

with 10 files matching the pattern

Benchmark.bm(5) do |timer|
timer.report(‘get_file_names’) {10000.times {get_file_names}}
timer.report(‘get_file_names1’) {10000.times {get_file_names}}
timer.report(‘get_file_names2’) {10000.times {get_file_names2}}
end

             user       system      total      real

get_file_names 4.680000 3.870000 8.550000 ( 8.567400)
get_file_names1 4.670000 3.870000 8.540000 ( 8.564753)
get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)

The 580% improvement in the final solution comes almost exclusively from
the grep.

I add this for Mark:
it is best in any case to avoid the unnecessary ‘Find’, for several
reasons; the first is that what you need is the simple Dir.glob.

But let me cite a more insidious problem in that code: the idea of
pruning explicitly ‘sent’. Even if you think that the only subdirectory
present will be ‘sent’, what happens if 8 months from now another person
(or yourself, after some more thousands lines of code!) needs to add
another subdir with files matching that pattern (perhaps just to save a
version of them)? will he remember the assumption done months before?

Moreover, as Stefano pointed out at the beginning of the thread, this
would cause subtle file corruptions, as the code is not set up to deal
with the path correctly (a testing nightmare would follow, until
somebody remembers the ‘assumption’; some people have seen this hundreds
of times…:-).

So, the best is not to code ‘assumptions’ in the code; and of course to
use the right tools for the problem at hand: in this case, Dir.glob and
the beautiful grep

Regards

Raul

marksbrain · November 26, 2007, 9:37pm

On 26.11.2007 20:39, Raul P. wrote:

comes as no surprise that this is slower.
of the result array (in particular, grep can inmediately conclude that

uses Find + manual array build

end
fn
end

uses Dir.glob + grep

def get_file_names2
all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} .xls $}x)
end

What about this one?

def get_file_names_3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)
pruning explicitly ‘sent’. Even if you think that the only subdirectory

So, the best is not to code ‘assumptions’ in the code; and of course to
use the right tools for the problem at hand: in this case, Dir.glob and
the beautiful grep

Good point! That’s basically the same reason why it’s better to
explicitely open ports in firewall configs than to close “dangerous”
ports.

Kind regards

robert

marksbrain · November 27, 2007, 8:43am

2007/11/27, Raul P. [email protected]:

I had not tested case 1 in the benchmark (I dropped digit ‘1’ in the

all_files = Dir.glob(“*”)
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} .xls $}x)
end

We completely forgot about Dir.entries()… That might be a tad
faster because supposedly no globbing is going on.

timer.report(‘get_file_names3’) {10_000.times {get_file_names3} }
contribution comes from removing Find (as Robert had guessed)!
theorized), which accounts only for 10% of the improvement; the other
90% comes from removing Find, as Robert had guessed.

It’s an old truth: access to external memory is always an order of
magnitude slower than access to main memory. Ergo, if you can reduce
the number of accesses to that memory you typically achieve big
improvements.

Last consideration: increasing the number of files, the weight of ‘grep’
in the improvement increases (but, enough of benchmarks for to-day :-).

Thanks for taking the time to go through this.

Kind regards

robert

marksbrain · November 27, 2007, 12:17am

Robert K. wrote:

What about this one?

def get_file_names_3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

Robert

I had not tested case 1 in the benchmark (I dropped digit ‘1’ in the
call); it turns out that it is the Find, as you had thought, to be the
one that causes most inefficiency; without repeating all the code, these
are the results, including also your last suggestion.

uses Find, and selects files manually

def get_file_names
fn=[]
…
fn
end

1) uses Dir.glob and builds array with a loop

def get_file_names1
fn=[ ]
all_files = Dir.glob("*")
…
fn
end

2) uses Dir.glob and grep

def get_file_names2
all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} .xls $}x)
end

3) variation of solution 2

def get_file_names3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

with 30 files

Benchmark.bm(5) do |timer|
timer.report(‘get_file_names’) {10_000.times {get_file_names} }
timer.report(‘get_file_names1’) {10_000.times {get_file_names1} }
timer.report(‘get_file_names2’) {10_000.times {get_file_names2} }
timer.report(‘get_file_names3’) {10_000.times {get_file_names3} }
end

           user        system      total       real

get_file_names 14.640000 9.080000 23.720000 ( 23.778029)
get_file_names1 1.690000 1.200000 2.890000 ( 2.903737)
get_file_names2 1.370000 1.210000 2.580000 ( 2.581539)
get_file_names3 1.430000 3.530000 4.960000 ( 4.968951)

Solution 2) is, as we saw before, the winner: 10 times faster than the
original solution. But the grep only improves things by 10%; 90% of the
contribution comes from removing Find (as Robert had guessed)!

Regarding the last one (called solution 3) from you:

Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]

This turned out to be 2 times slower than solution 2. Checking if
something is between 0-9 is apparently quite slower than checking for
‘digit’.

In conclusion:
Solution 2 is the fastest; but the reason is not the grep (as I had
theorized), which accounts only for 10% of the improvement; the other
90% comes from removing Find, as Robert had guessed.

Last consideration: increasing the number of files, the weight of ‘grep’
in the improvement increases (but, enough of benchmarks for to-day :-).

Regards

Raul

marksbrain · November 27, 2007, 9:39am

Robert K. wrote:

all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} .xls $}x)
end

We completely forgot about Dir.entries()… That might be a tad
faster because supposedly no globbing is going on.

The solution with Dir.entries is a bit faster:

def get_file_names5
all_files = Dir.entries(".")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} .xls $}x)
end

30 target files present

Benchmark.bm(5) do |timer|
timer.report(‘get_file_names2’) {10_000.times {get_file_names2} }
timer.report(‘get_file_names5’) {10_000.times {get_file_names5} }
end
user system total real
get_file_names2 1.410000 1.260000 2.670000 ( 2.675148)
get_file_names5 1.190000 1.260000 2.450000 ( 2.450649)

A 10% gain; not much, but the final winner of the marathon is: Nr 5!

Raul

marksbrain · November 27, 2007, 9:41am

Since all the matches are xsl files, the function can be faster using
a glob for “*.xsl”. Dir.entries is slower.

$ cat test.rb && ./test.rb
#!/usr/bin/env ruby

require ‘find’
require ‘fileutils’
require ‘benchmark’

def get_file_names
Dir.glob("*.xsl").select { | e |
e =~ /^(CAL|NCPH|GOH)\d{6}.xsl$/
}
end

def get_file_names1
Dir.glob("*.xsl").grep(/^(CAL|NCPH|GOH)\d{6}.xsl$/)
end

def get_file_names2
Dir.entries(’.’).select { | e |
e =~ /^(CAL|NCPH|GOH)\d{6}.xsl$/
}
end

n = 10000
Benchmark.bm(5) { |x|
x.report('get_file_names ') { n.times { get_file_names } }
x.report(‘get_file_names1’) { n.times { get_file_names1 } }
x.report(‘get_file_names1’) { n.times { get_file_names2 } }
}

       user     system      total        real

get_file_names 0.210000 0.330000 0.540000 ( 0.566366)
get_file_names1 0.270000 0.310000 0.580000 ( 0.602565)
get_file_names1 0.680000 0.310000 0.990000 ( 1.114583)

Regards,
Jordan

marksbrain · November 27, 2007, 10:55am

Excellent stuff everyone,
Gives me much to look into!

thanks,

On Tue, 27 Nov 2007 03:39:05 -0500