Sorting a logfile, how would you write it?

dizzyone · August 10, 2007, 8:29pm

I’ve written a little ruby program which can sort logfiles with the
following format:

4.text text text
1.text text text
2.text text text
10.text text text
2.text2 text2 text2

The file is given as a command line parameter and after sorting the
entries it writes them back into this file.

The program is in the attachment.

What I want to know is how you would write such a tool in Ruby. I’m
asking because I’m still learning Ruby and I want to learn how to
do it in Ruby (and its design principles).

Thank you!

Turing

dizzyone · August 10, 2007, 10:55pm

On Aug 10, 1:29 pm, Frank M. wrote:

The file is given as a command line parameter and after sorting the
entries it writes them back into this file.

Attachments: http://www.ruby-forum.com/attachment/86/test.rb

File.open( ARGV.first, "r+" ){|file|
  array = file.readlines
  file.rewind
  file.truncate(0)
  file.puts array.sort_by{|s| s[/^\d+/].to_i }
}
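
To make the key expression above concrete, here is a small sketch run against the sample lines from the first post. The variable names are made up for illustration. Note that sort_by is not guaranteed to be stable, so the relative order of the two "2." lines is unspecified.

```ruby
# Sample lines from the original post.
lines = [
  "4.text text text",
  "1.text text text",
  "2.text text text",
  "10.text text text",
  "2.text2 text2 text2",
]

# Plain string comparison puts "10." before "2." ...
lexical = lines.sort

# ... while converting the leading digits to an Integer gives numeric order.
numeric = lines.sort_by { |s| s[/^\d+/].to_i }

puts lexical
puts "--"
puts numeric
```

A line with no leading digits yields nil from the regex, and nil.to_i is 0, so such lines would sort to the front rather than raise an error.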

dizzyone · August 11, 2007, 2:20pm

On 11.08.2007 06:19, Ryan D. wrote:

It’s a one liner:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

Less memory usage:

ruby -i.bak -e 'puts ARGF.readlines.sort! {|a,b| a[/^\d+/].to_i <=> b[/^\d+/].to_i}' file

Kind regards

robert

dizzyone · August 11, 2007, 6:21am

On Aug 10, 2007, at 13:54 , William J. wrote:

File.open( ARGV.first, "r+" ){|file|
  array = file.readlines
  file.rewind
  file.truncate(0)
  file.puts array.sort_by{|s| s[/^\d+/].to_i }
}

Your version takes a lot of memory, is slow, and doesn’t properly
sort the content of the line, just the number. swap the two “2.”
lines and you’ll see what I mean. Using the right tool for the job
(sort) does wonders:

% ruby -e 'n = 1_000_000; File.open("blah.txt", "w") { |f| n.times { m = rand 5; f.puts "#{rand n}. file#{m} file#{m} file#{m}" } }'
% cp blah.txt blah2.txt
% time ruby -e 'File.open( ARGV.first, "r+" ) { |file| array = file.readlines; file.rewind; file.truncate(0); file.puts array.sort_by {|s| s[/^\d+/].to_i } }' blah.txt
real 0m8.182s …
% time ruby -e 'path = ARGV.shift; system %(sort -n "#{path}" > "#{path}.tmp"); File.rename "#{path}.tmp", path' blah2.txt
real 0m3.175s …
% cmp blah.txt blah2.txt
blah.txt blah2.txt differ: char 50, line 3
% head blah.txt blah2.txt
==> blah.txt <==
3. file4 file4 file4
4. file4 file4 file4
6. file3 file3 file3
6. file1 file1 file1
6. file0 file0 file0
7. file0 file0 file0
7. file4 file4 file4
8. file1 file1 file1
8. file3 file3 file3
8. file3 file3 file3

==> blah2.txt <==
3. file4 file4 file4
4. file4 file4 file4
6. file0 file0 file0
6. file1 file1 file1
6. file3 file3 file3
7. file0 file0 file0
7. file4 file4 file4
8. file1 file1 file1
8. file3 file3 file3
8. file3 file3 file3
532 %
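
For readers who want the tie-breaking seen in blah2.txt without leaving Ruby, one possible sketch is to sort on a two-element key: the leading number first, then the whole line. The helper name is hypothetical, and this only approximates what sort -n does for equal numbers (the external tool's exact tie-break depends on locale).

```ruby
# Hypothetical helper: order by the leading number, then by the full line text.
def sort_numeric_then_text(lines)
  # Arrays compare element by element, so the line text only matters
  # when two lines share the same leading number.
  lines.sort_by { |line| [line[/^\d+/].to_i, line] }
end

sample = [
  "6. file3 file3 file3",
  "6. file1 file1 file1",
  "6. file0 file0 file0",
  "3. file4 file4 file4",
]

puts sort_numeric_then_text(sample)
```

Because the secondary key makes every comparison decisive for distinct lines, the result no longer depends on whether the underlying sort is stable.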

dizzyone · August 12, 2007, 12:31am

On Aug 11, 7:15 am, Robert K. wrote:

It’s a one liner:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

It’s my understanding that when you use -i, a temporary file
is created, the original file is deleted, and the temporary
file is renamed. Doesn’t this cause unnecessary disk
fragmentation?

Less memory usage:

ruby -i.bak -e 'puts ARGF.readlines.sort! {|a,b| a[/^\d+/].to_i <=> b[/^\d+/].to_i}' file

Of course, you’re trading speed for memory.
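
One way to see the trade-off mentioned here is to count how often the key is extracted. The sketch below is illustrative only (the sample data and counter are made up): sort_by computes each key once but holds the keys alongside the lines while sorting, whereas sort with a block keeps no keys and recomputes two of them for every comparison.

```ruby
lines = ["4.a", "1.b", "2.c", "10.d", "2.e", "7.f", "3.g", "9.h"]

key_calls = 0
key = lambda do |s|
  key_calls += 1          # count every key extraction
  s[/^\d+/].to_i
end

# sort_by: one key computation per line.
lines.sort_by { |l| key.call(l) }
sort_by_calls = key_calls

# sort with a block: two key computations per comparison.
key_calls = 0
lines.sort { |a, b| key.call(a) <=> key.call(b) }
sort_calls = key_calls

puts "sort_by: #{sort_by_calls} key computations"
puts "sort:    #{sort_calls} key computations"
```

How much this matters in wall-clock time or memory depends on the file size and the Ruby version, so it is worth measuring on real data before drawing conclusions.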

dizzyone · August 12, 2007, 1:16am

--- William J. wrote:

It’s my understanding that when you use -i, a temporary file
is created, the original file is deleted, and the temporary
file is renamed. Doesn’t this cause unnecessary disk
fragmentation?

To do this safely you’ll need a temporary file.
Slurping a file into memory, sorting it, then writing it back to the
same file is an unsound practice, i.e. not “rerunnable-safe”. Suppose,
for example, you suffer a power failure half-way through writing back
the file, or the write fails due to “disk full” or “user disk quota
exceeded” or for any other reason. Oops, you’ve just corrupted your
input file. Worse, if this program is run automatically or by a naive
user you risk permanent data loss without knowing about it, for when
the program is re-run after the write failure, it will use as its input
file the now corrupted file and appear to “work”.

The simple remedy is to first write to a temporary file; once you’re
sure the temporary file has been written without error (and after the
permissions on the temporary are updated to match the original) you
then (atomically) rename the temporary file to the original. In that
way, if writing the new file is interrupted for any reason, you can
simply re-run the program without losing any data. Though you could
roll all this code for yourself, the -i switch is more convenient, and
safer than rewriting a file in place.
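
For anyone who does want to roll it by hand, the steps described above could look roughly like the following sketch. The helper name and the temporary file naming scheme are assumptions made for illustration; this is not what the -i switch does internally.

```ruby
# Hypothetical helper: sort a file by its leading number without ever
# leaving a half-written original behind.
def sort_file_safely(path)
  sorted = File.readlines(path).sort_by { |line| line[/^\d+/].to_i }
  mode = File.stat(path).mode & 0o7777        # permission bits of the original
  tmp_path = "#{path}.tmp.#{Process.pid}"     # same directory, so rename stays on one filesystem
  created = false

  begin
    # EXCL makes the open fail instead of clobbering an existing file.
    File.open(tmp_path, File::WRONLY | File::CREAT | File::EXCL, mode) do |tmp|
      created = true
      tmp.write(sorted.join)
      tmp.flush
      tmp.fsync                               # push the data to disk before renaming
    end
    File.chmod(mode, tmp_path)                # the umask may have masked some bits
    File.rename(tmp_path, path)               # replaces the original in one step
  ensure
    # After a successful rename the temporary name no longer exists.
    File.unlink(tmp_path) if created && File.exist?(tmp_path)
  end
end
```

It would be called as sort_file_safely(ARGV.first). If anything fails before the rename, the original file is untouched and the program can simply be run again.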

Cheers,
/-\

dizzyone · August 12, 2007, 1:54am

On Aug 11, 2007, at 15:25, William J. wrote:

Wrong.

This method uses at least 2x the file size worth of memory. That’s a
lot.

E:\Ruby>ruby -e 'path = ARGV.shift; system %(sort -n "#{path}" > "#{path}.tmp"); File.rename "#{path}.tmp", path' data
-e:1: unterminated string meets end of file

I just ran it, it worked fine.

You’ll probably have to redo the quoting for a non-bourne-compatible
shell.

Perhaps your attempt at a solution requires Unix, and you,
in your ignorance, or your thoughtlessness, or your
ignorance and your thoughtlessness, assumed that every
user of Ruby is a user of Unix.

Please try to flame harder. This one just made me chuckle.

dizzyone · August 12, 2007, 12:26am

On Aug 10, 11:19 pm, Ryan D. wrote:

your version takes a lot of memory,

Wrong.

When the number of lines to sort is small,
it uses a small amount of memory.
When the number of lines to sort is medium,
it uses a medium amount of memory.
When the number of lines to sort is large,
it uses a large amount of memory.

is slow,

Everything is relative. If its speed is compared to the
speed of other versions written in scripting languages, it
is not slow.

and doesn't properly sort the content of the line,

Wrong.

Looking at the source code of the original poster immediately
reveals that he wants to sort only on the number at the
beginning of the line.

real 0m8.182s …
% time ruby -e 'path = ARGV.shift; system %(sort -n "#{path}" > "#{path}.tmp"); File.rename "#{path}.tmp", path' blah2.txt

Wrong.

The original poster stated:

The file is given as a command line parameter and after sorting the
entries it writes them back into this file.

Your code makes no attempt to write to the original file; it uses
a temporary file.

Furthermore, your solution won’t even run:

E:\Ruby>ruby -e 'path = ARGV.shift; system %(sort -n "#{path}" > "#{path}.tmp"); File.rename "#{path}.tmp", path' data
-e:1: unterminated string meets end of file

If your code is put in a file …

E:\Ruby>ruby try.rb data
Input file specified two times.

… it still won’t work.

Perhaps your attempt at a solution requires Unix, and you,
in your ignorance, or your thoughtlessness, or your
ignorance and your thoughtlessness, assumed that every
user of Ruby is a user of Unix.

dizzyone · August 12, 2007, 5:45am

On 8/11/07, Eric H. wrote:

Perhaps your attempt at a solution requires Unix, and you,
in your ignorance, or your thoughtlessness, or your
ignorance and your thoughtlessness, assumed that every
user of Ruby is a user of Unix.

Please try to flame harder. This one just made me chuckle.

Me too. Besides, sort is still the right tool:

http://www.mingw.org/msys.shtml

dizzyone · August 12, 2007, 2:02am

On Aug 11, 2007, at 15:30, William J. wrote:

On Aug 11, 7:15 am, Robert K. wrote:

It’s a one liner:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

It’s my understanding that when you use -i, a temporary file
is created, the original file is deleted, and the temporary
file is renamed. Doesn’t this cause unnecessary disk
fragmentation?

If I had a filesystem where I had to worry about fragmentation I
wouldn’t care. The amount of time spent figuring out some best way
to “fix” it is going to be less than the time running a defragmenter
will take.

dizzyone · August 12, 2007, 9:11am

On Aug 11, 6:16 pm, Andrew S. wrote:

example, you suffer a power failure half-way through writing back the file,
or the write fails due to “disk full” or “user disk quota exceeded” or for
any other reason. Oops, you’ve just corrupted your input file.

Of course. But I’m willing to take that minuscule chance when
I’m doing a write to a small file that takes a fraction of a
second.

The question remains: doesn’t using a temp file cause more
disk fragmentation than writing directly to the original file?

dizzyone · August 12, 2007, 10:06am

--- William J. wrote:

Of course. But I’m willing to take that minuscule chance when
I’m doing a write to a small file that takes a fraction of a
second.

That may be an acceptable risk for a program written for private use.
Not so for a production program. After all, impatient users often
press [CTRL-C] in my experience, and that could cause corruption
if it occurred while the file was being rewritten.

Certainly, such a program would fail code review at my company for
anything we send to our clients.

The question remains: doesn’t using a temp file cause more
disk fragmentation than writing directly to the original file?

Probably … though correctness should come before performance
(I’ll refrain from quoting Donald Knuth at you ;-).
And, in this case, any performance degradation with modern file
systems should be negligible.

Cheers,
/-\

dizzyone · August 12, 2007, 2:06pm

Thanks for all your suggestions; they helped me a lot to learn more about
Ruby’s library. I didn’t know that there were so many handy functions.

And about the temporary file, I’m using it only for private purposes and
I didn’t want to bother with creating a temporary file in my first
attempt to write a ruby program which can sort these log files.

Thank you all!

Turing

dizzyone · August 12, 2007, 4:03pm

On 8/12/07, Andrew S. wrote:

--- William J. wrote:

Of course. But I’m willing to take that minuscule chance when
I’m doing a write to a small file that takes a fraction of a
second.

That may be an acceptable risk for a program written for private use.
Not so for a production program. After all, impatient users often
press [CTRL-C] in my experience, and that could cause corruption
if it occurred while the file was being rewritten.

You can of course capture that, but you’re right that it’s creating
additional unnecessary work.
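
As a rough illustration of what capturing Ctrl-C might involve, the sketch below postpones SIGINT until a block has finished and then re-raises it. The helper name is hypothetical. It does nothing against power failures, full disks, or a hard kill, so it is not a substitute for the temporary-file approach.

```ruby
# Hypothetical helper: postpone Ctrl-C (SIGINT) until the block has finished.
def without_interrupt
  interrupted = false
  # Remember the old handler and only record that Ctrl-C was pressed.
  previous = Signal.trap("INT") { interrupted = true }
  begin
    yield
  ensure
    # Put the old handler back even if the block raised.
    Signal.trap("INT", previous || "DEFAULT")
  end
  # Deliver the postponed interrupt now that the critical section is over.
  raise Interrupt if interrupted
end
```

A write-back could then be wrapped as without_interrupt { File.write(path, sorted.join) }, where path and sorted stand for the file name and the sorted lines.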

dizzyone · August 12, 2007, 10:31am

On 12.08.2007 00:27, William J. wrote:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

It’s my understanding that when you use -i, a temporary file
is created, the original file is deleted, and the temporary
file is renamed.

Correct.

Doesn’t this cause unnecessary disk
fragmentation?

Huh? Are you still on MS DOS? I haven’t heard someone worry about disk
fragmentation in ages. I don’t think that this is an issue for any
modern file system.

Less memory usage:

ruby -i.bak -e 'puts ARGF.readlines.sort! {|a,b| a[/^\d+/].to_i <=> b[/^\d+/].to_i}' file

Of course, you’re trading speed for memory.

Where exactly do you see that trade-off? I was trading elegance for
memory. Sure, there are effects that could make one or the other
solution faster, but if I were really worried about speed then I’d
use “sort” anyway.

Kind regards

robert