Is high-speed sorting impossible with Ruby?

gaurav_c · November 28, 2011, 3:16pm

Douglas S. wrote in post #1034017:

a = Array.new(1e6+1) {|n| “”}

The above creates an Array of 1,000,001 slots. The block syntax lets
you
provide a default object dynamically. The first time you reference a
particular index in the array, the block is called to allocate the
default
object at that position.

Hi Doug,

Thanks for the explanation.
Didn’t understand the Array initialization part.
What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

gaurav_c · November 28, 2011, 3:35pm

On Nov 28, 2011, at 3:17 PM, Gaurav C. wrote:

object at that position.

Hi Doug,

Thanks for the explanation.
Didn’t understand the Array initialization part.
What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

The former always returns a new string instance (this is what you want);
the former creates one string and always returns a reference.

gaurav_c · November 28, 2011, 3:33pm

Ryan D. wrote in post #1034037:

On Nov 27, 2011, at 09:51 , Douglas S. wrote:

$stdin.gets
puts $stdin.readlines.map(&:to_i).sort

Neither solution seems fast enough to reach their arbritrary deadline of
5 seconds (with their data)… but it is good enough for me. I love the
fact that I can write a 1-2 liner and be done and onto the next problem
while others are still bit-twiddling in their lower level languages…

Hi Ryan,

Totally makes sense to move on instead of trying to fit into constraints
which are actually not meant for your choice of language. Also, could
you explain your code a bit (have just started using Ruby since few
days)?

$stdin.gets
puts $stdin.readlines.map(&:to_i).sort

It’s interesting to take all the input at once but I run the code above
and it seems like it never stops taking input?

gaurav_c · November 28, 2011, 3:51pm

On 28.11.2011 15:17, Gaurav C. wrote:

What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

The first one creates one array with 1e6+1 different empty strings.
The second one results in an array that holds 1e6+1 references to the
same empty string.

Note that the following code variants both work as expected:

a = Array.new(1e6+1) {""}

…
a[l.to_i] << l
…

a = Array.new(1e6+1, “”)

…
a[l.to_i] += l
…

but not this one:

a = Array.new(1e6+1, “”)

…
a[l.to_i] << l
…

String#+ creates a new string object while String#<< modifies the
current string object.

– Matthias

gaurav_c · November 28, 2011, 3:53pm

On Nov 28, 2011, at 3:33 PM, Gaurav C. wrote:

It’s interesting to take all the input at once but I run the code above
and it seems like it never stops taking input?

$stdin.readlines will return when it reads EOF; to my understanding,
that’s why Ryan said the input data needs to be ‘honest’ in his previous
post. On a Unix system you can test this by pressing Ctrl-D (to send EOF
to the input stream).

gaurav_c · November 28, 2011, 5:10pm

2011/11/28 Matthias W. removed_email_address@domain.invalid:

On 28.11.2011 15:17, Gaurav C. wrote:

What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

The first one creates one array with 1e6+1 different empty strings. The
second one results in an array that holds 1e6+1 references to the same
empty string.

Another way to visualize this:

irb(main):001:0> Array.new(4, “”).map(&:object_id)
=> [135176100, 135176100, 135176100, 135176100]
irb(main):002:0> Array.new(4, “”).map(&:object_id).uniq
=> [135107180]
irb(main):003:0> Array.new(4){“”}.map(&:object_id)
=> [134679150, 134679140, 134679110, 134679080]
irb(main):004:0> Array.new(4){“”}.map(&:object_id).uniq
=> [134358730, 134358720, 134358710, 134358670]

Kind regards

robert

gaurav_c · November 28, 2011, 11:58pm

On Nov 28, 2011, at 06:33 , Gaurav C. wrote:

$stdin.gets
puts $stdin.readlines.map(&:to_i).sort

It’s interesting to take all the input at once but I run the code above
and it seems like it never stops taking input?

It depends on how you’re running the code. I was doing it like:

blah.rb < inputfile

$stdin.readlines reads until input EOFs.

gaurav_c · November 29, 2011, 12:03am

On Nov 28, 2011, at 06:34 , Sylvester K. wrote:

provide a default object dynamically. The first time you reference a
Array.new(1e6+1, “”)?

The former always returns a new string instance (this is what you want); the
former creates one string and always returns a reference.

I don’t think this really explains it adequately. The former initializes
the array by running the block on every index. The latter assigns the
same object to every index.

From ri:

Array.new(size=0, obj=nil)
Array.new(array)
Array.new(size) {|index| block }

Returns a new array. In the first form, the new array is empty. In the
second
it is created with size copies of obj (that is, size
references to the same obj). The third form creates a copy of the array
passed as a parameter (the array is generated by calling to_ary on the
parameter). In the last form, an array of the given size is created.
Each
element in this array is calculated by passing the element’s index to
the
given block and storing the return value.

gaurav_c · November 29, 2011, 12:00am

On Nov 28, 2011, at 06:49 , Sylvester K. wrote:

$stdin.readlines will return when it reads EOF; to my understanding, that’s why
Ryan said the input data needs to be ‘honest’ in his previous post. On a Unix
system you can test this by pressing Ctrl-D (to send EOF to the input stream).

If it isn’t honest, you can still address this by using
Array#first(size). You’ll read in more than necessary, but it is still
worth the boost by only doing 1 IO.

gaurav_c · November 29, 2011, 12:22am

Just playing with the code a little, more education than speed.

On 28.11.2011 09:05, Ryan D. wrote:

$stdin.gets
puts $stdin.readlines.map(&:to_i).sort

Your solution could be written like

$stdin.gets

a = $stdin.readlines
b = a.map(&:to_i)
c = b.sort

puts c

while processing of a, b, and c is completely sequential. All three
objects created are arrays full of data: a contains all string lines
read in, b makes all these entries into integers, c is the same sorted.
Three large arrays. All sequentially processed.

A similar solution, at first glance only marginally different, is the
following:

$stdin.gets
puts $stdin.each_line.sort_by(&:to_i)

Besides the fact that this solution is slower than yours, it uses a neat
little technique of Ruby 1.9, Enumerators. Here, only a single big array
is created, which is the result array. IO#each_line() does not read in
the whole file at once but provides an Enumerator – iteration fiber –
that passes each read line to the .sort_by() call while they are read.
Only sort_by will exhaustively fetch the items and eventually sort them
based on the sorting criteria.

Driving pipelining with the Enumerators to the max, we need a filter
chain. Let’s set up a generic filter (which is still not there in
Enumerator?) similar to the one described ages ago in issue #707
[Feature #707: Documentation for Enumerator chaining - Ruby master - Ruby Issue Tracking System]

$stdin.gets
puts $stdin.each_line.filter1(&:to_i).sort

Let’s expand the last line into smaller chunks, similar to the
explanation of your code:

a = $stdin.each_line
b = a.filter1(&:to_i)
c = b.sort

puts c

Now, walking the code, nothing really happens until we hit b.sort – the
creation of the objects a and b will just setup the filter chain and not
yet pull any data from $stdin. Once called on an Enumerator, #sort
cannot perform its action and return even a single entry before knowing
all entries, so it will call up b.next until it receives a
»StopIteration« exception. But for b to return a new value, it first has
to apply .to_i to the next value on its input, which is again an
iterator, a. So next is the call to a.next which will return the first
line read from $stdin. Finally we can walk our way back to sort: The
vaue is returned to the filter so it can be converted (to_i) and
returned from b.next. Finally the number can take part in the sorting
process.

In theory, if all these operations were considered non-destructive,
.each_line, .filter1 and .sort could run in three different threads
asynchronously, with buffers in between, utilizing the whole CPU and
probably reducing the overall execution time. But this is not the case:
There must not be any read-ahead by any of these designated threads,
drastically limiting their achievable concurrency performance from the
beginning.

End of the Enumerator lesson.

– Matthias

gaurav_c · November 29, 2011, 7:56am

On 29.11.2011 02:05, Josh C. wrote:

little technique of Ruby 1.9, Enumerators.

$stdin is already enumerable, so each_line isn’t necessary

puts $stdin.drop(1).sort_by(&:to_i)

Great one-liner, thanks!

But in the context of my posting, it is not following the Enumerator
principle. $stdin.drop(1) creates a large array on its output which I
wanted to avoid.

What we need is an enumeratoresque version of Enumerable#drop, which I
call drop_first. It will only consume on its input

self.class.new do |y|
  with_index do |input,idx|
    y << input if idx >= n
  end
end
end
end

puts $stdin.to_enum(:each_line).drop_first(1).filter1(&:to_i).sort

Performance-wise not better at all, but a nice excursion.

– Matthias

gaurav_c · November 29, 2011, 2:05am

2011/11/28 Matthias W. removed_email_address@domain.invalid

A similar solution, at first glance only marginally different, is the
following:

$stdin.gets

puts $stdin.each_line.sort_by(&:to_**i)

Besides the fact that this solution is slower than yours, it uses a neat
little technique of Ruby 1.9, Enumerators.

$stdin is already enumerable, so each_line isn’t necessary

puts $stdin.drop(1).sort_by(&:to_i)

gaurav_c · November 29, 2011, 8:53am

On 29.11.2011 07:56, Matthias Wächter wrote:

What we need is an enumeratoresque version of Enumerable#drop, which I
call drop_first. It will only consume on its input

(oops, unfinished) … when something is requested on the output, and it
will drop the first ‘n’ items.

– Matthias

gaurav_c · November 29, 2011, 9:36am

2011/11/29 Matthias W. removed_email_address@domain.invalid:

But in the context of my posting, it is not following the Enumerator
principle. $stdin.drop(1) creates a large array on its output which I wanted
to avoid.

How do you avoid having an Array with all items when sorting? If you
want to do that you need more advanced sorting algorithms (external
sorting) where not all the data is in main memory all the time.

end
puts $stdin.to_enum(:each_line).drop_first(1).filter1(&:to_i).sort

Performance-wise not better at all, but a nice excursion.

Why so complicated? You can just do

module Enumerable
def drop1(n)
each_with_index {|x,i| yield x if i >= n}
end
end

Or, also callable without a block:

module Enumerable
def drop1(n)
if block_given?
each_with_index {|x,i| yield x if i >= n}
else
enum_for(:drop1, n)
end
end
end

irb(main):037:0> Source.new.drop1(2) {|i| printf “*block %2d\n”, i}
before 0
after 0
before 1
after 1
before 2
*block 2
after 2
before 3
*block 3
after 3
before 4
*block 4
after 4
=> #Source:0x102b4890

But you can achieve the same memory complexity with standard means:

puts $stdin.drop(1).map!(&:to_i).sort!

Here #drop creates the Array once and all other operations work on that:

irb(main):054:0> a=Array.new(10){rand 20}
=> [14, 6, 6, 0, 14, 16, 19, 10, 14, 0]
irb(main):055:0> a.object_id
=> 135532690
irb(main):056:0> a.drop(1).tap {|o| p o.object_id}.map!(&:to_i).tap
{|o| p o.object_id}.sort!.tap {|o| p o.object_id}
134354700
134354700
134354700
=> [0, 0, 6, 6, 10, 14, 14, 16, 19]

Kind regards

robert

gaurav_c · November 29, 2011, 10:49am

On Nov 29, 2011, at 12:02 AM, Ryan D. wrote:

Thanks for the explanation.
Didn’t understand the Array initialization part.
What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

The former always returns a new string instance (this is what you want); the
former creates one string and always returns a reference.

I don’t think this really explains it adequately.

You’re absolutely right; what I was describing was more like default
blocks on Hashes and not the behavior of Array.new.

gaurav_c · November 29, 2011, 9:43am

On Tue, Nov 29, 2011 at 9:35 AM, Robert K.
removed_email_address@domain.invalid wrote:

PS: Forgot to include class Source which I used.

class Source
include Enumerable

def each
5.times do |i|
printf “before %2d\n”, i
yield i
printf “after %2d\n”, i
end
end
end

gaurav_c · November 30, 2011, 9:00pm

What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

Okay. Thanks all who explained this.
But, again, why is it better to use the former in this case?

gaurav_c · December 1, 2011, 1:58am

On Wed, Nov 30, 2011 at 2:00 PM, Gaurav C. removed_email_address@domain.invalid
wrote:

What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

Okay. Thanks all who explained this.
But, again, why is it better to use the former in this case?

–
Posted via http://www.ruby-forum.com/.

As an analogy:

If Jerry, Bob, and sally live in different houses, and you blow Jerry’s
house up, then only Jerry is homeless.
If Jerry, Bob, and Sally live in the same house, and you blow Jerry’s
house
up, then Jerry, Bob, and Sally are all homeless.

As code:

each get’s their own house

houses = Array.new(3) { ‘’ }
houses.first << ‘kaboom’
houses # => [“kaboom”, “”, “”]

each lives in the same house

houses = Array.new 3, ‘’
houses.first << ‘kaboom’
houses # => [“kaboom”, “kaboom”, “kaboom”]

Something you should try yourself: Try initializing each index to a
random
value between 1 and 10 (you can get this number with rand(10)). What
behaviour explains the results you see? Can you write your own method
that
behaves the same way? (If you don’t know how to deal with the blocks,
try
watching Ruby Kickstart session 3 http://ruby-kickstart.com/#session3)

gaurav_c · December 29, 2011, 8:55am

-----Messaggio originale-----
Da: Sylvester K. [mailto:removed_email_address@domain.invalid]
Inviato: marted 29 novembre 2011 10:49
A: ruby-talk ML
Oggetto: Re: Is high-speed sorting impossible with Ruby?

On Nov 29, 2011, at 12:02 AM, Ryan D. wrote:

What’s the difference between Array.new(1e6+1) {|n| “”} and
Array.new(1e6+1, “”)?

The former always returns a new string instance (this is what you want);
the former creates one string and always returns a reference.

I don’t think this really explains it adequately.

You’re absolutely right; what I was describing was more like default
blocks
on Hashes and not the behavior of Array.new.

The former initializes the array by running the block on every index. The
latter assigns the same object to every index.
Returns a new array. In the first form, the new array is empty. In the
second it is created with size copies of obj (that is, size references
to the same obj). The third form creates a copy of the array passed as
a parameter (the array is generated by calling to_ary on the
parameter). In the last form, an array of the given size is created.
Each element in this array is calculated by passing the element’s
index to the given block and storing the return value.

–
Caselle da 1GB, trasmetti allegati fino a 3GB e in piu’ IMAP, POP3 e
SMTP autenticato? GRATIS solo con Email.it http://www.email.it/f

Sponsor:
Riccione Hotel 3 stelle in centro: Pacchetto Capodanno mezza pensione,
animazione bimbi, zona relax, parcheggio. Scopri l’offerta solo per
oggi…
Clicca qui: http://adv.email.it/cgi-bin/foclick.cgi?mid982&d)-12