Premature Optimization

Hi everybody,

I’m writing a fairly open-ended question. I’m hoping for suggestions,
opinions, advice. Suppose I have n arrays, each of which has m
entries. m is a fairly large integer, on the order of 10,000. Each
entry is either 1 or 0.

The first task I need to accomplish is figuring out how many times a 1
occurs at the ith position across the arrays. So for concreteness, if I
had arrays:

first = [1,0,0,0,0]
second = [1,1,0,0,0]
third = [0,0,0,1,0]

I would end up with
count = [2,1,0,1,0]

I’m just trying to give the general flavor of what I’m working on. I
know I can use some simple each_with_index loops to increment
count[index], something along the lines of:

 count = Array.new(m, 0)
 [first, second, third].each do |array|
   array.each_with_index do |item, index|
     count[index] += item
   end
 end
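(I suppose the same count can be had without the explicit index
bookkeeping, by transposing the list of arrays and summing each column.
This is just a sketch - I have no idea whether it is actually any
faster than the loop:)

```ruby
first  = [1, 0, 0, 0, 0]
second = [1, 1, 0, 0, 0]
third  = [0, 0, 0, 1, 0]

# transpose groups the ith entries of all the arrays into one column,
# and each column is then summed independently
count = [first, second, third].transpose.map do |column|
  column.inject(0) { |sum, item| sum + item }
end
# => [2, 1, 0, 1, 0]
```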

There are going to be something like m * n * (a couple of constant
multipliers for the looping) object allocations and method calls. m is
fairly large, and I have other, similar tasks to accomplish with this
data. The faster I can process this data, the more data I can process
in a given amount of time, and the more accurate the analysis will be.

Is there a slick way to do this with unpacking and packing? Or some
other way to do this with strings? Any modules or libraries I should
look into? I’m fairly new to Ruby, though not to scientific
computation. I realize I won’t be able to do this any faster than with
m * n * (a constant multiplier), but I need that constant multiplier to
be as small as possible. Any advice would be appreciated.

Thanks!
'cid 'ooh

[email protected] wrote:

Hi everybody,

I’m writing a fairly open-ended question. I’m hoping for suggestions,
opinions, advice. Suppose I have n arrays, each of which has m
entries. m is a fairly large integer, on the order of 10,000. Each
entry is either 1 or 0.

Is there a slick way to do this with unpacking and packing? Or some
other way to do this with strings? Any modules or libraries I should
look into? I’m fairly new to Ruby, though not to scientific
computation. I realize I won’t be able to do this any faster than with
m * n * (a constant multiplier), but I need that constant multiplier to
be as small as possible. Any advice would be appreciated.

I would look for an implementation of a BitSet, as this data structure
seems crucial for your application. You could even create one yourself -
it’s not too hard.
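As a rough sketch of what I mean - Ruby’s arbitrary-size Integers can
already serve as a poor man’s bit set, since Integer#[] reads individual
bits. This is only one way to do it (a dedicated BitSet class would be
cleaner and could count faster):

```ruby
rows = [[1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 1, 0]]

# pack each row into a single Integer, one bit per entry
masks = rows.map do |row|
  row.each_with_index.inject(0) { |mask, (bit, i)| mask | (bit << i) }
end

# count how many masks have the bit at position i set
m = rows.first.size
count = (0...m).map { |i| masks.count { |mask| mask[i] == 1 } }
# => [2, 1, 0, 1, 0]
```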

Kind regards

robert

On 10/25/06, [email protected] [email protected] wrote:

occurs in the ith entry in an array. So for concreteness, if I had arrays:

 NArray.byte(5):

count[index] += item

What’s it going to take to get this into the stdlib? It is awesome.

On Thu, 26 Oct 2006 [email protected] wrote:

first = [1,0,0,0,0]
second = [1,1,0,0,0]
third = [0,0,0,1,0]

I would end up with
count = [2,1,0,1,0]

 harp:~ > cat a.rb
 require 'narray'

 first = NArray.to_na [1,0,0,0,0]
 second = NArray.to_na [1,1,0,0,0]
 third = NArray.to_na [0,0,0,1,0]

 count = first.eq(1) + second.eq(1) + third.eq(1)

 p count


 harp:~ > ruby a.rb
 NArray.byte(5):
 [ 2, 1, 0, 1, 0 ]

There are going to be m * 3n * (two constant multipliers for the
looping) object allocations and method calls. m is fairly large, and I
have other, similar, tasks to accomplish with this data. The faster I
can process this data, the more data I can process in a given amount of
time, and the more accurate the analysis will be.

i use narray on huge (> 1gb) datasets all the time. it’s blindingly
fast.
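if you’ve got n of these rather than three, the pairwise additions fold
up naturally with inject. here’s the shape of it in plain ruby (with
narray the inner zip/map collapses into a single fast elementwise `+`):

```ruby
arrays = [[1, 0, 0, 0, 0],
          [1, 1, 0, 0, 0],
          [0, 0, 0, 1, 0]]

# fold elementwise addition across all n arrays
count = arrays.inject do |acc, a|
  acc.zip(a).map { |x, y| x + y }
end
# => [2, 1, 0, 1, 0]
```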

regards.

-a

On Thu, 26 Oct 2006, Wilson B. wrote:

What’s it going to take to get this into the stdlib? It is awesome.

it sure is. i’ve been campaigning for years to no avail… maybe it’s
time to bug matz again? it even compiles on windows: both it and the
gsl can be found precompiled here:

http://codeforpeople.com/lib/ruby/rb-gsl-win/

regards.

-a

Someone already mentioned the NArray way; if you treat the entire
thing as a set of matrix operations, you can make it insanely fast in
either NArray or GSL. But don’t treat it as “N” arrays with “M”
entries – treat it as a matrix, and then all of your operations can be
done in the external (very fast) code:

m = GSL::Matrix.new([1, 0, 0, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 0])

You can then either cheat, and “flatten” the matrix into a vector and
find the sum:

puts m.to_v.sum

(which works since they’re ones)

Or pretend you’re in Matlab and do it as a set of matrix operations:

colones = GSL::Matrix.new(1, m.size[1])
colones[0].set_all(1)

rowones = GSL::Matrix.new(1, m.size[0])
rowones[0].set_all(1)

puts ((colones * m.transpose) * rowones.transpose)[0,0]
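If what you actually want is the per-position count array rather than
the grand total, the same ones-vector trick gives you that too: a row of
ones (one per matrix row) times the matrix yields the column sums.
Here’s the idea sketched in plain Ruby so it runs without GSL installed
(with GSL the multiplication happens in the fast external code):

```ruby
m = [[1, 0, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [0, 0, 0, 1, 0]]

ones = Array.new(m.size, 1)  # row vector of ones, one per matrix row

# (1 x rows) * (rows x cols) = (1 x cols): each entry is a column sum
count = (0...m.first.size).map do |j|
  (0...m.size).inject(0) { |sum, i| sum + ones[i] * m[i][j] }
end
# => [2, 1, 0, 1, 0]
```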

-Dave

[email protected] wrote:

Hi everybody,

I’m writing a fairly open-ended question. I’m hoping for suggestions,
opinions, advice.

Thanks for all your replies. I’ve worked with the GNU Scientific
Library (and ATLAS, too) before, but I didn’t realize there were Ruby
bindings for it. I settled on Ruby/GSL with a custom-compiled ATLAS
backend (instead of the plain BLAS that GSL uses by default). It’s
screaming fast.

'cid 'ooh

Tweaking a bit and benchmarking with 100 rows of 10,000 elements each

                             user     system      total        real
 Create Matrix            3.940000   0.090000   4.030000 (  4.037665)
 Add Up Matrix            0.710000   0.000000   0.710000 (  0.711897)
 Create GSL::Matrix       3.400000   0.050000   3.450000 (  3.496827)
 Add GSL::Matrix          0.080000   0.030000   0.110000 (  0.111982)
 Matrix Add GSL::Matrix   0.020000   0.000000   0.020000 (  0.017661)

With the “Matrix Add” done as:

 (@m * colones.transpose).to_v.sum

(to avoid @m.transpose, which creates a copy of all of @m).

Have fun. Note that there are faster ways to initialize a GSL matrix
(or an NArray) from an external source, if you end up concerned about
that source of overhead.

-Dave