Good or best way to allocate a large array

Newbie here:

I will want/need to create an array of approximately 300K rows and 20
columns. The array will be an array of either 32-bit quantities or
64-bit quantities.

Does Ruby support native 64-bit quantities?

What’s a good/best way to to create such an array. Is there a way to
tell Ruby, “OK, I’m done with this array.”

Can one create such an array such that the garbage collector can
free the memory in one fast operation?

Ralph

Ralph S. wrote:

Newbie here:

I will want/need to create an array of approximately 300K rows and 20
columns. The array will be an array of either 32-bit quantities or
64-bit quantities.

Does Ruby support native 64-bit quantities?

What’s a good/best way to to create such an array. Is there a way to
tell Ruby, “OK, I’m done with this array.”

Can one create such an array such that the garbage collector can
free the memory in one fast operation?

Ralph

I don’t know when arrays stop being as performant as one could wish. For
a problem of this size, it seems to me that one may want instead to use
a database, such as sqlite, or a stronger beast? I expect it’ll make
your code more readable, too.
Is there a particular reason to not make the jump to a database?

Ralph S. wrote:

Newbie here:

I will want/need to create an array of approximately 300K rows and 20
columns. The array will be an array of either 32-bit quantities or
64-bit quantities.

Dude, you want a database. There’s little reason in any language to
have an array that big.

Does Ruby support native 64-bit quantities?

I’m not sure, but with transparent Bignum support, it doesn’t matter
that much.

What’s a good/best way to to create such an array. Is there a way to
tell Ruby, “OK, I’m done with this array.”

Can one create such an array such that the garbage collector can
free the memory in one fast operation?

The answer to both questions: as far as I know, it will be deallocated
when it goes out of scope.

Ralph

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Marnen,

I am a Ruby newbie but as far as I know, memory gets freed when the
garbage collector gets around to doing it.

Ralph

MLK> The answer to both questions: as far as I know, it will be
deallocated
MLK> when it goes out of scope.

Marnen,

Unless “scope” means something different in Ruby than it does in other
languages, it would appear that the garbage collector cannot possibly
free memory when something goes out of scope but, instead, when the
UseCount goes to zero.

Ralph

Monday, November 9, 2009, 12:31:21 PM, you wrote:

MLK> Ralph S. wrote:

Marnen,

I am a Ruby newbie but as far as I know, memory gets freed when the
garbage collector gets around to doing it.

MLK> Right. And as far as I know, it will do that when the object goes
out
MLK> of scope.

Ralph

MLK> The answer to both questions: as far as I know, it will be
deallocated
MLK> when it goes out of scope.
MLK> Best,
MLK> –
MLK> Marnen Laibow-Koser
MLK> http://www.marnen.org
MLK> [email protected]

Ralph S. wrote:

Marnen,

Unless “scope” means something different in Ruby than it does in other
languages, it would appear that the garbage collector cannot possibly
free memory when something goes out of scope but, instead, when the
UseCount goes to zero.

Except that uses aren’t counted in Ruby. Ruby has a mark-and-sweep
garbage collector. When Ruby runs out of memory a garbage collection run
is triggered. If there aren’t any references to the objects in question
left, they aren’t reachable from the root set and thus not marked in the
first phase of GC, in the second phase they will then (too be
conservative: most likely) be freed.

Ralph S. wrote:

Marnen,

I am a Ruby newbie but as far as I know, memory gets freed when the
garbage collector gets around to doing it.

Right. And as far as I know, it will do that when the object goes out
of scope.

Ralph

MLK> The answer to both questions: as far as I know, it will be
deallocated
MLK> when it goes out of scope.
Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Robert,

Doesn’t that just move things into a BiNum … which is not native 64?

Ralph

Monday, November 9, 2009, 2:35:26 PM, you wrote:

Does Ruby support native 64-bit quantities?

RK> Yes. You can try it out

RK> ruby1.9 -e ‘100.times {|i| x = 1 << i;p x, x.class}’

On 11/09/2009 06:25 PM, Ralph S. wrote:

Newbie here:

I will want/need to create an array of approximately 300K rows and 20
columns. The array will be an array of either 32-bit quantities or 64-bit quantities.

So we are talking about 300,000 * 20 * 8 ~ 48MB which does not sound too
large.

robert@fussel:~$ time ruby1.9 -e ‘x=1<<63;Array.new(300_000*20) {x}’

real 0m1.922s
user 0m1.288s
sys 0m0.032s
robert@fussel:~$ time ruby1.9 -e ‘x=1<<63;Array.new(300_000*20) {1<<63}’

real 0m7.143s
user 0m6.668s
sys 0m0.204s

Does Ruby support native 64-bit quantities?

Yes. You can try it out

ruby1.9 -e ‘100.times {|i| x = 1 << i;p x, x.class}’

What’s a good/best way to to create such an array. Is there a way to
tell Ruby, “OK, I’m done with this array.”

No need to: if there are no more live references to that Array it will
get collected eventually.

Can one create such an array such that the garbage collector can
free the memory in one fast operation?

Probably not because of the number of objects in the Array (if they are
all Fixnums it will probably a bit faster.

The question is, what do you do with that huge Array? Chances are that
you need to allocate more memory during processing. What’s your use
case? A database might be an option or a flat file - it all depends.

Kind regards

robert

Florian,

If there is no use count, how does Ruby know that there are no
references to the object that should be freed?

Ralph

Monday, November 9, 2009, 3:02:37 PM, you wrote:

FF> Ralph S. wrote:

Marnen,

Unless “scope” means something different in Ruby than it does in other
languages, it would appear that the garbage collector cannot possibly
free memory when something goes out of scope but, instead, when the
UseCount goes to zero.

FF> Except that uses aren’t counted in Ruby. Ruby has a mark-and-sweep
FF> garbage collector. When Ruby runs out of memory a garbage collection
run
FF> is triggered. If there aren’t any references to the objects in
question
FF> left, they aren’t reachable from the root set and thus not marked in
the
FF> first phase of GC, in the second phase they will then (too be
FF> conservative: most likely) be freed.

On Mon, Nov 9, 2009 at 5:46 PM, Ralph S. [email protected] wrote:

Florian,

If there is no use count, how does Ruby know that there are no
references to the object that should be freed?

As Florian said, MRI ruby uses a mark-sweep GC algorithm, when it
needs to GC it traces out references from a set of root objects, and
marks any objects which are reachable recursively.

Then it frees any objects which aren’t marked.

Reference counting might seem simpler, but it is expensive because the
count needs to be maintained everytime a reference changes, and it
can’t reclaim cyclic garbage which can’t be reached from a root:

   a -> b -> c
   ^            |
   *------------+

Mark/Sweep is a fairly primitive GC technique, it was probably the
second technique applied in the long history of GC. There are more
recent techniques such as generation scavenging which makes use of the
observance that most objects in a uniformly object-oriented language
like Ruby either live a very short, or a reasonably long life time
with the preponderance being short-lived. Generation scavenging
collects the short-lived objects very efficiently, and typically uses
mark-sweep less frequently to clean up the older ones.


Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Rick,

Ok … but the point is that the objects are not gc’d when things go
out of scope but, instead, when Ruby needs memory, right?

And if you have a huge array with gobs of objects, the gc is gonna
take a while?

Ralph

Monday, November 9, 2009, 4:06:23 PM, you wrote:

RD> On Mon, Nov 9, 2009 at 5:46 PM, Ralph S. [email protected]
wrote:

Florian,

If there is no use count, how does Ruby know that there are no
references to the object that should be freed?

RD> As Florian said, MRI ruby uses a mark-sweep GC algorithm, when it
RD> needs to GC it traces out references from a set of root objects, and
RD> marks any objects which are reachable recursively.

RD> Then it frees any objects which aren’t marked.

RD> Reference counting might seem simpler, but it is expensive because
the
RD> count needs to be maintained everytime a reference changes, and it
RD> can’t reclaim cyclic garbage which can’t be reached from a root:

RD> a → b → c
RD> ^ |
RD> *------------+

RD> Mark/Sweep is a fairly primitive GC technique, it was probably the
RD> second technique applied in the long history of GC. There are more
RD> recent techniques such as generation scavenging which makes use of
the
RD> observance that most objects in a uniformly object-oriented language
RD> like Ruby either live a very short, or a reasonably long life time
RD> with the preponderance being short-lived. Generation scavenging
RD> collects the short-lived objects very efficiently, and typically
uses
RD> mark-sweep less frequently to clean up the older ones.

Ralph S. wrote:

Rick,

Ok … but the point is that the objects are not gc’d when things go
out of scope but, instead, when Ruby needs memory, right?

And if you have a huge array with gobs of objects, the gc is gonna
take a while?

Perhaps. But there’s no reason to do this. Use a database. Even a
file-based DB like SQLite will make your life easier.

Ralph

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Hi,

You can manually kickstart the GC by using GC.start.

module GC - RDoc Documentation

Otherwise, AFAIK, the MRI GCs before breaking certain memory barriers,
but
don’t quote me on that. (I think it does that by counting mallocs and
starting the GC once a certain number is hit)

This actually explains this behavior:

$ time ruby19 -e ‘x=1<<63;Array.new(300_000*20) {1<<63}’
real 0m4.091s
user 0m3.430s
sys 0m0.361s

$ time ruby19 -e ‘GC.disable; x=1<<63;Array.new(300_000*20) {1<<63}’

real 0m2.489s
user 0m1.845s
sys 0m0.565s

But be aware though, that this is not caused by the allocation
of the array:

$ time ruby19 -e ‘x=1<<63;Array.new(300_000*20) {x}’

real 0m0.846s
user 0m0.735s
sys 0m0.049s

$ time ruby19 -e ‘GC.disable; x=1<<63;Array.new(300_000*20) {x}’

real 0m0.778s
user 0m0.710s
sys 0m0.052s

So, before (knowingly) breaking those limits, it might be an option
to disable the GC. Handle with care, though.

Regards,
Florian

Florian,

Again, I am a newbie but …

What I think you are saying is that the difference between:

$ time ruby19 -e ‘x=1<<63;Array.new(300_000*20) {1<<63}’
real 0m4.091s

and

$ time ruby19 -e ‘x=1<<63;Array.new(300_000*20) {x}’
real 0m0.846s

is that the first initializes 300_00020 BigNums and the second
initializes 300_000
20 references to a BigNum.

Ralph

Monday, November 9, 2009, 7:32:06 PM, you wrote:

FG> Hi,

FG> You can manually kickstart the GC by using GC.start.

FG> module GC - RDoc Documentation

FG> Otherwise, AFAIK, the MRI GCs before breaking certain memory
barriers,
FG> but
FG> don’t quote me on that. (I think it does that by counting mallocs
and
FG> starting the GC once a certain number is hit)

FG> This actually explains this behavior:

FG> $ time ruby19 -e ‘x=1<<63;Array.new(300_000*20) {1<<63}’
FG> real 0m4.091s
FG> user 0m3.430s
FG> sys 0m0.361s

FG> $ time ruby19 -e ‘GC.disable; x=1<<63;Array.new(300_000*20)
{1<<63}’

FG> real 0m2.489s
FG> user 0m1.845s
FG> sys 0m0.565s

FG> But be aware though, that this is not caused by the allocation
FG> of the array:

FG> $ time ruby19 -e ‘x=1<<63;Array.new(300_000*20) {x}’

FG> real 0m0.846s
FG> user 0m0.735s
FG> sys 0m0.049s

FG> $ time ruby19 -e ‘GC.disable; x=1<<63;Array.new(300_000*20) {x}’

FG> real 0m0.778s
FG> user 0m0.710s
FG> sys 0m0.052s

FG> So, before (knowingly) breaking those limits, it might be an option
FG> to disable the GC. Handle with care, though.

FG> Regards,
FG> Florian

FG> On Nov 9, 2009, at 9:05 PM, Ralph S. wrote:

Rick,

Ok … but the point is that the objects are not gc’d when things go
out of scope but, instead, when Ruby needs memory, right?

And if you have a huge array with gobs of objects, the gc is gonna
take a while?

Please do not top post.

2009/11/9 Ralph S. [email protected]:

Doesn’t that just move things into a BiNum … which is not native 64?

Well, define “native”. I chose to interpret it as “does Ruby come
with built in, C implemented support for 64 integers” - which it does.
:slight_smile:

Cheers

robert

Ralph S. wrote:

I will want/need to create an array of approximately 300K rows and 20
columns. The array will be an array of either 32-bit quantities or
64-bit quantities.

Does Ruby support native 64-bit quantities?

If you want to hold this all in RAM, then I think your most
space-efficient way would be to pack each row of 20 columns into a
String (of 80 or 160 bytes).

You can use Array#pack and String#unpack to convert when required.

val1 = 0x0102030405060708
val2 = 0xf0e0d0c0b0a09080
row = [val1, val2].pack(“Q*”)
p row.unpack(“Q*”)

Q* is for unsigned 64-bit. Use q* for signed 64, L* or l* for
unsigned/signed 32 bit. With 32-bit there are also options for using a
fixed byte ordering (little or big endian) rather than native byte
ordering.

Warning: take care if using ruby 1.9, because 1.9 Strings are of
“characters” and not necessarily of “bytes”. This may become critical if
you decide to write these structures out to disk and read them back in
again. Details at

Ralph,

Well, that’s a minor point. I just reused Roberts examples because
I expected that to be rather clear.

The thing is that an Array of size (300_000*20) already has as many
slots for references. So it’s not “initializing references”. It just
makes every field of the Array referencing that one BigNum. There is
no allocation or initialization happening in this step.

As you correctly stated, the first example allocs (300_000*20 + 1)
objects. The interesting point in this is that it triggers the
GC more than once, although there is nothing to collect.

Regards,
Florian

On Nov 10, 2009, at 4:30 AM, Ralph S. wrote:

and

FG> Otherwise, AFAIK, the MRI GCs before breaking certain memory
FG> user 0m3.430s
FG> of the array:
FG> real 0m0.778s

Ralph


Best regards,
Ralph mailto:[email protected]


Florian G.

smtp: [email protected]
jabber: [email protected]
gpg: 533148E2

2009/11/10 Florian G. [email protected]:

The thing is that an Array of size (300_000*20) already has as many
slots for references. So it’s not “initializing references”. It just
makes every field of the Array referencing that one BigNum. There is
no allocation or initialization happening in this step.

Right. Only the single BigNum and the Array need to be allocated.

As you correctly stated, the first example allocs (300_000*20 + 1)
objects. The interesting point in this is that it triggers the
GC more than once, although there is nothing to collect.

Well, but this is expected: the second example has more need for
memory (every time a new BigNum must be created), and if memory is
needed it is reasonable to first start GC before asking the OS for
more memory. GC cannot know that it will not find anything
collectible.

Kind regards

robert

Florian,

FG> The thing is that an Array of size (300_000*20) already has as many
FG> slots for references. So it’s not “initializing references”. It just
FG> makes every field of the Array referencing that one BigNum. There is
FG> no allocation or initialization happening in this step.

There has to be initialization, doesn’t there?

irb(main):001:0> a=Array.new(10)
=> [nil, nil, nil, nil, nil, nil, nil, nil, nil, nil]
irb(main):002:0> a[5]
=> nil