Memory allocation woes

Marcus_DSLeech · June 29, 2010, 3:16am

Spent some time tracking down a memory allocation issue. The SYSV shm
allocator was getting errors on a request for 1.56GB. Now, it turns out
that SYSV shm uses a signed 32-bit int for the size of the segment,
which means you can’t allocate segments of 2^31 bytes or more. But
why was the request for 1.56GB being blown away? Because the SYSV shm
allocator in GNU Radio multiplies the request size by 2 before asking
the system for that much shared memory.
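
To make the arithmetic concrete, here is a minimal sketch of the size check described above. The helper names `fits_in_int32` and `doubled_request` are made up for illustration and are not GNU Radio or SYSV functions; the 1,600,000,000-byte figure is the 1562500 KB request reported in the log later in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not GNU Radio code): true if a segment of `bytes`
// bytes can be described by a signed 32-bit size.
bool fits_in_int32(std::uint64_t bytes)
{
    return bytes <= static_cast<std::uint64_t>(
                        std::numeric_limits<std::int32_t>::max());
}

// Hypothetical helper: the size requested from the system when the
// allocator asks for twice the buffer size.
std::uint64_t doubled_request(std::uint64_t buffer_bytes)
{
    // Guard against wrap-around in the multiplication itself.
    assert(buffer_bytes <= std::numeric_limits<std::uint64_t>::max() / 2);
    return 2 * buffer_bytes;
}
```

With these, `fits_in_int32(1600000000ULL)` is true, but `fits_in_int32(doubled_request(1600000000ULL))` is false, because 3,200,000,000 exceeds 2^31 - 1 = 2,147,483,647.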

So, why does it do that? And why is the memory allocation so incredibly
piggish? We went through this question a couple of years ago, and I’m
running into similar problems again: my application uses HUGE FFTs, with
1Hz resolution at up to 16MHz (USRP1) or 25MHz (USRP2) bandwidth. This,
not surprisingly, leads to some large memory requirements, but GNU
Radio’s allocator seems to allocate significantly more than is really
needed.

In the case cited above, the FFT size was 6M bins, which granted is
outside the “usual” range of most
Gnu Radio applications, but I was able to make this work last year up
to 16M bins.

The flowgraph involved is quite simple. A source, a short FFT-based
filter, the main HUGE FFT, and then
a complex-to-mag**2, then a file sink.

Should this really require gigabytes of memory?

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · June 29, 2010, 11:26am

This is just an educated guess, so anyone, please correct me if I am
wrong:

GR tries to hide the fact that, even for a ring buffer, memory space is
always linear. Now let’s assume you try to do an FFT with overlap, with
an FFT size as large as your buffer.

The first time, your memory area might be at the beginning of the
buffer. The next time, it is shifted somewhat into the buffer, so the
last part of your data wraps around to the beginning of the buffer. To
get a linear buffer (with monotonically increasing addresses), the
physical memory pages backing your buffer are mapped into virtual memory
space twice, where the second mapping directly follows the first.

Physical memory usage is equal to the buffer size, whereas the mapping
takes up twice that much virtual address space. So for current
applications on today’s computers it is really easy to run out of
virtual memory space.
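
A minimal sketch of the double-mapping idea described above, assuming a POSIX system. This is not GNU Radio's implementation: the helper name `map_twice` and the use of an unlinked temporary file in /tmp as the backing store are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: maps the same `size` bytes twice, back to back, so
// that base[i] and base[i + size] refer to the same physical memory.
// `size` must be a non-zero multiple of the page size.
// Returns nullptr on failure.
char* map_twice(std::size_t size)
{
    assert(size > 0 &&
           size % static_cast<std::size_t>(sysconf(_SC_PAGESIZE)) == 0);

    // Backing store: a temporary file, unlinked right after creation.
    char path[] = "/tmp/vmcircbuf-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return nullptr;
    unlink(path);
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }

    // Reserve a contiguous hole of 2 * size in the address space.
    void* hole = mmap(nullptr, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (hole == MAP_FAILED) {
        close(fd);
        return nullptr;
    }
    char* base = static_cast<char*>(hole);

    // Map the file over both halves of the hole.
    void* first = mmap(base, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
    void* second = mmap(base + size, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);  // the mappings keep the backing store alive
    if (first == MAP_FAILED || second == MAP_FAILED) {
        munmap(base, 2 * size);
        return nullptr;
    }
    return base;
}
```

A write through the first mapping is visible through the second, so a reader can always see `size` contiguous bytes starting anywhere in the first half. The caller releases the region with `munmap(base, 2 * size)`.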

If your demands are really so high, why aren’t you using a 64-bit
machine? On top of the larger memory space, you get more registers and
guaranteed existence of SSE (only an issue if you use prebuilt
packages). The last time I used GR, it worked fine on 64-bit.

Stefan

Marcus_DSLeech · June 29, 2010, 3:51pm

On 06/29/2010 05:23 AM, Stefan Bruens wrote:

If your demands are really so high, why aren’t you using a 64-bit machine? On
top of the larger memory space, you get more registers and guaranteed existence
of SSE (only an issue if you use prebuilt packages). The last time I used GR,
it worked fine on 64-bit.

Stefan

I am using a 64-bit machine. The problem appears to be that the SYSV
shm interface uses
signed 32-bit values for the size of the desired segment, which tops
out around 2Gbyte.

–
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · June 29, 2010, 8:19pm

On 06/29/2010 01:37 PM, Eric B. wrote:

Add the debugging output and let us know what you find.

allocate_buffer complex_to_mag(1) nitems 4 item_size 25000000
gr_buffer::allocate_buffer: warning: tried to allocate
4 items of size 25000000. Due to alignment requirements
64 were allocated. If this isn’t OK, consider padding
your structure to a power-of-two bytes.
On this platform, our allocation granularity is 4096 bytes.
gr_vmcircbuf_sysv_shm: shmget (1): Invalid argument
gr_buffer::allocate_buffer: failed to allocate buffer of size 1562500 KB
terminate called after throwing an instance of ‘std::bad_alloc’
what(): std::bad_alloc

This is the complex_to_mag(1) right after the HUGE FFT.

That item size must be the size of the output vector (6M *
sizeof(float) + ‘slop’??).

Now, see what happens? It’s rounding up that 4 items to 64 items, which
is ballooning out
the request size considerably!!
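
The rounding in the log can be reproduced with a little arithmetic. The sketch below assumes the item count is rounded up so that the buffer is a whole number of 4096-byte granules, which is consistent with the warning text; the function names are made up for illustration and this is not the actual gr_buffer code.

```cpp
#include <cassert>
#include <cstdint>

// Greatest common divisor (Euclid's algorithm).
std::uint64_t gcd_u64(std::uint64_t a, std::uint64_t b)
{
    while (b != 0) {
        const std::uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical helper: smallest item count whose total size in bytes is a
// multiple of the allocation granularity.
std::uint64_t min_items(std::uint64_t item_size, std::uint64_t granularity)
{
    assert(item_size > 0 && granularity > 0);
    return granularity / gcd_u64(item_size, granularity);
}

// Hypothetical helper: rounds nitems up to a multiple of min_items().
std::uint64_t round_up_items(std::uint64_t nitems, std::uint64_t item_size,
                             std::uint64_t granularity)
{
    const std::uint64_t step = min_items(item_size, granularity);
    return ((nitems + step - 1) / step) * step;
}
```

For an item size of 25,000,000 bytes (2^6 * 5^8) and a granularity of 4096 (2^12), the common divisor is 64, so items come in multiples of 64. Four items round up to 64, and 64 * 25,000,000 bytes is 1,600,000,000 bytes, which is the 1562500 KB in the log.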

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · June 29, 2010, 7:39pm

On Mon, Jun 28, 2010 at 09:14:34PM -0400, Marcus D. Leech wrote:

In the case cited above, the FFT size was 6M bins, which granted is
outside the “usual” range of most
Gnu Radio applications, but I was able to make this work last year up
to 16M bins.

The flowgraph involved is quite simple. A source, a short FFT-based
filter, the main HUGE FFT, and then
a complex-to-mag**2, then a file sink.

Should this really require gigabytes of memory?

Maybe.

The actual amount allocated depends on a bunch of factors, including
the characteristics of the blocks upstream and downstream from the
buffer in question.

In general, there’s a factor of 2x to allow double buffering. When
using any of the vmcircbuf allocators, there is a temporary request
for twice the desired memory. This is to ensure that we get a big
enough hole to map what we really want (1/2 that much) into the VM
space twice, contiguously. This is to allow us to implement the
circular buffer by mapping the same memory into the VM twice, avoiding
the need to special case the boundary condition.

Do you know if it’s the buffer upstream or downstream from the FFT
that’s failing to be allocated? Is the block downstream from the FFT
a decimator? In that case, we allocate N (where N is the decimation
factor) * to ensure that
there’s enough input for the downstream block to generate a single
output on one call to work.

In general that’s not a problem, but in your case, this could be the
part that’s causing the large allocation. It could be worked around
by writing a version of the decimator that doesn’t require the factor
of N on its input. This requires more complicated state tracking in
the block (which is why we don’t normally do it), but is possible.

The place to add some debugging output would be
gr_flat_flowgraph.cc::allocate_buffer. Something like:

std::cout << "allocate_buffer " << block
          << " nitems " << nitems << " item_size " << item_size
          << std::endl;

For the case of the 6M bin FFT, I could see us getting to

6M * sizeof(gr_complex) * 2 * 2 --> 192MB. Not that big.

Add the debugging output and let us know what you find.

Also, you could try using an alternate vmcircbuf allocator,
gr_vmcircbuf_mmap_shm_open, by writing the string
“gr_vmcircbuf_mmap_shm_open_factory”
into ~/.gnuradio/prefs/gr_vmcircbuf_default_factory. Note that there
is no trailing newline in that file. The last byte of the file
should be “y”.

gr_vmcircbuf_mmap_shm_open uses an off_t which is 64-bits.
Details in gr_vmcircbuf_mmap_shm_open.cc
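
Since the file must not end in a newline, an editor that appends one silently may not do. A minimal sketch of writing the string exactly, keeping to C++ like the snippet above; the helper name `write_factory_pref` is made up for illustration, and the caller supplies the full path to the prefs file.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: writes `factory_name` to `path` with no trailing
// newline, replacing any previous contents. Returns true on success.
bool write_factory_pref(const std::string& path,
                        const std::string& factory_name)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out << factory_name;  // deliberately no '\n' or std::endl
    out.flush();
    return static_cast<bool>(out);
}
```

It would be called with the expanded path of ~/.gnuradio/prefs/gr_vmcircbuf_default_factory and the string "gr_vmcircbuf_mmap_shm_open_factory". The prefs directory has to exist already; the sketch does not create it.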

Eric

Marcus_DSLeech · June 30, 2010, 12:58am

On Tue, Jun 29, 2010 at 02:16:24PM -0400, Marcus D. Leech wrote:

6M * sizeof(gr_complex) * 2 * 2 --> 192MB. Not that big.
On this platform, our allocation granularity is 4096 bytes.
gr_vmcircbuf_sysv_shm: shmget (1): Invalid argument
gr_buffer::allocate_buffer: failed to allocate buffer of size 1562500 KB
terminate called after throwing an instance of ‘std::bad_alloc’
what(): std::bad_alloc

This is the complex_to_mag(1) right after the HUGE FFT.

That item size must be the size of the output vector (6M *
sizeof(float) + ‘slop’??).

There won’t be any slop in the item_size. That comes from the
io_signature. 25,000,000/sizeof(gr_complex) = 3,125,000. Does that
sound familiar?
It must be the number of bins in the FFT.

Now, see what happens? It’s rounding up that 4 items to 64 items, which
is ballooning out
the request size considerably!!

If you pick a size with more factors of 2 and fewer factors of 5, life
will get better.

I have on occasion thought that it would be a good idea to switch to
an alternate circular buffer strategy when the size blows up too much
because of the alignment requirement. The alternate would probably
use memcpy to duplicate the appropriate portion of the buffer on
return from general_work instead of the MMU trick. I probably won’t
get to it this lifetime, but if you’re interested, let me know and
I’ll give you my ideas on how to go about it.
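
One way the memcpy idea could look, as a rough sketch only. This is not Eric's design and not GNU Radio code: the class name `MirroredBuffer` and its interface are invented for illustration. It keeps two copies of the data in ordinary heap memory and re-synchronizes them after each write, so no page-granularity alignment is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical class: a circular buffer of `size` bytes stored twice in a
// row, so any `size` contiguous bytes can be read or written without
// special-casing the wrap-around.
class MirroredBuffer {
public:
    explicit MirroredBuffer(std::size_t size) : size_(size), data_(2 * size)
    {
        assert(size > 0);
    }

    // Where a producer may write up to size_ contiguous bytes.
    char* write_ptr(std::size_t index) { return data_.data() + index % size_; }

    // Where a consumer may read up to size_ contiguous bytes.
    const char* read_ptr(std::size_t index) const
    {
        return data_.data() + index % size_;
    }

    // Call after writing n bytes at write_ptr(index): copies the new bytes
    // into the other half so both halves hold the same data again.
    void commit(std::size_t index, std::size_t n)
    {
        assert(n <= size_);
        const std::size_t start = index % size_;
        // Bytes that landed in the first half are copied to the second.
        const std::size_t first = std::min(n, size_ - start);
        std::memcpy(&data_[start + size_], &data_[start], first);
        // Bytes that spilled into the second half are copied to the first.
        if (n > first)
            std::memcpy(&data_[0], &data_[size_], n - first);
    }

private:
    std::size_t size_;
    std::vector<char> data_;
};
```

The trade-off compared with the MMU trick is that this uses twice the physical memory and pays for a copy on every write, but the buffer can be any size, so 4 items of 25,000,000 bytes would not need to be rounded up to 64.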

Eric

Marcus_DSLeech · July 1, 2010, 9:51pm

On 06/29/2010 06:56 PM, Eric B. wrote:

I’ll give you my ideas on how to go about it.

Eric

OK, so I’m now seriously considering shifting the spectral processing
portion of my app out of GNU Radio because of the memsplosion issues,
so pointers to where to start looking into this would be of some
significant use!

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · July 1, 2010, 10:47pm

On Thu, Jul 01, 2010 at 03:21:44PM -0400, Marcus D. Leech wrote:

get to it this lifetime, but if you’re interested, let me know and
would be of some significant use!

OK, OK, OK. Some things to think about:

First, memory is very cheap. E.g., 4GB DDR3-1600 is $130.
ECC is a bit more, but not crazy expensive.

The effort to “fix” this is on the order of a few days. YMMV.

Using gr_vmcircbuf_mmap_shm_open removes the 32-bit buffer size
limitation.

The resident set size will not increase even though the amount of VM
that’s required explodes by a factor of 16. Another reason not to
worry.

Here’s where to start your study:

All the action is in gnuradio-core/src/lib/runtime.

See in particular:
gr_buffer.{h,cc}          // single writer, multiple reader FIFO
gr_block_executor.{h,cc}  // code that calls forecast & general_work
gr_flat_flowgraph.{h,cc}  // allocate_buffer

It’ll need good QA code of course.

Eric