High Flowgraph Latency in 3.6.4.1

I thought I would join the list of people emailing the list this week about obscure issues with ancient versions of GNU Radio.

I have a flow graph that needs to operate at a variety of data rates, from a few kbps to about 1 Mbps. Due to the potential for very wide frequency errors, we still have to sample at >1 Msps and decimate for the lower bit rates.

Toward the end of the receive chain, there is a multitude of blocks used for Viterbi node synchronization. I’ve found that the number of blocks in series (3-5), combined with the low data rates at this point in the flowgraph, leads to latencies on the order of 1-2 minutes. That is to say, once node synchronization is accomplished, it takes 1-2 minutes to flush these blocks and get the fresh, good data through. This is measured with function probes on the state of the sync process, and BERT analysis of the demodulator output [through a TCP/IP socket].

  • Unfortunately, upgrading to 3.7.x isn’t a viable option in the near term. But have there been any fundamental changes to the scheduler that might avoid this problem?
  • I’ve tried messing around with the output buffer size option in the flowgraph, but this seems to have a negligible impact.
  • I can write some custom blocks to reduce the overall block count, but I have demonstrated that this provides a linear improvement, rather than the two-orders-of-magnitude improvement I need.

Any general advice anyone can offer? It feels like the right solution is to force small buffer sizes on the relevant blocks…

-John

Hi John,
On 10.10.2014 19:33, John M. wrote:

Toward the end of the receive chain, there is a multitude of blocks used for Viterbi node synchronization. I’ve found that the number of blocks in series (3-5), combined with the low data rates at this point in the flowgraph, leads to latencies on the order of 1-2 minutes. That is to say, once node synchronization is accomplished, it takes 1-2 minutes to flush these blocks and get the fresh, good data through. This is measured with function probes on the state of the sync process, and BERT analysis of the demodulator output [through a TCP/IP socket].
I see you found the hidden interplanetary signal delay simulator.
Congrats! Watch out for the red shift in downstream samples.

No, seriously, that sounds like a lot.
You are using 3.6.4.1 with the default scheduler, TPB (thread-per-block)?

  • I’ve tried messing around with the output buffer size option in the flowgraph, but this seems to have a negligible impact.
    That surprises me. How did you mess around? top_block->run(1024)? Do your blocks really get called with smaller input item sizes? (Do a little printf-debugging.)
  • I can write some custom blocks to reduce the overall block count, but I have demonstrated that this provides a linear improvement, rather than the two-orders-of-magnitude improvement I need.

Any general advice anyone can offer? It feels like the right solution is to force small buffer sizes on the relevant blocks…
Agreed. But still, that sounds bad. Are you sure none of the blocks demands a large input/output multiple?

Greetings,
Marcus

Default scheduler.

tb.start(1024), with different values, etc, etc.

Most of the downstream blocks are stock GNU Radio blocks - a delay block
(max delay is 1 sample), logical operators, etc. I guess I’ll add some
printf debugging?
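
A minimal sketch of such a printf-debugging probe (3.7-style Python block syntax; the block and its name are made up for illustration, not something from the actual flowgraph):

    import numpy as np
    from gnuradio import gr

    class work_size_probe(gr.sync_block):
        """Pass samples through unchanged and print each work() call's size."""
        def __init__(self, label="probe"):
            gr.sync_block.__init__(self, name="work_size_probe",
                                   in_sig=[np.float32], out_sig=[np.float32])
            self.label = label

        def work(self, input_items, output_items):
            n = len(input_items[0])
            print("%s: work() got %d items" % (self.label, n))
            output_items[0][:n] = input_items[0][:n]
            return n

Dropped in line just ahead of one of the suspect blocks, it shows whether tb.start(1024) / tb.run(1024) actually shrinks the per-call item counts.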

-John

On Fri, Oct 10, 2014 at 11:07 AM, Marcus Müller wrote:

I ran into this problem when doing 57.6 kbps BPSK AX.25 decoding. The only way I was able to fix it was to reduce GR_FIXED_BUFFER_SIZE in flat_flowgraph.cc. This is regarded as a dodgy hack by all the GR developers here, but it worked for me (and I read the article on latency). I believe the guy who wrote GMSKSpacecraftGroundstation had the same problem; I found this in one of his old threads.

Hey John,

I am way out of my depth here, but while working on a custom Python block the other day, I saw some similar weird behavior in 3.7.5. Setting the global max output had no effect, but setting the just-upstream block(s) min/max output buffer size(s) low fixed my apparent slowness.
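
A minimal sketch of that kind of per-block cap (GNU Radio 3.7 API; the blocks below are placeholders, not the actual flowgraph):

    from gnuradio import gr, blocks

    tb = gr.top_block()
    src = blocks.null_source(gr.sizeof_gr_complex)
    head = blocks.head(gr.sizeof_gr_complex, 1000000)
    sink = blocks.null_sink(gr.sizeof_gr_complex)

    # Cap the output buffer of the block feeding the slow part of the chain.
    # This must be set before the flowgraph is started, since the buffers
    # are allocated at start time.
    head.set_max_output_buffer(4096)

    tb.connect(src, head, sink)
    tb.run()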

Very Respectfully,

Dan CaJacob

On Fri, Oct 10, 2014 at 2:14 PM, John M. wrote:

Yep, that was me. I was just about to pipe in with the same suggestion.

@(^.^)@ Ed

On 10/10/14 9:15 PM, John M. wrote:

Sounds dangerous if you also happen to have very high throughput
applications to run? Am I wrong?

Yes, it was a fine line between getting it small enough for acceptable
latency while still maintaining enough throughput to not overrun.
Fortunately for our application we were able to strike that balance.

Your mileage may vary.

@(^.^)@ Ed

Sounds dangerous if you also happen to have very high throughput
applications to run? Am I wrong?

On Fri, Oct 10, 2014 at 5:59 PM, Ed Criscuolo wrote:

Is there something I can force on a per-block basis in 3.6? Is there some trickery in the forecast function I might use?

So… there were actually several contributors to this long latency. Some were related to GNU Radio’s inner workings, some were external. With all of the “external” things removed, there was still 1+ minute of latency at low bit rates.

I thought I would share my findings, for the sake of getting this experience publicly archived (and maybe someone can build on these insights?).

First, a little background on the application. The use case here is that there is some moderately complex processing that isn’t data-specific per se, but the quality of the demodulated data is evaluated on a periodic basis to determine alignment through multiple, parallel FEC decoding chains. After alignment is detected, an appropriate chain is selected, and downstream delays are adjusted. Here are the specific things that were observed:

  1. We gave up on the stock selector block early on. It was misaligning samples, which affected a downstream interleaving process. What are the general thoughts/feelings on how the selector block is currently implemented?

  2. We tried several “clever” (ha) methods to select the desired stream. Most of them revolved around the concept of summing/XORing streams after multiplying or AND’ing them according to which stream we wanted (operand = 1 if we wanted the stream, 0 if not). Through various methods we found that anything with a constant source, and_const, etc. created these very long latencies. The behavior was this:

After the streams were selected and the long latency expired, the data would be valid, and latency would be quite good for such a low data rate. This seemed to indicate that something in the blocks used to select the stream was causing a delay - i.e. inserting a lot of samples with the previously evaluated results, where things would be misaligned. Either that, or the callbacks to set the new operands for stream selection were not “executing” right away. The specific offenders seem to be (1) constant source, (2) and_const, and (3) multiply_const. I haven’t figured out why yet…

The end solution was to write a dumb selector block that just copied the specified input to the output. Derp…
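
A minimal sketch of such a pass-through selector (illustrative only, written as a 3.7-style Python block; not the exact block from this flowgraph):

    import numpy as np
    from gnuradio import gr

    class dumb_selector(gr.sync_block):
        """Copy the chosen input stream to the output; ignore the others."""
        def __init__(self, num_inputs, which=0):
            gr.sync_block.__init__(self, name="dumb_selector",
                                   in_sig=[np.float32] * num_inputs,
                                   out_sig=[np.float32])
            self.which = which

        def set_which(self, which):
            # Callback fired when alignment is detected on another chain.
            self.which = which

        def work(self, input_items, output_items):
            n = len(output_items[0])
            output_items[0][:] = input_items[self.which][:n]
            return n
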
-John

On Fri, Oct 10, 2014 at 7:09 PM, John M. wrote:

Matt,

BTW - in this particular case, this was all in the receive direction - it’s a low-rate demodulator. So I was a bit surprised to see the issue at all, ’cause I’ve never seen such high latency on a receive flowgraph. We have a “closed-loop” approach to managing transmit latency through the USRP, without timestamps or relying on a source stream for throttling. But I do think the advice you provide is useful for others who are trying to solve transmitter latency by shrinking buffers.

Per the email above - it turned out that buffer sizes were not an issue. Something else weird was happening - see (2) in my previous email if you’re interested.

Thanks for following up on this thread.

-John

We see this issue a lot with applications that only transmit, and which transmit continuously. The problem is that you end up generating samples far in advance of when you really know what you want to transmit, because there is no rate-limiting on the production side.

Some general principles: large buffers allow you to deal with high latency. Large buffers do not create high latency unless the application is not designed properly. A properly designed application will work with infinitely large buffers as well as it does with minimally sized ones.

Shrinking buffers may allow your application to work, but that isn’t really the best way to solve this problem. The best way to solve the problem is to modify your head-end source block to understand wall-clock time. The easiest way to do that, if you are using a USRP, is to instantiate a UHD source (i.e. a receiver) at a relatively low sample rate and feed it into the source you have created.

Your source block should then look at the timestamps on the incoming samples (it can throw the samples themselves away). It should generate only enough samples to cover the maximum latency you want, and it should timestamp those transmit samples. For example, if it receives samples timestamped with T1, it should generate samples with timestamps from T1+L1 to T1+L1+L2, where L1 is the worst-case flowgraph and device latency, and L2 is the worst-case reaction time you are looking for. Thus, if you suddenly get a message from your operator to send something, you know that you will never need to wait more than L2 seconds. That bounds your worst-case reaction time.
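
As a rough sketch of that pattern (not a finished example: it assumes equal TX and RX sample rates, recovers wall-clock time from the rx_time tags the UHD source attaches, and leaves the actual payload generation as a stub), such a pacing source could look like this:

    import numpy as np
    import pmt
    from gnuradio import gr

    class paced_source(gr.sync_block):
        """Generate TX samples only as fast as RX samples arrive,
        timestamped L1 + L2 seconds ahead of RX wall-clock time."""
        def __init__(self, samp_rate, l1=0.050, l2=0.100):
            gr.sync_block.__init__(self, name="paced_source",
                                   in_sig=[np.complex64],   # pacing RX stream (samples discarded)
                                   out_sig=[np.complex64])  # TX samples for the UHD sink
            self.samp_rate = samp_rate
            self.l1 = l1              # worst-case flowgraph + device latency (s)
            self.l2 = l2              # worst-case reaction time we accept (s)
            self.time_ref = None      # (time in s, absolute offset) of last rx_time tag
            self.tagged = False

        def work(self, input_items, output_items):
            n = len(input_items[0])
            start = self.nitems_read(0)
            # Track wall-clock time via the rx_time tags from the UHD source.
            for tag in self.get_tags_in_window(0, 0, n, pmt.intern("rx_time")):
                secs = pmt.to_uint64(pmt.tuple_ref(tag.value, 0))
                frac = pmt.to_double(pmt.tuple_ref(tag.value, 1))
                self.time_ref = (secs + frac, tag.offset)
            if self.time_ref is None:
                return 0  # no time reference yet, produce nothing
            # T1 = wall-clock time of the first RX item in this call.
            t1 = self.time_ref[0] + (start - self.time_ref[1]) / self.samp_rate
            # Payload generation is a stub; fill in whatever should actually
            # be transmitted for the interval starting at T1 + L1 + L2.
            output_items[0][:n] = 0
            if not self.tagged:
                # Timestamp the first transmitted sample; because production
                # is paced 1:1 with wall clock, the stream then stays a fixed
                # L1 + L2 ahead, bounding the reaction time.
                tx_time = t1 + self.l1 + self.l2
                val = pmt.make_tuple(pmt.from_uint64(int(tx_time)),
                                     pmt.from_double(tx_time - int(tx_time)))
                self.add_item_tag(0, self.nitems_written(0),
                                  pmt.intern("tx_time"), val)
                self.tagged = True
            return n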

It should be noted that in two-way applications like GSM or LTE, you would never run into these problems; they are naturally avoided because you won’t generate samples until you’ve seen what you receive. It is only an issue in TX-only apps.

I think we should generate an example app to do this, because the issue comes up periodically, especially among the space communications crowd. It is a design pattern we really should document.

Matt

On 17.10.2014 18:58, John M. wrote:

  2. We tried several “clever” (ha) methods to select the desired stream. Most of them revolved around the concept of summing/XORing streams after multiplying or AND’ing them according to which stream we wanted (operand = 1 if we wanted the stream, 0 if not). Through various methods we found that anything with a constant source, and_const, etc. created these very long latencies. The behavior was this:

multiply_matrix might do what you want here. There’s an example in
gr-blocks that shows how to use that block for selecting.
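
Something along these lines (an untested sketch; it assumes the multiply_matrix_ff block and its set_A() callback as shipped with gr-blocks in 3.7):

    from gnuradio import gr, blocks, analog

    tb = gr.top_block()
    in0 = analog.sig_source_f(32000, analog.GR_COS_WAVE, 1000, 1.0)
    in1 = analog.sig_source_f(32000, analog.GR_SAW_WAVE, 500, 1.0)
    # 1x2 matrix: out = 1*in0 + 0*in1, i.e. "select input 0".
    sel = blocks.multiply_matrix_ff([[1.0, 0.0]])
    sink = blocks.null_sink(gr.sizeof_float)

    tb.connect(in0, (sel, 0))
    tb.connect(in1, (sel, 1))
    tb.connect(sel, sink)

    # Later, when the other chain aligns, switch the selection:
    # sel.set_A([[0.0, 1.0]])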

M

Hello,

I’m resurrecting this old thread because I’m almost exactly in the situation described below by Matt, but I’m having a hard time coming up with a solution.

I’m working with an N210. In my application I generate a modulation signal that is sent to a system, and the response of the system is demodulated and processed. The modulation signal is adjusted according to the result of this processing.

Because of the modulation-demodulation scheme, I’d like to keep a constant phase relation between the TX and RX channels. For this purpose I use set_start_time() calls for the TX and RX channels. The result of the signal processing is communicated to the modulation source via asynchronous messages. The sampling rate for TX and RX is 1.0 MHz.

What I observe is a ~250 ms delay between the asynchronous messages and changes in the modulation signal. This delay scales linearly with the TX sampling rate, so I conclude it is due to buffering on the TX side.
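
(For scale: 0.25 s at 1.0 Msps is 250,000 complex samples, i.e. roughly 2 MB of complex floats queued somewhere in the TX chain between the message handler and the device.)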

I haven’t managed to find a way to route any signal from the RX side to the TX side without causing buffer underruns, I think because the start times of the TX and RX sides are fixed.

Does anyone have an idea on how to solve this problem?

On 17/10/14 19:16, Matt E. wrote:

timestamp those transmit samples. For example, if it receives samples
timestamped with T1, it should generate samples with timestamps from
T1+L1 to T1+L1+L2, where L1 is the worst-case flowgraph and device
latency, and L2 is the worst case reaction time you are looking for.
Thus, if you suddenly get a message from your operator to send a
message, you know that you will never need to wait for more than L2
seconds. Thus, you can bound your worst case reaction time.

I think we should generate an example app to do this, because the issue
comes up periodically, especially among the space communications crowd.
It is a design pattern we really should document.

Any news about this example app?

Thank you! Cheers,
Daniele

Yes, there were multiple issues in the thread, and what I said only applies to TX. Certainly on the receive side there should be no latency issues, as every block should do all the computations it can on whatever data it receives - since that is exactly what is tripping up the TX guys :-)

Matt
