Segfault in NGINX - testcase

casper_the_ghost · March 19, 2010, 9:20pm

Hello,

NGINX will segfault, if you delete the root node in an rbtree with
two elements. Testcase:
http://doppelbauer.name/nginx-testcase.c

A patch is attached below.

Thanks a lot
Markus Doppelbauer

*** a/core/ngx_rbtree.c 2010-03-19 21:07:14.342916081 +0100
— b/core/ngx_rbtree.c 2010-03-19 21:07:56.642187263 +0100
*************** ngx_rbtree_delete(ngx_thread_volatile ng
*** 187,193 ****
if (subst == *root) {
root = temp;
ngx_rbt_black(temp);
!
/ DEBUG stuff */
node->left = NULL;
node->right = NULL;
— 187,194 ----
if (subst == *root) {
root = temp;
ngx_rbt_black(temp);
! temp->parent = NULL; /
parent of root node should be NULL /
!
/ DEBUG stuff */
node->left = NULL;
node->right = NULL;

Posted at Nginx Forum:

casper_the_ghost · March 20, 2010, 2:01am

Hello!

On Fri, Mar 19, 2010 at 04:19:43PM -0400, double wrote:

Hello,

NGINX will segfault, if you delete the root node in an rbtree with
two elements. Testcase:
http://doppelbauer.name/nginx-testcase.c

A patch is attached below.

From what I see it’s your function ngx_rbtree_next() which
segfaults, not nginx. It tries to identify root node via checking
it’s parent against NULL - which is not correct way to do it, you
should instead compare node pointer against tree root pointer.

If you do still think that nginx is affected - please provide
another test case. The one which uses nginx, not code borrowed
from it, is preffered - see nginx-tests: log for
some framework and test samples.

Maxim D.

casper_the_ghost · March 20, 2010, 9:44am

Hello,

The root node of the nginx-rbtree is always NULL, because
“ngx_rbtree_insert()” provides that feature. Only if the rbtree
has 2 elements and you delete the root node, then “parent” of the
root-node points to the deleted element. Why not fixing it?

Thank you very much
Marcus

The corresponding code:

void ngx_rbtree_insert(ngx_thread_volatile ngx_rbtree_t *tree,
ngx_rbtree_node_t *node)
{
[…]
if (*root == sentinel) {
node->parent = NULL;
node->left = sentinel;
node->right = sentinel;
ngx_rbt_black(node);
*root = node;

Posted at Nginx Forum:

casper_the_ghost · March 20, 2010, 9:49pm

Hi Maxim,

We want the same.
Thanks for pointing out.

Marcus

Posted at Nginx Forum:

casper_the_ghost · March 20, 2010, 9:30pm

Hello!

On Sat, Mar 20, 2010 at 04:43:56AM -0400, double wrote:

Hello,

The root node of the nginx-rbtree is always NULL, because
“ngx_rbtree_insert()” provides that feature. Only if the rbtree
has 2 elements and you delete the root node, then “parent” of the
root-node points to the deleted element. Why not fixing it?

As you already pointed out, there is at least one place where
root->parent becomes non-NULL (and I’m not sure it’s the only
place where it happens). And this doesn’t cause any harm as nginx
doesn’t assume it should be NULL.

While I tend to think that it’s good idea to keep it NULL at least
with NGX_DEBUG defined (to simplify debugging) - there is no bug
here. The bug is in your tree traversal code which tries to use
assumption that root->parent == NULL. And even if your patch will
be applied (it’s up to Igor anyway) - your tree traversal code
should be fixed if you are planning to use it somewhere in
production.

Maxim D.

casper_the_ghost · March 21, 2010, 3:22am

Hi,

With a standard (i.e. no options) installation of 0.8.34 on my Linux
machine I get a segfault if the resolver named in the conf file is a
loopback address/IP, but the resolver does not exist. Other
non-existing resolvers don’t cause a problem (they just hang, and
probably will time out), only loopback ones.

The debug log is :

2010/03/21 04:07:30 [debug] 4287#0: *3 http cl:-1 max:1048576
2010/03/21 04:07:30 [debug] 4287#0: *3 generic phase: 2
2010/03/21 04:07:30 [debug] 4287#0: *3 post rewrite phase: 3
2010/03/21 04:07:30 [debug] 4287#0: *3 generic phase: 4
2010/03/21 04:07:30 [debug] 4287#0: *3 generic phase: 5
2010/03/21 04:07:30 [debug] 4287#0: *3 access phase: 6
2010/03/21 04:07:30 [debug] 4287#0: *3 access phase: 7
2010/03/21 04:07:30 [debug] 4287#0: *3 post access phase: 8
2010/03/21 04:07:30 [debug] 4287#0: *3 http script var: “http”
2010/03/21 04:07:30 [debug] 4287#0: *3 http script copy: “://”
2010/03/21 04:07:30 [debug] 4287#0: *3 http script var: “google.com”
2010/03/21 04:07:30 [debug] 4287#0: *3 http script var: “/”
2010/03/21 04:07:30 [debug] 4287#0: *3 http init upstream, client timer:
0
2010/03/21 04:07:30 [debug] 4287#0: *3 epoll add event: fd:3 op:3
ev:80000005
2010/03/21 04:07:30 [debug] 4287#0: *3 posix_memalign:
0000000001A44DD0:4096 @16
2010/03/21 04:07:30 [debug] 4287#0: *3 http script copy: "Host: "
2010/03/21 04:07:30 [debug] 4287#0: *3 http script var: “google.com”
2010/03/21 04:07:30 [debug] 4287#0: *3 http script copy: "
"
2010/03/21 04:07:30 [debug] 4287#0: *3 http script copy: "Connection:
close
"
2010/03/21 04:07:30 [debug] 4287#0: *3 http proxy header: “User-Agent:
curl/7.19.5 (x86_64-pc-linux-gnu) libcurl/7.19.5 OpenSSL/0.9.8g
zlib/1.2.3.3 libidn/1.15”
2010/03/21 04:07:30 [debug] 4287#0: *3 http proxy header: “Accept: /”
2010/03/21 04:07:30 [debug] 4287#0: *3 http proxy header:
“Proxy-Connection: Keep-Alive”
2010/03/21 04:07:30 [debug] 4287#0: *3 http proxy header:
"GET / HTTP/1.0
Host: google.com
Connection: close
User-Agent: curl/7.19.5 (x86_64-pc-linux-gnu) libcurl/7.19.5
OpenSSL/0.9.8g zlib/1.2.3.3 libidn/1.15
Accept: /
Proxy-Connection: Keep-Alive

"
2010/03/21 04:07:30 [debug] 4287#0: *3 http cleanup add:
0000000001A44E60
2010/03/21 04:07:30 [debug] 4287#0: *3 http finalize request: -4, “/?”
a:1, c:2
2010/03/21 04:07:30 [debug] 4287#0: *3 http request count:2 blk:0
2010/03/21 04:07:30 [debug] 4287#0: *3 http run request: “/?”
2010/03/21 04:07:30 [debug] 4287#0: *3 http upstream check client, write
event:1, “/”
2010/03/21 04:07:30 [debug] 4287#0: *3 http upstream recv(): -1 (11:
Resource temporarily unavailable)
2010/03/21 04:07:30 [alert] 4275#0: worker process 4287 exited on signal
11

I’ve not got time to look into and write a patch for it right now, but
will do if no-one gets around to it at some point.

Thanks,

Marcus.

casper_the_ghost · March 25, 2010, 3:06am

Hello!

On Sun, Mar 21, 2010 at 04:21:38AM +0200, Marcus C. wrote:

Hi,

With a standard (i.e. no options) installation of 0.8.34 on my Linux
machine I get a segfault if the resolver named in the conf file is a
loopback address/IP, but the resolver does not exist. Other
non-existing resolvers don’t cause a problem (they just hang, and
probably will time out), only loopback ones.

The debug log is :

[…]

2010/03/21 04:07:30 [debug] 4287#0: *3 http cleanup add: 0000000001A44E60
2010/03/21 04:07:30 [debug] 4287#0: *3 http finalize request: -4, “/?” a:1, c:2

Could you please provide full debug log (i.e. switched on at global
level, not
http/server/location)? E.g. between the above lines should be some
valuable
resolving information which isn’t logged in request context but in
global one
instead.

It should look like this:

…
2010/03/25 04:53:49 [debug] 86639#0: *1 http cleanup add: 0811F8F4
2010/03/25 04:53:49 [debug] 86639#0: malloc: 08126580:68
2010/03/25 04:53:49 [debug] 86639#0: resolve: “google.com”
2010/03/25 04:53:49 [debug] 86639#0: malloc: 0810D580:60
2010/03/25 04:53:49 [debug] 86639#0: malloc: 08133460:10
2010/03/25 04:53:49 [debug] 86639#0: malloc: 0810AD00:28
2010/03/25 04:53:49 [debug] 86639#0: resolve: “google.com” 12007
2010/03/25 04:53:49 [debug] 86639#0: UDP socket 11
2010/03/25 04:53:49 [debug] 86639#0: connect to 127.0.0.1:53, fd:11 #2
2010/03/25 04:53:49 [debug] 86639#0: kevent set event: 11: ft:-1 fl:0025
2010/03/25 04:53:49 [debug] 86639#0: send: fd:11 28 of 28
2010/03/25 04:53:49 [debug] 86639#0: malloc: 0810D640:60
2010/03/25 04:53:49 [debug] 86639#0: event timer add: -1:
30000:2466707655
2010/03/25 04:53:49 [debug] 86639#0: event timer add: -1:
5000:2466682655
2010/03/25 04:53:49 [debug] 86639#0: *1 http finalize request: -4, “/?”
a:1, c:2
…

but will do if no-one gets around to it at some point.
Config and backtrace should be helpfull too. Unfortunately I’m not able
to
reproduce the problem.

Maxim D.

casper_the_ghost · March 25, 2010, 7:09pm

Just a quick note, if someone needs (for an addon) to traverse an
rbtree,
here is the clean way:

static ngx_inline ngx_rbtree_node_t *
ngx_rbtree_next(ngx_rbtree_t *rbtree, ngx_rbtree_node_t *node)
{
ngx_rbtree_node_t *sentinel;
sentinel = rbtree->sentinel;
if( node->right != sentinel )
{
node = node->right;
while( node->left != sentinel )
node = node->left;
}
else
{
ngx_rbtree_node_t *root, *temp;
root = rbtree->root;
temp = node;
while( temp != root && node == temp->parent->right )
temp = node = temp->parent;
if( temp == root )
return sentinel;
if( node->right != temp->parent )
node = temp->parent;
}
return node;
}

Posted at Nginx Forum:

casper_the_ghost · March 25, 2010, 8:32pm

Sry, this is better:

static ngx_inline ngx_rbtree_node_t *
ngx_rbtree_next(ngx_rbtree_t *rbtree, ngx_rbtree_node_t *node)
{
if( node->right != rbtree->sentinel )
{
node = node->right;
while( node->left != rbtree->sentinel )
node = node->left;
}
else
{
ngx_rbtree_node_t *root;
root = rbtree->root;
while( node != root && node == node->parent->right )
node = node->parent;
if( node == root )
return rbtree->sentinel;
node = node->parent;
}
return node;
}

Posted at Nginx Forum:

casper_the_ghost · September 8, 2010, 6:26am

Hi,

Not only loopback resolver address, but any addresses without UDP:53
port listened can trigger this segfault. I’ve confirmed the problem on
my Linux box. Here’s the backtrace of nginx-0.8.49 when segfault
happens:

#0 0x08059acd in ngx_log_error_core (level=4, log=0x810c16c, err=111,
fmt=0x80e92cd “recv() failed”) at src/core/ngx_log.c:93
#1 0x08062442 in ngx_connection_error (c=0x812c538, err=111,
text=0x80e92cd “recv() failed”) at src/core/ngx_connection.c:1015
#2 0x0806ef06 in ngx_udp_unix_recv (c=0x812c538, buf=0xbfffdcfc
“m\220”, size=4096) at src/os/unix/ngx_udp_recv.c:99
#3 0x080689b1 in ngx_resolver_read_response (rev=0x81424f8) at
src/core/ngx_resolver.c:952
#4 0x08073035 in ngx_epoll_process_events (cycle=0x810d528, timer=5000,
flags=) at src/event/modules/ngx_epoll_module.c:642
#5 0x0806baa2 in ngx_process_events_and_timers (cycle=0x810d528) at
src/event/ngx_event.c:245
#6 0x08070edc in ngx_single_process_cycle (cycle=0x810d528) at
src/os/unix/ngx_process_cycle.c:306
#7 0x08059576 in main (argc=1, argv=0xbfffef74) at src/core/nginx.c:398

The bug exists in all versions later than nginx-0.8.36 but not in
nginx-0.7.x versions.

This problem is due to the incorrectly copy of cycle->new_log when
ngx_resolver_create() initializing udp_connection->log. Because
cycle->new_log only gets initialized in ngx_init_cycle() after all
configurations have been processed, the ngx_resolver_create() will be
called BEFORE cycle->new_log has anything meaningful. And because it
COPIED cycle->new_log, udp_connection->log will always be invalid even
cycle->new_log does get initialized properly later. When resolver is
used to resolve names and failed to connect the specified nameserver,
nginx tries to log a timeout error using the invalid log structure
stored in udp_connection, and then boom…

IMHO, by changing the definition of ngx_udp_connection_t to make log
field a pointer instead of a ngx_log_t structure (also with a bunch of
related reference modifications, of course), this bug will gone without
pains.

The attached patch should fix the bug.