OK, I have an incredibly weird nginx connection issue.
I have a cluster of boxes that are responsible for terminating SSL
requests and passing them to a local haproxy instance for further
routing. I have corosync/pacemaker set up to manage the IP addresses and
fail over instances if there’s an issue.
This server has been running fine for a long time, but we recently had
to reboot because of the GHOST vulnerability (CVE-2015-0235). Before we
did that, we did an apt-get upgrade to get to the latest Debian Wheezy
packages, including a new nginx (1.6.2), openssl, kernel, and just about
everything else.
After that happened, we started seeing connection issues to the nginx
that does SSL termination. When it was happening, about 50% of our
requests were timing out (iOS/Android clients). I was testing manually
using curl when it was happening, and we were seeing huge fluctuations
in the time it takes to connect. I saw a lot of connections just timing
out completely, in combination with connections taking 1s, 3s, 15s, 30s,
etc…
When this issue was happening to nginx, haproxy on the same box was
unaffected. I tested that by curling every second from a box close to
it, logging the results, and verifying them. So, it seemed to just be
SSL with nginx.
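For anyone who wants to reproduce the once-a-second test, here is a rough Python sketch that times the TCP connect and the TLS handshake separately, which helps show which phase is stalling. It is not the curl command I actually ran. The function names, the 5s timeout, and the disabled certificate checks (the probe targets a bare IP) are assumptions made for illustration.

```python
import socket
import ssl
import time


def probe(host, port=443, timeout=5.0, tls=True, server_name=None):
    """Return (connect_seconds, handshake_seconds, error) for one attempt.

    A phase that did not complete is reported as None.
    """
    connect_s = handshake_s = None
    start = time.perf_counter()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:  # covers timeouts and refused connections
        return connect_s, handshake_s, f"connect: {exc}"
    connect_s = time.perf_counter() - start
    try:
        if tls:
            ctx = ssl.create_default_context()
            # Assumption: we probe by IP address, so certificate
            # verification is switched off for this timing test only.
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            t0 = time.perf_counter()
            with ctx.wrap_socket(sock, server_hostname=server_name):
                handshake_s = time.perf_counter() - t0
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return connect_s, handshake_s, f"handshake: {exc}"
    finally:
        sock.close()
    return connect_s, handshake_s, None


def run(targets, count=60, interval=1.0):
    """Probe every (host, port) pair once per interval and print one line each."""
    for _ in range(count):
        for host, port in targets:
            connect_s, handshake_s, error = probe(host, port)
            print(time.strftime("%H:%M:%S"), host, port,
                  connect_s, handshake_s, error)
        time.sleep(interval)
```

Something like `run([("192.0.2.10", 443), ("192.0.2.11", 443)])` would probe two addresses side by side. Both addresses are placeholders, not my real IPs.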
Now that our peak load is down, it’s not as big an issue, but we are
still seeing connection issues when I curl. The delays are typically
more like 1-3s, and there are not as many of them. Since we’ve had some
time to experiment, I’ve gathered more information that makes no sense
to me.
Almost all the traffic was set up to go to the address managed by
corosync. When I set up my curl tests to run every second against that
address, I see the timeouts. So, I tried something. I bound the main IP
address of the NIC to nginx, reloaded, and redid the same test, but
pointed curl at the main IP address. As soon as I did that, my curl
tests never saw a single issue: the connect phase never took more than
2ms, and there were no timeouts.
So, I started thinking it was the corosync IP. I switched all our
traffic over to the main NIC IP address that had just tested fine, and
once normal traffic levels were hitting the main NIC address, I started
seeing curl timeouts on it. I then started curling the corosync IP that
used to be primary, and now IT has no connection issues.
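To compare the two addresses on numbers rather than by eyeballing logs, the per-attempt connect times could be condensed along these lines. This is only a sketch: it assumes the samples have already been collected as a list of seconds with `None` for a timed-out attempt, and the `summarize` name and the 0.5s "slow" threshold are arbitrary choices for illustration.

```python
from statistics import median


def summarize(samples, slow_after=0.5):
    """Condense connect times (seconds, None = timed out) into a few counts."""
    ok = [s for s in samples if s is not None]
    return {
        "attempts": len(samples),
        "timeouts": len(samples) - len(ok),
        # Completed connects that still took suspiciously long.
        "slow": sum(1 for s in ok if s >= slow_after),
        "median_s": median(ok) if ok else None,
        "worst_s": max(ok) if ok else None,
    }
```

Running it once per target address (busy IP versus idle IP) gives a quick way to see whether the timeouts follow the traffic or the address.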
So, I have connection issues to nginx, but only on the IP address that
takes the traffic. nginx on a different IP on the same NIC is fine.
haproxy on the same NIC is fine.
What the heck? I’m struggling to think of anything I could tweak. This
doesn’t make sense, but I have triple-checked my info, and it’s legit.