Cache doesn't expire as should

Hi,

I stuck in analyzing this and can not repeat in a test environment.

We see ‘/’ location “freeze” from time to time as nginx stops updating
cache. This lasts for 30 minutes - an hour.

Regular requests flow:
HIT

HIT
UPDATING
UPDATING
EXPIRED (every 30 seconds)
HIT

So under normal conditions a page is served from the cache, cache record
expires, next request goes to upstream locking the cache record,
requests
still served from cache with UPDATING cache status while UPDATING
request
is working, on return from the upstream cache is updated, unlocked and
EXPIRED cache status show up in the log. Back to step one - serving from
cache.

Abnormal flow we see under nginx 1.6.x
HIT

HIT
UPDATING
STALE
UPDATING
UPDATING
STALE
UPDATING
STALE
STALE

continues for a while

HIT

We first noticed this when attempted upgrade to 1.8.0 where situation
was
much worse:
HIT

HIT
UPDATING
UPDATING
UPDATING
UPDATING

continues until cache clean

We downgraded nginx back to 1.6, after some time realized that this
situation also happens on 1.6, but page doesn’t freeze forever.

I don’t say it only happens with ‘/’, it’s just the busiest page, we
probably don’t notice other pages. On 1.8 this happened also on top
level
sections /football/. I feel it’s related to amount of pressure on the
page.

Our error log has “ignore long locked inactive cache entry” alerts, but
I
really couldn’t match it to “defreeze” event. Access log has
STALE/UPDATING
requests between the alert and EXPIRED (cache updating) request.

Any help on hunting it down would be appreciated.

Your (pastebin) config?
Cache situation? (harddisks, cache configuration, etc.)

Posted at Nginx Forum:

On Tue, Dec 22, 2015 at 3:42 PM, itpp2012 [email protected] wrote:

Your (pastebin) config?

Cache situation? (harddisks, cache configuration, etc.)

Pasted relevant configuration. Can add more on request, our complete
config
is 4k lines splitted into 200 files.

Hosts are FreeBSD 8.4-RELEASE-p4 with the cache on a memory disk with
UFS.

nginx/1.6.3

Hello!

On Tue, Dec 22, 2015 at 02:43:11PM +0300, wrote:

[…]

Our error log has “ignore long locked inactive cache entry” alerts, but I
really couldn’t match it to “defreeze” event. Access log has STALE/UPDATING
requests between the alert and EXPIRED (cache updating) request.

The “ignore long locked inactive cache entry” alerts indicate that
a cache entry was locked by some request, and wasn’t unlocked for
a long time. The alert is expected to appear if a cache node is
locked for cache inactive time (as set by proxy_cache_path
inactive=, 10 minutes by default).

Most likely reason is that a worker died or was killed
while holding a lock on a cache node (i.e., while a request was
waiting for a new response from a backend).

Trivial things to consider:

  • check logs for segmentation faults;

  • if you are using 3rd party modules / patches, try without them;

  • make sure you don’t kill worker processes yourself or using some
    automation scripts (in particular, don’t try to terminate old
    worker processes after a configuration reload).


Maxim D.
http://nginx.org/

On Tue, Dec 22, 2015 at 4:39 PM, itpp2012 [email protected] wrote:

Have you tried without locking?

No as it’s unacceptable load increase on backends when cache record
expires.

Have you tried without locking?

Posted at Nginx Forum:

On Tue, Dec 22, 2015 at 4:49 PM, Maxim D. [email protected]
wrote:

The “ignore long locked inactive cache entry” alerts indicate that
a cache entry was locked by some request, and wasn’t unlocked for
a long time. The alert is expected to appear if a cache node is
locked for cache inactive time (as set by proxy_cache_path
inactive=, 10 minutes by default).

Inactive is defined in the config, but it’s set to default 10m.

What happens with requests after this time? Do they hit backend and
update
cache? Do they use stale version?

I’m going to check “long locked” messages in the log to see how many was
for “/” location.
The hash should be the same if we didn’t change cache key, right?

Most likely reason is that a worker died or was killed

while holding a lock on a cache node (i.e., while a request was
waiting for a new response from a backend).

Shouldn’t be there a record in error log? Error log level at warn.

Trivial things to consider:

  • check logs for segmentation faults;

no seg faults in logs

  • if you are using 3rd party modules / patches, try without them;

from freebsd port, updated gist [1] with nginx -V output

  • make sure you don’t kill worker processes yourself or using some
    automation scripts (in particular, don’t try to terminate old
    worker processes after a configuration reload).

One recent appearance of the problem was at 1:30AM and I checked
logs for crazy midnight deploys - nothing.

Also, we don’t use anything custom to restart nginx, just regular
services
management tools.

[1] nginx_cache.conf · GitHub

On Wed, Dec 23, 2015 at 6:35 PM, Maxim D. [email protected]
wrote:

a long time. The alert is expected to appear if a cache node is
As long as an entry is locked and not accessed for the inactive
time, it will cause the alert in question to be logged by cache
manager. Nothing else happens - that is, the entry is still
locked, and if it contains information that the entry is currently
being updated, requests to this entry will return stale content
with UPDATING status as per proxy_cache_use_stale.

This alert may in theory happen in normal situation if a backend
response takes longer than the inactive time set. But in general
it indicates that there is a problem.

uwsgi on backends has harakiri setting at 9 seconds and 10 seconds
timeout
in
nginx. Index page takes 0.2-0.7 seconds to render.

(Note well that “locked” here should not be confused with

Most likely reason is that a worker died or was killed

result of an upstream server just being very slow)?

Looks like we have some misconfiguration and I get the following:

2015/12/17 01:49:13 [crit] 5232#0: *599273664 rename()
“/var/nginx/cache/0080683483” to
“/var/www/storage/img/ru-cyber/answers/10-5.jpg” failed (13: Permission
denied) while reading upstream, client: 188.115.151.67, server:
www.sports.ru, request: “GET /storage/img/ru-cyber/answers/10-5.jpg
HTTP/1.1”, upstream: "
http://192.168.1.10:80/storage/img/ru-cyber/answers/10-5.jpg", host: "
www.sports.ru", referrer: “Рак или так? Узнай свой психологический MMR - Блоги - Cyber.Sports.ru

2015/12/17 01:50:17 [crit] 5229#0: *599313395 open()
“/var/www/storage/img/fantasy/shirts/ule/slo.png” failed (13: Permission
denied), client: 192.168.1.10, server: www.sports.ru, request: “GET
/storage/img/fantasy/shirts/ule/slo.png HTTP/1.0”, host:
www.sports.ru”,
referrer: "
http://fantasy-h2h.ru/analytics/fantasy_statistics/liga_europa_2015"

management tools.

Just a regular kill will be enough screw up things. On the other
hand, it may be an nginx bug as well.

Note though, that the version you are using is quite old.
Debugging anything but 1.9.x hardly make sense. You may want to
try upgrading to the latest nginx-devel port.

As I said before 1.8.0 made things worse. May be we will try nginx 1.9.x
on
one of frontends and see how it goes.

Hello!

On Wed, Dec 23, 2015 at 11:57:53AM +0300, wrote:

STALE/UPDATING

What happens with requests after this time? Do they hit backend and update
cache? Do they use stale version?

As long as an entry is locked and not accessed for the inactive
time, it will cause the alert in question to be logged by cache
manager. Nothing else happens - that is, the entry is still
locked, and if it contains information that the entry is currently
being updated, requests to this entry will return stale content
with UPDATING status as per proxy_cache_use_stale.

This alert may in theory happen in normal situation if a backend
response takes longer than the inactive time set. But in general
it indicates that there is a problem.

(Note well that “locked” here should not be confused with
request-level locks as in proxy_cache_lock.)

I’m going to check “long locked” messages in the log to see how many was
for “/” location.
The hash should be the same if we didn’t change cache key, right?

Any message indicate that there is a problem. If the “/” location
is requested often, it’ll likely won’t be mentioned in the “ignore long
locked” messages.

Most likely reason is that a worker died or was killed
while holding a lock on a cache node (i.e., while a request was
waiting for a new response from a backend).

Shouldn’t be there a record in error log? Error log level at warn.

There should be something like “worker process exited” messages,
though at least some of them may be on the “notice” level (e.g.,
if you force nginx worker to quit by sending SIGTERM to it
directly).

Trivial things to consider:

  • check logs for segmentation faults;

no seg faults in logs

Any other alerts, crit or emerg messages? Any alerts about
socket leaks on configuration reloads (“open socket left…”),
sockets in CLOSED state? How long relevant requests to upstream
server take (i.e., make sure that the behaviour observed isn’t a
result of an upstream server just being very slow)?

  • if you are using 3rd party modules / patches, try without them;

from freebsd port, updated gist [1] with nginx -V output
[1] nginx_cache.conf · GitHub

The only suspicious thing I see there is “ssi on” and
“proxy_cache_lock” at the same time. There were some fixes
related to proxy_cache_lock with subrequests in more recent
versions.

  • make sure you don’t kill worker processes yourself or using some
    automation scripts (in particular, don’t try to terminate old
    worker processes after a configuration reload).

One recent appearance of the problem was at 1:30AM and I checked
logs for crazy midnight deploys - nothing.

Also, we don’t use anything custom to restart nginx, just regular services
management tools.

Just a regular kill will be enough screw up things. On the other
hand, it may be an nginx bug as well.

Note though, that the version you are using is quite old.
Debugging anything but 1.9.x hardly make sense. You may want to
try upgrading to the latest nginx-devel port.

Note well that if a problem happens, it’s a good idea to either
completely restart nginx or do binary upgrade procedure (“service
nginx upgrade”) to make sure shared memory structures are intact.


Maxim D.
http://nginx.org/