Hi all,
I need some help. A system I’m supporting is collapsing in a heap
whenever it is put under even moderate load!
The setup - Ubuntu, nginx 0.7.62, php-fastcgi.
The application is a chatroom, and is built as follows.
The main screen is multi-panelled. One panel shows the data entry area. When
the “send” button is clicked, the input is sent to the server. The server stores
the data in a MySQL database and returns an empty 200 reply. The input
panel is cleared and ready for the next message.
While this is going on, another panel is using “get” to fetch all the
messages since No. X. The server receives this, checks the database once
per second for 4 seconds for message X or later. When it finds something
to send to the user (or after 4 seconds) it replies. When the page
returns, the on-load event updates the other panels, including updating
a clock to give confidence the process is working.
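
To make the wait concrete, here is a minimal sketch of the heartbeat handler as described above. It is in Python purely for illustration (the real site runs PHP), and `fetch_since` is a hypothetical stand-in for the MySQL query. The thing to notice is that a single heartbeat can occupy a worker for up to 4 seconds.

```python
import time
from typing import Callable, List


def long_poll(fetch_since: Callable[[int], List[str]],
              since: int,
              max_checks: int = 4,
              interval: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Look for messages numbered `since` or later, once per interval.

    Returns as soon as a check finds something, or an empty list after
    max_checks unsuccessful checks. The worker stays busy the whole time.
    """
    for _ in range(max_checks):
        messages = fetch_since(since)  # hypothetical database lookup
        if messages:
            return messages
        # Nothing yet: hold the request open and try again shortly
        sleep(interval)
    return []
```

With the defaults, an idle chatroom means every heartbeat costs the full 4 seconds of worker time.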
Moments after loading the data for message X, the page reloads itself,
requesting the messages since X+1. This heartbeat goes on in the
background all the time. The result is a chatroom that “just works”
through proxies without any software installation, even in tied-down
business sites. If the user can browse the internet, has cookies and
javascript enabled, the system will work.
When we tested it with 4 users - the most we could achieve - things
worked great.
But when there are a lot of users (the report is of 7) the response
becomes very slow. The logs contain gaps of 20 seconds between replies
and reply codes of 499. There have always been reports of dropped
messages, and people getting logged out, but I could never identify why.
PHP_FCGI_CHILDREN is set at 3.
What I think is happening is that, when 4 users are logged on, the
server is handling 3 of them, and the 4th is waiting in nginx.
As each message is only in the machine for 4 seconds, things circulate
nicely. When someone posts a message, it may take a moment to queue for
the server, but when it gets handled it will trigger all the other users
to receive replies quickly, so the queue will clear. The queue will then
form again immediately.
When there are rather more users - say 8 - at any moment 3 are going
through the server in parallel taking 4 seconds each. The server is
handling 1 every 1.4 seconds. 5 are queueing in nginx, in addition to any
messages. Queue time is therefore at least (5*1.4) + 4 = 11 seconds.
More if anyone posts a message.
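
The estimate above can be written out as a small calculation. This is only a back-of-envelope sketch under the same assumptions (every heartbeat holds a child for the full 4 seconds, and nobody posts); `estimated_wait` is a name made up for the illustration. Note that 4 seconds / 3 children is nearer 1.33 seconds than 1.4, so the function gives about 10.7 seconds for 8 users rather than 11.

```python
def estimated_wait(users: int, children: int = 3,
                   hold_seconds: float = 4.0) -> float:
    """Rough wait for a request that joins the back of the nginx queue."""
    # Requests that cannot get a PHP child straight away
    queued = max(users - children, 0)
    # With every child busy, one child frees up this often on average
    seconds_per_slot = hold_seconds / children
    # Time for the queue ahead to drain, plus the request's own hold time
    return queued * seconds_per_slot + hold_seconds


for n in (4, 8, 12):
    print(n, "users:", round(estimated_wait(n), 1), "seconds")
```

The wait grows linearly with the number of users beyond PHP_FCGI_CHILDREN, which fits the gaps seen in the logs once 7 or more people are on.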
It is possible for the queue of messages to contain both a heartbeat and
a message update from a given user.
Question - What will nginx do in this situation? If it discards one or
the other with a 499 then we will have dropped messages or the heartbeat
will stop. If so, then this is why we have reports of dropped messages
and random breakdowns.
Second. What is the solution? Raising PHP_FCGI_CHILDREN to the number
needed is clearly not going to work - I don’t have enough RAM!
My thinking.
1. Alter the server code so that it looks once and replies. It will
   reply with many more “null” returns, but it will handle each request in
   a fraction of a second. The queue will disappear.
2. Alter the client code so that it delays longer - say 3 seconds -
   after getting a heartbeat update before requesting the next.
3. Have the send shorten this delay - perhaps the reply will trigger the
   next heartbeat request - so that your posts come back quickly.
4. If the heartbeat is in progress when a send is requested, delay the
   send until the heartbeat’s reply is received.
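
The four changes can be sketched as follows, to show how they fit together. Again this is Python for illustration only (the server is PHP and the client is JavaScript), and `check_once`, `next_request` and `fetch_since` are hypothetical names rather than anything in the existing code.

```python
from typing import Callable, List, Tuple


def check_once(fetch_since: Callable[[int], List[str]], since: int) -> List[str]:
    """Change 1: look in the database once and reply at once, even if empty."""
    return fetch_since(since)  # hypothetical database lookup, no waiting loop


def next_request(last_reply: str, pending_sends: int,
                 idle_delay: float = 3.0) -> Tuple[str, float]:
    """Decide what the client issues once a reply has arrived.

    last_reply is "heartbeat" or "send". Returns the kind of request to
    make next and how many seconds to wait before making it.
    """
    if pending_sends > 0:
        # Change 4: sends held back during a heartbeat go out first
        return ("send", 0.0)
    if last_reply == "send":
        # Change 3: a completed send triggers the next heartbeat immediately
        return ("heartbeat", 0.0)
    # Change 2: otherwise pause before asking for the next heartbeat
    return ("heartbeat", idle_delay)
```

Under this scheme each client has at most one request outstanding at a time. A send typed during the 3-second pause would presumably go out straight away and cancel the pending heartbeat timer, which the sketch leaves to the caller.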
Question - will this work? Why? Why not?
Question - is point 4 necessary?
Input and ideas gratefully received.
Regards
Ian