php-fastcgi and nginx problem

Hi all,

I need some help. A system I’m supporting is collapsing in a heap
whenever it is put under even moderate load!

The setup - Ubuntu, nginx 0.7.62, php-fastcgi.

The application is a chatroom, and is built as follows.

The main screen is multi-panelled. One panel shows the data entry area.
When the "send" button is clicked, the input is sent to the server. The
server stores the data in a MySQL database and returns an empty 200
reply. The input panel is then cleared, ready for the next message.
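
As a sketch, the send handler amounts to something like the following
(the database credentials and the table layout are my illustrative
assumptions, not the real schema):

    <?php
    // send.php - store one chat message; the empty 200 reply is PHP's default.
    // Credentials and table layout are illustrative assumptions.
    $db = new mysqli('localhost', 'chat_user', 'secret', 'chatroom');
    $stmt = $db->prepare('INSERT INTO messages (user, body) VALUES (?, ?)');
    $stmt->bind_param('ss', $_POST['user'], $_POST['body']);
    $stmt->execute();
    // Send no output: the client only checks the status and clears the panel.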

While this is going on, another panel is using "get" to fetch all the
messages since No. X. The server receives this and checks the database
once per second, for up to 4 seconds, for message X or later. When it
finds something to send to the user (or after the 4 seconds are up) it
replies. When the page returns, the on-load event updates the other
panels, including a clock that gives the user confidence the process is
working.
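
In sketch form, I understand the current polling code to be doing
roughly this (same illustrative names as above):

    <?php
    // poll.php - wait up to 4 seconds for messages numbered $since or later.
    $since = (int) $_GET['since'];
    $db = new mysqli('localhost', 'chat_user', 'secret', 'chatroom');
    for ($try = 0; $try < 4; $try++) {
        $res = $db->query("SELECT id, user, body FROM messages WHERE id >= $since");
        if ($res && $res->num_rows > 0) {
            $rows = array();
            while ($row = $res->fetch_assoc()) {
                $rows[] = $row;
            }
            echo json_encode($rows);   // reply as soon as something is found
            exit;
        }
        sleep(1);   // the FastCGI worker is pinned for this whole second
    }
    echo json_encode(array());   // nothing after 4 seconds: empty reply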

Moments after loading the data for message X, the page automatically
requests the messages since X+1. This heartbeat goes on in the
background all the time. The result is a chatroom that "just works"
through proxies without any software installation, even on locked-down
business sites. If the user can browse the internet and has cookies and
JavaScript enabled, the system will work.

When we tested it with 4 users - the most we could achieve - things
worked great.

But when there are a lot of users (the report is of 7) the response
becomes very slow. The logs contain gaps of 20 seconds between replies,
and reply codes of 499. There have always been reports of dropped
messages and people getting logged out, but I could never identify why.

PHP_FCGI_CHILDREN is set at 3.

What I think is happening is that, when 4 users are logged on, the
server is handling 3 of them and the 4th is waiting in nginx.
As each request is only in the machine for at most 4 seconds, things
circulate nicely. When someone posts a message, it may take a moment to
queue for the server, but once it is handled it will trigger quick
replies to all the other users, so the queue will clear. The queue will
then form again immediately.

When there are rather more users - say 8 - at any moment 3 are going
through the server in parallel, taking 4 seconds each, so the server
completes one request roughly every 1.33 seconds (4 seconds ÷ 3
children). The other 5 are queueing in nginx, in addition to any
messages. Queue time is therefore at least (5 × 1.33) + 4 ≈ 10.7
seconds - call it 11. More if anyone posts a message.
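
To make that arithmetic concrete (a toy calculation, using the same
assumptions as the paragraph above):

    <?php
    // Back-of-envelope queue estimate: N users long-polling, C FastCGI
    // children, each request holding a child for T seconds.
    $users = 8; $children = 3; $hold = 4;
    $interval = $hold / $children;            // a child frees up every ~1.33 s
    $queued   = $users - $children;           // 5 requests waiting in nginx
    $wait     = $queued * $interval + $hold;  // queue drain + own service time
    printf("Worst-case wait: about %.1f seconds\n", $wait);   // ~10.7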

It is possible for the nginx queue to contain both a heartbeat and a
message update from the same user.

Question - What will nginx do in this situation? If it discards one or
the other with a 499, then we will have dropped messages or the
heartbeat will stop. If so, then this is why we have reports of dropped
messages and random breakdowns.

Second. What is the solution? Raising PHP_FCGI_CHILDREN to the number
needed is clearly not going to work - I don’t have enough RAM!

My thinking.

  1. Alter the server code so that it looks once and replies immediately.
    It will reply with many more "null" returns, but it will handle each
    request in a fraction of a second, and the queue will disappear (a
    sketch follows this list).

  2. Alter the client code so that it delays longer - say 3 seconds -
    after getting a heartbeat update before requesting the next one.

  3. Have "send" shorten this delay - perhaps the reply will trigger the
    next heartbeat request - so that the user's own posts come back
    quickly.

  4. If a heartbeat is in progress when a send is requested, delay the
    send until the heartbeat's reply is received.
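
For point 1, a minimal sketch of the look-once handler, using the same
illustrative names as the earlier sketches:

    <?php
    // poll.php revised for point 1: look once, reply at once. The worker
    // is now held for milliseconds instead of 4 seconds.
    $since = (int) $_GET['since'];
    $db = new mysqli('localhost', 'chat_user', 'secret', 'chatroom');
    $res = $db->query("SELECT id, user, body FROM messages WHERE id >= $since");
    $rows = array();
    while ($row = $res->fetch_assoc()) {
        $rows[] = $row;
    }
    // Often empty - the "null" return; the client decides when to poll again.
    echo json_encode($rows);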

Question - will this work? Why? Why not?

Question - is point 4 necessary?

Input and ideas gratefully received.

Regards

Ian

Further thinking…

The browser may request images and other content in parallel, and more
than one such file may be PHP-based, so simultaneous requests are not in
themselves the problem or the cause of the 499s.

Further investigation revealed that the client times out the server
after 10 seconds or so and re-sends the request. Since 499 is the code
nginx logs when the client closes the connection before the reply is
sent, I rather suspect that this is the cause of the 499s.

Ian

Rob S. wrote:

Ian

I am not sure of the details, but I think you need to handle all the control and timing on the client side and let the PHP processes do what they are intended to do, which is to find new chat messages and return them immediately. Having the 4-second wait in the PHP process locks that process up for the entire 4 seconds, so if you only have 3 processes you can only have 3 requests in flight at any given time, each locking up a process for 4 seconds. I think it would be best to just return whatever is there; if nothing is there, have the client expect that and handle it on the client side.

Hi Rob,

Thanks for your input. I have actually made the changes for points 1 to
3, and we managed to test it with 7 users. It responded very well.

I don’ think point 4 is necessary. If the browser loads a page with
multiple images, the images come down in parallel, so parallel calls
must work. The only problem may be parallel requests to php, but I can
think of no reason that should fail.

Anyway, my fingers are crossed as I type (what an image!). There is a
sales demo of the system going on this afternoon. We shall see what
happens!

Regards

Ian
