Serve content from a wget mirror

Hi,

I’m trying to create an archive of a current site which is due to be
taken down. I’ve used “wget -m” to mirror the site and all seems well,
except I’m having trouble with what I think are arguments in the URL.

Most everything seems to work OK and this is currently my only issue. For
example, I have on disk the file /dir/page.php?a=1&b=2, but nginx
returns a 404 when accessing http://localhost/dir/page.php?a=1&b=2. I’ve
tried adding $args to the try_files directive but it hasn’t made a
difference, and my google-fu is more google-fail today.

Can someone point me in the right direction please? My current test
config is below.

user www-data;
worker_processes 4;
pid /run/nginx.pid;

events {
    worker_connections 768;
}

http {
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;
    gzip on;
    gzip_disable "msie6";

    server {
        listen 80 default_server;
        listen [::]:80 default_server ipv6only=on;
        root /home/steve/archive;
        index index.html index.htm index.php;
        server_name localhost;

        location / {
            try_files $uri$args $uri $uri/ =404;
        }
    }
}


Steve

On 24/11/2014 12:36, Francis D. wrote:

Option 2: use your try_files thing but use $uri?$args as the first
argument (note the ? in there).

But if your current “offline mirror” is complete enough and does not
contain any html forms which POST or have combinations of options that
have not been used already, it is probably unnecessary.

Ah yes. Option 2 was what I was trying to do and had missed the ? out :(

I’m going to have a look at wget options to encode as option 1 so even
the try_files won’t be needed.

Unfortunately option 3 isn’t an option as the source of the mirror will
be turned off when I’ve got this historical mirror up and running.

Steve.

On Mon, Nov 24, 2014 at 11:05:10AM +0000, Steve W. wrote:

Hi there,

I’m trying to create an archive based on a current site which is due to
be taken down, I’ve used “wget -m” to mirror the site and all seems well
except I’m having trouble with what I think are arguments in the url.

Most everything seems to work ok and this is currently my only issue. I
have on disk the file /dir/page.php?a=1&b=2 for example but nginx
returns a 404 when accessing http://localhost/dir/page.php?a=1&b=2,

url-space and filename-space are different.

Option 1: in all of the html that contains links that refer
to /dir/page.php?a=1&b=2, url-encode it to be (at least)
/dir/page.php%3Fa=1&b=2 (that is: ? becomes %3F).

With that, you need nothing special from nginx.
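
As a rough illustration of why Option 1 needs nothing special from the server: once the ? is written as %3F, the link no longer has a query string, and the decoded path is exactly the file name that wget wrote. The sketch below uses only the Python standard library; the helper name encode_link is made up for this example, and it assumes a link with no #fragment and no existing percent-escapes.

```python
from urllib.parse import quote, unquote, urlsplit

def encode_link(link: str) -> str:
    """Percent-encode '?' so the whole link is treated as a path.

    Hypothetical helper: '/', '=' and '&' are left alone, anything else
    outside the unreserved set (including '?') is percent-encoded.
    """
    return quote(link, safe="/=&")

link = "/dir/page.php?a=1&b=2"
encoded = encode_link(link)
print(encoded)  # /dir/page.php%3Fa=1&b=2

# Original link: everything after '?' is the query string, not the path.
print(urlsplit(link).path)  # /dir/page.php

# Encoded link: no query string; the decoded path matches the on-disk name.
print(unquote(urlsplit(encoded).path))  # /dir/page.php?a=1&b=2
```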

Option 2: use your try_files thing but use $uri?$args as the first
argument (note the ? in there).

With that, you need the try_files magic to handle all requests. And
if you happen to have both “page.php” and “page.php?”, you may have
difficulty accessing the former.
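
To make the difference between the two try_files variants concrete, here is a small Python model of the names each one would look for under the root. It is only a sketch of the substitution, not how nginx is implemented; the helper name candidates is hypothetical, and $uri and $args are approximated by the path and query parts of the request.

```python
from urllib.parse import urlsplit

def candidates(request_target: str, patterns: list[str]) -> list[str]:
    """Expand try_files-style patterns into the names looked up under root."""
    parts = urlsplit(request_target)
    uri, args = parts.path, parts.query  # rough stand-ins for $uri and $args
    return [p.replace("$uri", uri).replace("$args", args) for p in patterns]

target = "/dir/page.php?a=1&b=2"

# Config as posted: $uri$args loses the '?' between path and query.
print(candidates(target, ["$uri$args", "$uri", "$uri/"]))
# ['/dir/page.phpa=1&b=2', '/dir/page.php', '/dir/page.php/']

# With the '?' added: the first candidate is the name wget wrote to disk.
print(candidates(target, ["$uri?$args", "$uri", "$uri/"]))
# ['/dir/page.php?a=1&b=2', '/dir/page.php', '/dir/page.php/']
```

In the same model, a request for plain /dir/page.php expands $uri?$args to /dir/page.php? (with a trailing ?), which is the clash between “page.php” and “page.php?” mentioned above.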

Option 3: keep the mirror of “static” content; but reimplement the
back-end for “dynamic” content.

That’s a lot more work, but is really the only way to provide the
full mirror.

But if your current “offline mirror” is complete enough and does not
contain any html forms which POST or have combinations of options that
have not been used already, it is probably unnecessary.

f

Francis D.

On Tue, Nov 25, 2014 at 07:13:49PM +0000, Steve W. wrote:

On 24/11/2014 12:36, Francis D. wrote:

On Mon, Nov 24, 2014 at 11:05:10AM +0000, Steve W. wrote:

Hi there,

Option 1: in all of the html that contains links that refer
to /dir/page.php?a=1&b=2, url-encode it to be (at least)
/dir/page.php%3Fa=1&b=2 (that is: ? becomes %3F).

I’m going to have a look at wget options to encode as option 1 so even
the try_files won’t be needed.

I suspect that there won’t be a wget option for that. (Unless “-k”
has grown more options recently.)

I imagine that a perl or sed script to edit the files in-place may
be simplest.

It does depend on the content; but looking for the string “.php?” may
be a good start – does that match anything that you do not want to
convert? And then: is there anything left that you do want to convert?
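
A minimal sketch of such an in-place edit, written in Python rather than perl or sed. It assumes the links of interest are double-quoted href or src attributes containing “.php?”; the names LINK_RE, encode_links and rewrite_tree are invented for this example. It works on bytes so that file encodings and binary files are left untouched.

```python
import re
from pathlib import Path

# Assumption: only double-quoted href="..." / src="..." values with ".php?".
LINK_RE = re.compile(rb'((?:href|src)=")([^"]*\.php\?[^"]*)(")', re.IGNORECASE)

def encode_links(html: bytes) -> bytes:
    """Replace '?' with '%3F' inside the matching attribute values only."""
    return LINK_RE.sub(
        lambda m: m.group(1) + m.group(2).replace(b"?", b"%3F") + m.group(3),
        html,
    )

def rewrite_tree(root: Path) -> int:
    """Rewrite every regular file under root in place; return how many changed."""
    changed = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        old = path.read_bytes()
        new = encode_links(old)
        if new != old:
            path.write_bytes(new)
            changed += 1
    return changed

sample = b'<a href="/dir/page.php?a=1&amp;b=2">x</a> <a href="/about.html">y</a>'
print(encode_links(sample))
```

Running rewrite_tree(Path("/home/steve/archive")) on a copy of the mirror first would be the cautious way to try it. Note that this pattern also rewrites links to other sites that contain “.php?”, which is one case of matching something that should not be converted.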

Good luck with it,

f

Francis D.