Problem with PCRE matching, utf-8, Greek, rewrite

Dear all,
I am trying to implement some rewrites using regular expressions, and my URIs
will contain Greek characters.

The REs work fine when tested with pcretest:

[root@localhost ~]# pcretest
PCRE version 8.10 2010-06-25

  re> #^[\x{0386}-\x{03FF}]+$#8
data> bv
No match
data> Τηλέ
 0: \x{3a4}\x{3b7}\x{3bb}\x{3ad}

Note the 8 modifier, which tells pcretest to compile the pattern in UTF-8 mode.
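For comparison, here is a minimal Python sketch of the same check. It uses Python's re module rather than PCRE, so it is only an analogy, and the helper name is_greek is made up for illustration. On decoded text the range U+0386..U+03FF is matched per character, while the UTF-8 encoding of the same word is twice as many bytes, none of them above 0xFF.

```python
import re

# Same range as the pcretest pattern: U+0386 .. U+03FF.
GREEK_RANGE = re.compile(r"[\u0386-\u03FF]+")


def is_greek(text: str) -> bool:
    """Return True if every character of text lies in U+0386..U+03FF."""
    return GREEK_RANGE.fullmatch(text) is not None


word = "\u03a4\u03b7\u03bb\u03ad"  # the test word from the pcretest session

# Code points, written the way pcretest prints them.
code_points = "".join(f"\\x{{{ord(ch):x}}}" for ch in word)

# The same word as raw UTF-8 bytes: 8 bytes for 4 characters.
utf8_bytes = word.encode("utf-8")

print(is_greek(word))   # True
print(is_greek("bv"))   # False
print(code_points)      # \x{3a4}\x{3b7}\x{3bb}\x{3ad}
print(len(word), len(utf8_bytes))  # 4 8
```

A byte-oriented matcher only ever sees values 0..255, which is why a value such as \x{0386} is rejected as too large when UTF-8 mode is off.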

With the same RE in nginx.conf, nginx complains:

[emerg]: pcre_compile() failed: character value in \x{...} sequence is
too large in

which I guess means that nginx somehow calls PCRE without the PCRE_UTF8
option flag.

Am I right? How can I implement these Greek character URL rewrites?

The system environment is:

  • CentOS 5.4
  • PCRE 8.10 with utf-8 and utf-properties enabled
  • nginx 0.8.42

Cheers
Tilemahos

Posted at Nginx Forum:

Hello!

On Thu, Jul 01, 2010 at 11:33:49AM -0400, tmanolat wrote:

> re> #^[\x{0386}-\x{03FF}]+$#8
>
> [emerg]: pcre_compile() failed: character value in \x{...} sequence is
> too large in
>
> which I guess means that somehow nginx calls PCRE without the PCRE_UTF8
> option flag
>
> Am I right? How can I implement these Greek character URL rewrites?

Using (*UTF8) to switch pcre into utf-8 mode should work fine in
both nginx and pcretest. See man pcresyntax for details.

Maxim D.

tmanolat at 2010-7-1 23:33 wrote:

> re> #^[\x{0386}-\x{03FF}]+$#8
>
> [emerg]: pcre_compile() failed: character value in \x{...} sequence is
> too large in
>
> which I guess means that somehow nginx calls PCRE without the PCRE_UTF8
> option flag
>
> Am I right? How can I implement these Greek character URL rewrites?

I use the raw bytes for Chinese character substitution in my substitution
module.

I think you could convert the Greek characters like this:

‘\x3a\x43\xb7\x3b\xb3\xad’
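As a hedged aside, the raw UTF-8 bytes of a word can be listed with a few lines of Python. The bytes are not the code point digits split into pairs: each letter of the test word from the original post encodes to two bytes, the first of which is 0xCE. The helper name utf8_escapes is hypothetical.

```python
def utf8_escapes(text: str) -> str:
    """Render text as \\xNN escapes, one escape per UTF-8 byte."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode("utf-8"))


word = "\u03a4\u03b7\u03bb\u03ad"  # the Greek test word from the original post

print(utf8_escapes(word))  # \xce\xa4\xce\xb7\xce\xbb\xce\xad
```

A pattern built from such byte escapes matches one fixed string only, so a separate escape sequence is needed for every character that should be accepted.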

Weibin Y.

tmanolat Wrote:

> FYI I put in nginx.conf:
>
> if ($request_uri ~ "(*UTF8)^(.*)[\\?|&]filename=([% ,a-zA-Z0-9\x{386}-\x{3ff}_\-\.]+)(&.*)?$") {

I've got a very similar problem in nginx, but I don't really understand
your solution. Could you please post your nginx.conf, or at least some
more lines related to UTF-8 filename conversion? It would be a lifesaver!

Dear Maxim,
it works like a charm now.

FYI I put in nginx.conf:

...
      if ($request_uri ~ "(*UTF8)^(.*)[\\?|&]filename=([% ,a-zA-Z0-9\x{386}-\x{3ff}_\-\.]+)(&.*)?$") {
...

Kindest regards,
Tilemahos Manolatos

PS. @Weibin Y.: I would like to avoid “fixed” character lists and use
ranges of characters instead, so the above solution seems, in my opinion,
better for matching all Greek characters.

Initially this worked well (\x{386}-\x{3ff} covers the Greek chars):

    location ~ "^(/optionalwebappname)?/ProcessImageServlet.*$" {
      root   /opt/myfilerepository/;

      # Defaults, so the final rewrite never sees an unset variable.
      set $hid '';
      set $fn '';
      set $th '';

      if ($request_uri ~ "^(.*)[\\?|&]hotel_id=([0-9]+)(&.*)?$") {
        set $hid $2;
      }
      if ($request_uri ~ "(*UTF8)^(.*)[\\?|&]filename=([% ,a-zA-Z0-9\x{386}-\x{3ff}_\-\.]+)(&.*)?$") {
        set $fn $2;
      }
      if ($request_uri ~ "^(.*)[\\?|&]type=th(&.*)?$") {
        set $th 'th_';
      }
      rewrite ^(.+)$ http://static-dev.myhost.eu/$hid/$th$fn break;
      access_log  logs/site-pis.log  main;
      expires           1h;
    }

However, I later found that the following works better, including with UTF-8
arguments. Check this out first; it is much more elegant:

    location ~ "^(/optionalwebappname)?/ProcessImageServlet.*$" {
      set $th '';

      if ($request_uri ~ "^(.*)[\\?|&]type=th(&.*)?$") {
        set $th 'th_';
      }
      rewrite ^(.+)$ http://static-dev.myhost.eu/$arg_hotel_id/$th$arg_filename break;
      expires           1d;
    }

I would really like a wiki page on UTF-8 support as well.

(*UTF8) doesn't work for me though; I've tried it.

When attempting to use (*UTF8) I always receive:
[emerg]: pcre_compile() failed: (*VERB) not recognized in
"(*UTF8)^/([^/^.]+)(?:/?)(?:index([0-9]).html?)?$" at
"8)^/([^/^.]+)(?:/?)(?:index([0-9]).html?)?$" in
/etc/nginx/sites-enabled/nexusddl.com:85

Yet my PCRE has UTF-8 support; I tested it in PHP (both nginx and PHP are
compiled against the PCRE library included in Debian).

Hello!

On Sat, Sep 25, 2010 at 11:43:48AM +1000, mat h wrote:

> When attempting to use (*UTF8) I always receive.
> [emerg]: pcre_compile() failed: (*VERB) not recognized in
> "(*UTF8)^/([^/^.]+)(?:/?)(?:index([0-9]).html?)?$" at
> "8)^/([^/^.]+)(?:/?)(?:index([0-9]).html?)?$" in
> /etc/nginx/sites-enabled/nexusddl.com:85
>
> Yet my PCRE has UTF-8 support, tested it in PHP (both nginx and php
> compiled against PCRElib included in debian)

You need at least pcre 7.9 for (*UTF8) support.

http://www.pcre.org/changelog.txt

[…]

>       if ($request_uri ~ "(*UTF8)^(.*)[\\?|&]filename=([% ,a-zA-Z0-9\x{386}-\x{3ff}_\-\.]+)(&.*)?$") {
>         set $fn $2;
>       }

Note that (*UTF8) is meaningless here as $request_uri doesn’t
contain utf-8 characters, it’s urlencoded.
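To illustrate the point, here is a small Python sketch of what a Greek filename looks like once it is percent-encoded, which is the form in which it arrives in $request_uri. The filename is a made-up example, and the character class is a simplified ASCII-only stand-in for the one in the config, not a copy of it.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical filename containing Greek letters.
filename = "\u03a4\u03b7\u03bb\u03ad.jpg"

# Percent-encode the UTF-8 bytes, as a browser does before sending the request.
encoded = quote(filename)
print(encoded)  # %CE%A4%CE%B7%CE%BB%CE%AD.jpg

# An ASCII-only class is enough for the encoded form, because '%' plus
# hex digits covers every escaped byte. No Greek range is involved.
ASCII_ONLY = re.compile(r"[%,a-zA-Z0-9_.\-]+")
print(ASCII_ONLY.fullmatch(encoded) is not None)  # True

# Decoding restores the original text.
print(unquote(encoded) == filename)  # True
```

In other words, the '%' in the character class is doing the work here; the \x{386}-\x{3ff} range would only matter if the subject string held decoded characters.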

Maxim D.

So it seems that the following is the most correct (at least it has
worked well for me so far):

    location ~ "^(/optionalwebappname)?/ProcessImageServlet.*$" {
      set $th '';

      if ($request_uri ~ "^(.*)[\\?|&]type=th(&.*)?$") {
        set $th 'th_';
      }
      rewrite ^(.+)$ http://static-dev.myhost.eu/$arg_hotel_id/$th$arg_filename break;
      expires           1d;
    }

Again, my problem was to rewrite some URLs with GET variables that were
expected to contain UTF-8 characters.
With this configuration, $arg_filename is passed correctly to the rewrite.
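A rough Python emulation of what this location block computes may make the behaviour easier to see. It is a sketch under assumptions, not nginx's implementation: the function names are invented, the query arguments are taken raw (still percent-encoded) on the assumption that this is how the $arg_ variables behave, and the thumbnail test is simplified to an exact type=th argument.

```python
from urllib.parse import urlsplit

BASE = "http://static-dev.myhost.eu"  # host taken from the config above


def raw_args(request_uri: str) -> dict:
    """Split the query string into name/value pairs without decoding them."""
    args = {}
    for pair in urlsplit(request_uri).query.split("&"):
        name, _, value = pair.partition("=")
        if name and name not in args:  # keep the first occurrence only
            args[name] = value
    return args


def redirect_target(request_uri: str) -> str:
    """Build the redirect URL the way the rewrite line assembles it."""
    args = raw_args(request_uri)
    prefix = "th_" if args.get("type") == "th" else ""
    hotel_id = args.get("hotel_id", "")
    filename = args.get("filename", "")
    return f"{BASE}/{hotel_id}/{prefix}{filename}"


uri = "/ProcessImageServlet?hotel_id=12&filename=%CE%A4%CE%B7%CE%BB%CE%AD.jpg&type=th"
print(redirect_target(uri))
# http://static-dev.myhost.eu/12/th_%CE%A4%CE%B7%CE%BB%CE%AD.jpg
```

Because the filename is copied through still encoded, no regular expression ever has to understand Greek characters, which is why this variant needs no (*UTF8) at all.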
