Felix W. writes:
-----Original Message-----
From: Daniel B.
Sent: Wednesday, August 29, 2007 10:24 AM
To: ruby-talk ML
Subject: Re: Bug in URI.parse?
It looks like URI.parse doesn’t like the leading number:
irb(main):001:0> require 'uri'
irb(main):003:0> URI.parse("http://xshare")
=> #<URI::HTTP:0x16fd906 URL:http://xshare>
irb(main):004:0> URI.parse("http://xshare-foo")
=> #<URI::HTTP:0x16fc498 URL:http://xshare-foo>
irb(main):006:0> URI.parse("http://3qshare")
URI::InvalidURIError: the scheme http does not accept registry part: 3qshare (or bad hostname?)
from C:/ruby/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
from C:/ruby/lib/ruby/1.8/uri/http.rb:78:in `initialize'
from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `new'
from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `parse'
from (irb):6
I couldn’t tell you what the proper behavior is.
Regards,
Dan
That is true, and due to the following regular expressions from
uri/common.rb:
# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
# toplabel = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"
So a valid hostname will consist of optional DOMLABELs in front of a
TOPLABEL. The TOPLABEL must start with a letter and end in a letter or
digit, with letters, digits and hyphens in between the two.
That is consistent with RFC 1035 (DOMAIN NAMES - IMPLEMENTATION AND
SPECIFICATION) [http://www.ietf.org/rfc/rfc1035.txt]:
The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less.
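The label rules above can be tried out in isolation. The following is only a sketch: the ALPHA and ALNUM character classes are written out by hand as an assumption, and the module and method names (HostnameSketch, valid_hostname?) are made up for illustration and are not part of the uri library.

```ruby
# Sketch of the hostname grammar quoted above. ALPHA and ALNUM are
# hand-written stand-ins (an assumption) for the constants in uri/common.rb.
module HostnameSketch
  ALPHA = "a-zA-Z"
  ALNUM = "a-zA-Z\\d"

  # A label may start with a letter or a digit.
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  # The last label must start with a letter.
  TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

  # Anchor the pattern so the whole string has to be a hostname.
  HOSTNAME_RE = /\A#{HOSTNAME}\z/

  def self.valid_hostname?(name)
    !!(HOSTNAME_RE =~ name)
  end
end

%w[xshare xshare-foo 3qshare].each do |name|
  puts "#{name}: #{HostnameSketch.valid_hostname?(name)}"
end
```

With these definitions, xshare and xshare-foo are accepted and 3qshare is rejected, because a single-label hostname is its own TOPLABEL and so has to start with a letter.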
The error thrown by URI.parse is a little odd in this context, but
it can be explained as follows:
In the URI.parse chain, the URI is checked against a longer regular
expression that only partly matches the hostname, but also other URI
parts (such as userinfo, the scheme etc.). The hostname part doesn’t
match here because it’s dealing with an invalid hostname. The URI
registry part does match your invalid hostname, so this information is
passed on in the array of matched URI parts for the registry.
This array is then checked in Generic.new. That constructor finds the
string passed for the registry, but the class is hard coded to not use
registries:
USE_REGISTRY = false
# DOC: FIXME!
def self.use_registry
  self::USE_REGISTRY
end
And in the constructor:
if @registry && !self.class.use_registry
  raise InvalidURIError,
    "the scheme #{@scheme} does not accept registry part: #{@registry} (or bad hostname?)"
end
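To make that chain concrete, here is a stripped-down simulation. It is not the uri library's code: the RegistrySketch module, its parse method and the REG_NAME pattern are hypothetical, and REG_NAME is deliberately looser than the real registry pattern. It only shows how a name that fails the host pattern can fall through to the registry alternative and then trip the check in the constructor.

```ruby
# Simplified simulation (assumptions throughout, not the real uri code).
module RegistrySketch
  class InvalidURIError < StandardError; end

  ALNUM    = "a-zA-Z\\d"
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[a-zA-Z](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

  # Stand-in for a registry name: anything up to the end of the authority.
  REG_NAME = "[^/?#]+"

  # The strict host pattern is tried first, then the registry pattern.
  AUTHORITY_RE = %r{\A(\w+)://(?:(#{HOSTNAME})|(#{REG_NAME}))\z}

  USE_REGISTRY = false

  def self.parse(uri)
    m = AUTHORITY_RE.match(uri)
    raise InvalidURIError, "bad URI: #{uri}" unless m

    scheme, host, registry = m[1], m[2], m[3]
    # Mirrors the constructor check quoted above.
    if registry && !USE_REGISTRY
      raise InvalidURIError,
            "the scheme #{scheme} does not accept registry part: " \
            "#{registry} (or bad hostname?)"
    end
    { scheme: scheme, host: host }
  end
end

begin
  RegistrySketch.parse("http://3qshare")
rescue RegistrySketch::InvalidURIError => e
  puts e.message
end
```

In this sketch http://xshare is captured by the host group, while http://3qshare is only captured by the registry group, which is what produces the confusing "registry part" message.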
To sum up: a hostname of 3beers-wrk is invalid as an ARPANET host
according to the RFC, so the correct solution would be to rename the
host.
Hope that helps,
Felix
While I believe that Felix’s analysis is valid, the problem is that
there are valid, real domains that start with numbers, and URI should
parse those, and in fact, it generally does.
irb(main):002:0> require 'uri'
=> true
irb(main):003:0> URI.parse('http://slashdot.org')
=> #<URI::HTTP:0x2fee3c URL:http://slashdot.org>
irb(main):004:0> URI.parse('http://401k.com')
=> #<URI::HTTP:0x2fca24 URL:http://401k.com>
irb(main):006:0> URI.parse('http://www.3com.com')
=> #<URI::HTTP:0x2f7b64 URL:http://www.3com.com>
irb(main):007:0> URI.parse('https://401k.fidelity.com')
=> #<URI::HTTPS:0x2f5364 URL:https://401k.fidelity.com>
All of these are real domains for real websites, and thus, the
suggestion of “rename the host” would not work very well.
The problem is probably better illustrated by this example:
irb(main):005:0> URI.parse('http://www.example.4bad')
URI::InvalidURIError: the scheme http does not accept registry part: www.example.4bad (or bad hostname?)
from /usr/local/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
from /usr/local/lib/ruby/1.8/uri/http.rb:78:in `initialize'
from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `new'
from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `parse'
from (irb):5
Here, the top-level domain starts with a digit, and that is not
allowed. And we will most likely never see such a beast out in the
world. So the work-around for Dan’s original problem would be to
specify the domain name with the hostname: 3qshare.
But, I would contend that this is a bug in URI. My suggestion would
be that the regex for HOSTNAME be:
HOSTNAME = "#{DOMLABEL}(?:(?:\\.#{DOMLABEL})*\\.#{TOPLABEL}\\.?)"
(I’ll admit I’m not that familiar with this regex notation, so I’m
winging it; apologies for any mistakes.) The point is that the
hostname may be specified without a domain, and if so, it must still
be parsed. If the hostname is either a fully qualified hostname or just
a domain name, then the format of a top-level domain must be checked
and enforced, with optional (sub)domains in between.
Of course, I’m working from what Felix gave above; I haven’t gone
through uri/common.rb to any significant extent, so there may be other
things that this suggestion would cause to break.
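For anyone who wants to experiment, here is a rough side-by-side of the existing pattern and the suggestion. It is a sketch under a few assumptions: the character classes are written out by hand, the proposed pattern uses TOPLABEL as the constant name, and the trailing domain group is treated as optional so that a bare hostname can match (my reading of the intent, not something tested against uri/common.rb). The ProposalSketch module and its compare method are made up for illustration.

```ruby
# Compare the existing hostname pattern with the proposed one (sketch only).
module ProposalSketch
  ALNUM    = "a-zA-Z\\d"
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[a-zA-Z](?:[-#{ALNUM}]*[#{ALNUM}])?)"

  # Existing rule: the last label must always be a TOPLABEL.
  ORIGINAL = /\A(?:#{DOMLABEL}\.)*#{TOPLABEL}\.?\z/

  # Proposed rule: a bare DOMLABEL, optionally followed by a domain
  # whose last label is a TOPLABEL.
  PROPOSED = /\A#{DOMLABEL}(?:(?:\.#{DOMLABEL})*\.#{TOPLABEL}\.?)?\z/

  def self.compare(name)
    { original: ORIGINAL.match?(name), proposed: PROPOSED.match?(name) }
  end
end

%w[3qshare 401k.com www.3com.com www.example.4bad].each do |name|
  puts "#{name}: #{ProposalSketch.compare(name)}"
end
```

Under these assumptions the two patterns agree on 401k.com, www.3com.com and www.example.4bad, and differ only on 3qshare, which the proposed pattern accepts.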
Coey
–
Coey M.