I’m trying to track down an easy way to canonicalize a URL from within
Ruby. I’ve been looking around for this, but all I can find are some
procedural hacks such as
# canonicalize the url
if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }
which isn’t going to take into account everything according to RFC 2396:
* Remove all leading and trailing dots
* Replace consecutive dots with a single dot.
* If the hostname can be parsed as an IP address, it should be
normalized to 4 dot-separated decimal values. The client should handle
any legal IP address encoding, including octal, hex, and fewer than 4
components.
* Lowercase the whole string.
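The host rules above could be sketched roughly as follows. This is only a minimal sketch, not a library API: normalize_host, normalize_ip and IP_PART are hypothetical names, and the IP handling assumes inet_aton-style semantics (the last component fills the remaining bytes), which the list itself does not spell out.

```ruby
# Hypothetical helpers; not part of Ruby's URI library.

# One IP component: hex (0x..), octal (leading 0) or decimal.
IP_PART = /\A(?:0x[0-9a-f]+|0[0-7]*|[1-9][0-9]*)\z/

# Returns the dotted-decimal form, or nil if host is not an IP address.
# Assumption: inet_aton-style rules, where the last component fills
# whatever bytes the earlier components leave over.
def normalize_ip(host)
  parts = host.split('.', -1)
  return nil unless (1..4).cover?(parts.size)
  return nil unless parts.all? { |part| part.match?(IP_PART) }

  # Integer() understands the 0x and leading-0 prefixes.
  *head, last = parts.map { |part| Integer(part) }
  return nil unless head.all? { |n| n <= 255 }
  return nil unless last < 256**(4 - head.size)

  value = head.each_with_index.map { |n, i| n << (8 * (3 - i)) }.sum + last
  [24, 16, 8, 0].map { |shift| (value >> shift) & 0xff }.join('.')
end

def normalize_host(host)
  h = host.downcase                      # lowercase the whole host
  h = h.gsub(/\.{2,}/, '.')              # collapse consecutive dots
  h = h.sub(/\A\./, '').sub(/\.\z/, '')  # strip leading/trailing dot
  normalize_ip(h) || h
end

puts normalize_host('.www.Ruby-Lang..ORG.')  # => www.ruby-lang.org
puts normalize_host('0x7F.1')                # => 127.0.0.1
puts normalize_host('3232235777')            # => 192.168.1.1
```

Note that under this assumption a host made only of digits is treated as an IP address, which may or may not be what you want.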
The sequences “/../” and “/./” in the path should be resolved, by
replacing “/./” with “/”, and removing “/../” along with the preceding
path component.
Runs of consecutive slashes should be replaced with a single slash.
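As a rough illustration of the path rules, here is a small sketch. The name normalize_path is hypothetical, and it assumes an absolute path (one that starts with a slash); keeping a trailing slash when the path ends in a dot segment is my own choice, not something the rules above state.

```ruby
# Hypothetical helper; assumes an absolute path such as "/a/b/../c".
def normalize_path(path)
  collapsed = path.gsub(%r{/{2,}}, '/')  # runs of slashes become one slash
  # Remember whether the result should end in a slash.
  trailing = collapsed.end_with?('/', '/.', '/..')

  out = []
  collapsed.split('/').each do |segment|
    case segment
    when '', '.' then next     # drop empty and "." segments
    when '..'    then out.pop  # drop the preceding component, if any
    else out << segment
    end
  end

  result = '/' + out.join('/')
  result += '/' if trailing && result != '/'
  result
end

puts normalize_path('/a//b/./c/../d')  # => /a/b/d
puts normalize_path('/ARSE/done/../../r e a r/./end/.')  # => /r e a r/end/
```

Splitting into segments sidesteps the overlapping-match problems you can run into when resolving “/./” and “/../” with repeated regex substitutions.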
Synopsis
URI::parse(uri_str)
Args
+uri_str+: String with URI.
Description
Creates one of the URI's subclasses instance from the string.
Raises
URI::InvalidURIError
Raised if URI given is not a correct one.
Usage
require 'uri'
uri = URI.parse("http://www.ruby-lang.org/")
p uri
# => #<URI::HTTP:0x202281be URL:http://www.ruby-lang.org/>
p uri.scheme
# => "http"
p uri.host
# => "www.ruby-lang.org"
As for the “Lowercase the whole string” part, only the domain is
required to be case-insensitive. It is possible for the underlying
web server to ignore case when finding a path, but the URI is not
necessarily a reference to the same resource if the case is altered.
There’s URI#normalize and URI#normalize! to downcase the host
part of the URL.
That means the path keeps its case sensitivity but the host is normalized.
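A small example of what that looks like with the stdlib (the URL itself is just made up for illustration):

```ruby
require 'uri'

uri = URI.parse('http://WWW.Ruby-Lang.ORG/En/Downloads')

# normalize returns a new URI; normalize! changes the receiver in place.
normalized = uri.normalize

puts normalized.host  # => www.ruby-lang.org
puts normalized.path  # => /En/Downloads
puts normalized.to_s  # => http://www.ruby-lang.org/En/Downloads
```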
I think that’s it - however,
try it with ruby-lang..org and you get:
/usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http does not accept registry part: www.ruBy-lang..org (or bad hostname?) (URI::InvalidURIError)
from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
from canon.rb:3
So I guess it needs a bit of error checking beforehand.
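One way to do that error checking might be to rescue URI::InvalidURIError around the parse. This is only a sketch: safe_parse is a hypothetical name, and exactly which strings get rejected may differ between Ruby versions, so the example uses a URL with a space in the host rather than the double-dot one.

```ruby
require 'uri'

# Hypothetical helper: returns a URI object, or nil if the string
# cannot be parsed.
def safe_parse(str)
  URI.parse(str)
rescue URI::InvalidURIError
  nil
end

good = safe_parse('http://www.ruby-lang.org/')
bad  = safe_parse('http://www.ruby lang.org/')  # space in the host

puts good.host   # => www.ruby-lang.org
puts bad.inspect # => nil
```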
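require 'uri'

def canonicalize(uri)
  u = uri.kind_of?(URI) ? uri : URI.parse(uri.to_s)
  u.normalize!             # downcases scheme and host
  newpath = u.path.dup
  # Repeatedly drop "segment/../" pairs; a leading ".." is left alone.
  loop do
    before = newpath.dup
    newpath.gsub!(%r{([^/]+)/\.\./?}) { |match| $1 == '..' ? match : '' }
    break if newpath == before   # stop once nothing changes
  end
  # Resolve "/./" and a trailing "/."
  newpath = newpath.gsub(%r{/\./}, '/').sub(%r{/\.\z}, '/')
  u.path = newpath
  u.to_s
end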
Take a look at Addressable. It’s intended as a standards-compliant
replacement for the stdlib’s URI library. Take a look into the test
directory of that sucker: over 440 Unit Tests (actually, Object Examples)
for a frickin’ URI parser!
(See: http://Addressable.RubyForge.Org/specdoc/) That guy is nuts!
That code’s gotta be as rock-solid as it gets.
Oh, and back to the topic at hand: it has a normalize method built in:
begin
  require 'rubygems'
  gem 'addressable'
rescue LoadError; end
require 'addressable/uri'
uri = Addressable::URI.heuristic_parse('www.Ruby-Lang..ORG/ARSE/done/../../r e a r/./end/.#exit')
uri.normalize!
puts uri.display_uri # => http://www.ruby-lang..org/r%20e%20a%20r/end/#exit
I’m not sure. I just scanned RFC 3986 and RFC 1034 and I’m not even sure
that’s a valid URI host part to begin with. If it’s invalid, then
there’s not much a URI normalizer can do, right?
However, I could be wrong. Reading RFCs is not exactly my specialty.