IP to Country (#139)

bbazzarrakk · September 19, 2007, 1:45am

From: “Matthias Wächter” [email protected]

James Edward G. II schrieb:

On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:

Anyway – i’d like to see a 100000 lookups comparison hehe
Great. Run it for us and let us know how we do.

Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory

Thanks for putting that together! Fun to see the different
times.

If I’ve understood correctly, it looks like my solution seems
to be the fastest (so far) of those that operate on the
unmodified .csv file?

I wasn’t expecting that, at all…

I would have bet Simon’s would be faster. Strange!

Regards,

Bill

bbazzarrakk · September 19, 2007, 5:13am

On Sep 18, 2007, at 6:23 PM, Matthias Wächter wrote:

James Edward G. II schrieb:

On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:

Anyway – i’d like to see a 100000 lookups comparison hehe
Great. Run it for us and let us know how we do.

Here are the results of the supplied solutions so far, and it looks
like my solution can take the 100k-performance victory

Thank you very much for putting this together and wow, your code is
lightening quick.

James Edward G. II

bbazzarrakk · September 19, 2007, 4:34am

From: “Adam S.” [email protected]
On 9/18/07, Bill K. [email protected] wrote:

If I’ve understood correctly, it looks like my solution seems
to be the fastest (so far) of those that operate on the
unmodified .csv file?

It depends what you mean by unmodified - my algorithm runs off the
original file, the only “modification” I am doing in the setup stage
is searching for and saving the byte offset of the first and last
records. It looks like l could have done that every time my script
was run and only added 5 ms.

Ah, I see. Cool.

Incidentally since the file format description indicates comment
lines may appear anywhere in the file, I allowed for that. However,
I doubt adding a loop to your gets/split logic to keep going until a
valid record was found would affect your time much at all.

Nice job

Regards,

Bill

bbazzarrakk · September 19, 2007, 9:28am

On 9/19/07, Matthias Wächter [email protected] wrote:

James Edward G. II schrieb:

On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:

Anyway – i’d like to see a 100000 lookups comparison hehe
Great. Run it for us and let us know how we do.

[**]: O Jesus :), I can’t make your FasterCSV version (a) run, and in
the later version you sent your direct parsing breaks when it comes to
detecting the commented lines in the first part of the file. I couldn’t
manage to make it run, sorry.

Hi,

Yes, I only tested with my version of the file, from which I manually
removed the comments. I don’t think I’ll have time to fix that, at
least this week. Anyway, I find strange the FasterCSV version doesn’t
work, because it delegates the parsing of the file to that gem, and
the rest is pretty simple. On the other hand I don’t expect the first
version to perform anywhere near the other solutions, so it’s not so
important :-).

Jesus.

bbazzarrakk · September 19, 2007, 2:21pm

On Sep 19, 2007, at 2:27 AM, Jesús Gabriel y Galán wrote:

couldn’t
manage to make it run, sorry.

Hi,

Yes, I only tested with my version of the file, from which I manually
removed the comments. I don’t think I’ll have time to fix that, at
least this week. Anyway, I find strange the FasterCSV version doesn’t
work, because it delegates the parsing of the file to that gem, and
the rest is pretty simple.

Comments are not a part of the CSV specification, so FasterCSV
doesn’t address them. I would like to add ignore patterns at some
point though.

James Edward G. II

bbazzarrakk · September 19, 2007, 5:15pm

Ok, here’s the result of mine:

mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s

Here’s my code:

#!/usr/bin/ruby
ARGV.each { |ip|
f = ip.split(/./).join “/”
puts File.open(f).readlines[0] rescue puts “Unknown”
}

I think it’s pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not?

Mark.

bbazzarrakk · September 19, 2007, 7:34pm

From: “James Edward G. II” [email protected]

the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not?

LOL! Nice…

Pretty clever.

I bet with the right prep, this could even be a pretty viable
approach. Instead of building a file for each address you could
create a directory structure for the hexadecimal representations of
each piece of the address. The final layer could be handled as you
have here or with a search through a much smaller file.

Indeed, I was thinking last night about preprocessing the data
into an on-disk hash table. I was thinking about a flat file
at the time, but one could use a subdirectory technique like the
above… using hex components of the hash value to index through
a couple subdirs to reach a leaf file containing one or more
records.

Another thing I’d like to try but won’t have time for, is to
use ruby-mmap somehow. (Preprocess the data into a flat binary
file, then use ruby-mmap when performing the binary search.)

Anyway thanks for the fun quiz.

Regards,

Bill

bbazzarrakk · September 19, 2007, 10:45pm

Adam S. wrote:

I would have bet Simon’s would be faster. Strange!

I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it’s the regexp that slowed it
down.

Without looking at the other solutions in detail i think one of the
problems
may be that my solution opens the file for each lookup - thats of course
easy
to fix. I don’t know if thats the problem or the overhead of creating
ten
thousand IPAddr objects - i refuse to analyse this in depth because i
don’t
have a usecase for looking up that many locations in a single run.

(on the other hand i do understand how much fun it can be to optimize
such a
problem to death - so go on if you like, i don’t have the motivation -
this time

cheers

Simon

bbazzarrakk · September 19, 2007, 5:43pm

On Sep 19, 2007, at 10:15 AM, Mark T. wrote:

Here’s my code:

#!/usr/bin/ruby
ARGV.each { |ip|
f = ip.split(/./).join “/”
puts File.open(f).readlines[0] rescue puts “Unknown”
}

I think it’s pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not?

Pretty clever.

I bet with the right prep, this could even be a pretty viable
approach. Instead of building a file for each address you could
create a directory structure for the hexadecimal representations of
each piece of the address. The final layer could be handled as you
have here or with a search through a much smaller file.

James Edward G. II

bbazzarrakk · September 25, 2007, 11:03pm

On Sep 15, 2007, at 2:45 AM, Clifford H. wrote:

Ruby Q. wrote:

This week’s Ruby Q. is to write a simple utility. Your program
should accept
an IP address as a command-line argument and print out the two
letter code for
the country that IP is assigned in.

I assume that it’s not ok for folk to use my Ruby G.IP gem?
It has a reasonable compromise between speed and memory usage.

Please do. I would love to see how it stacks up against the custom
solutions we will no doubt see.

James Edward G. II

bbazzarrakk · September 25, 2007, 11:03pm

From: “Eugene K.” [email protected]

1, 2, and 5?

Check what ccountry is 5.1.1.1. If you get any valid answer - this answer
is a lie

Ah, OK. I get:

ruby 139_ip_to_country.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1
5.1.1.1
ZZ
(1.1.1.1 not found)
(2.1.1.1 not found)
US
US
(5.1.1.1 not found)

Regards,

Bill

bbazzarrakk · September 25, 2007, 11:04pm

From: “Simon Kröger” [email protected]

Ok, my script does not need any initialization, it uses the file
IpToCountry.csv exactly as downloaded.

We probably did something similar. Mine also works on the
unmodified IpToCountry.csv file.

$ time ruby 139_ip_to_country.rb 67.19.248.74 70.87.101.66
205.234.109.18 217.146.186.221 62.75.166.87
US
US
US
GB
DE

real 0m0.122s
user 0m0.015s
sys 0m0.000s

(ruby 1.8.4 (2005-12-24) [i386-mswin32], timed from cygwin bash shell,
2GHz athlon64, winxp.)

I don’t think the timings are very accurate on this system. It
didn’t change much whether I looked up one IP or five.

. . . Looking up 80 IPs on one command line resulted in:

real 0m0.242s
user 0m0.015s
sys 0m0.016s

Regards,

Bill

bbazzarrakk · September 25, 2007, 11:06pm

On Sep 14, 2007, at 2:35 PM, Simon Kröger wrote:

user    0m0.259s
real 0m0.046s
user 0m0.046s
sys 0m0.015s

Wow, I’m impressed. Can’t wait to see that code!

James Edward G. II

bbazzarrakk · September 25, 2007, 11:05pm

I think that the timing of the scripts are not a good index. It all
depends on what hardware/os you are running it on.
If we want to use speed as an index we should probably have J.E.
compare them all on the same machine.

Maybe we could also write a ruby script that runs all the entry
scripts and time them, and that could be another ruby quiz which will
also be voted on speed and then we could write a ruby script to time
those entries an then we could … Just ignore this paragraph

Diego Scataglini

bbazzarrakk · September 25, 2007, 11:08pm

This is the code I used to generate the quiz example. To give credit
where credit is due though, it was heavily inspired from some code
Allan Odgaard shared with me:

#!/usr/bin/env ruby -wKU

require “open-uri”
require “zlib”

begin
require “rubygems”
rescue LoadError

load without gems

end

begin
require “faster_csv”
FCSV.build_csv_interface
rescue LoadError
require “csv”
end

class IP
def initialize(address)
@address_chunks = address.split(“.”).map { |n| Integer(n) }
raise AgumentError, “Malformed IP” unless @address_chunks.size == 4
end

def to_i
@address_chunks.inject { |result, chunk| result * 256 + chunk }
end

STRING_SIZE = new(“255.255.255.255”).to_i.to_s.size

def to_s
“%#{STRING_SIZE}s” % to_i
end
end

class IPToCountryDB
REMOTE = “software77.net?
action=download”
LOCAL = “ip_to_country.txt”

RECORD_SIZE = IP::STRING_SIZE * 2 + 5

def self.build(path = LOCAL)
open(path, “w”) do |db|
open(REMOTE) do |url|
csv = Zlib::GzipReader.new(url)

     last_range = Array.new
     csv.each do |line|
       next if line =~ /\A\s*(?:#|\z)/
       from, to, country = CSV.parse_line(line).values_at(0..1, 4).
                               map { |f| Integer(f) rescue f }
       if last_range[2] == country and last_range[1] + 1 == from
         last_range[1] = to
       else
         build_line(db, last_range)
         last_range = [from, to, country]
       end
     end
     build_line(db, last_range)
   end
 end

end

def self.build_line(db, fields)
return if fields.empty?
db.printf(“%#{IP::STRING_SIZE}s\t%#{IP::STRING_SIZE}s\t%s\n”,
*fields)
end
private_class_method :build_line

def initialize(path = LOCAL)
begin
@db = open(path)
rescue Errno::ENOENT
self.class.build(path)
retry
end
end

def search(address)
binary_search(IP.new(address).to_i)
end

private

def binary_search(ip, min = 0, max = @db.stat.size / RECORD_SIZE)
return “Unknown” if min == max

 middle = (min + max) / 2
 @db.seek(RECORD_SIZE * middle, IO::SEEK_SET)

 if @db.read(RECORD_SIZE) =~ /\A *(\d+)\t *(\d+)\t([A-Z]{2})\n\z/
   if    ip < $1.to_i then binary_search(ip, min,        middle)
   elsif ip > $2.to_i then binary_search(ip, middle + 1, max)
   else                    $3
   end
 else
   raise "Malformed database at offset #{RECORD_SIZE * middle}"
 end

end
end

if FILE == $PROGRAM_NAME
require “optparse”

options = {:db => IPToCountryDB::LOCAL, :rebuild => false}

ARGV.options do |opts|
opts.banner = “Usage:\n” +
" #{File.basename($PROGRAM_NAME)} [-d PATH] IP\n" +
" #{File.basename($PROGRAM_NAME)} [-d PATH] -r"

 opts.separator ""
 opts.separator "Specific Options:"

 opts.on("-d", "--db PATH", String, "The path to database file")

do |path|
options[:db] = path
end
opts.on(“-r”, “–rebuild”, “Rebuild the database”) do
options[:rebuild] = true
end

 opts.separator "Common Options:"

 opts.on("-h", "--help", "Show this message.") do
   puts opts
   exit
 end

 begin
   opts.parse!
   raise "No IP address given" unless options[:rebuild] or

ARGV.size == 1
rescue
puts opts
exit
end
end

if options[:rebuild]
IPToCountryDB.build(options[:db])
else
puts IPToCountryDB.new(options[:db]).search(ARGV.shift)
end
end

END

James Edward G. II

bbazzarrakk · September 25, 2007, 11:09pm

On Sep 16, 2007, at 7:21 PM, Bill K. wrote:

(256.256.256.256 or 0.0.0.-1, latter if comments are stripped out)
ruby 139_ip_to_country.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1
5.1.1.1
ZZ
(1.1.1.1 not found)
(2.1.1.1 not found)
US
US
(5.1.1.1 not found)

Well, we weren’t all broken:

$ ruby ip_to_country.rb 5.1.1.1
Unknown

James Edward G. II

bbazzarrakk · September 25, 2007, 11:10pm

“James Edward G. II” [email protected] wrote in message
news:4A414BD9-8B56-4897-A13E-

Well, we weren’t all broken:

I’ve sent my comment and refreshed new headers - and yes, your solution
came
in

–EK

bbazzarrakk · September 25, 2007, 11:10pm

On Sep 14, 2007, at 4:58 PM, diego scataglini wrote:

I think that the timing of the scripts are not a good index. It all
depends on what hardware/os you are running it on.
If we want to use speed as an index we should probably have J.E.
compare them all on the same machine.

I view it that we are getting a rough idea of speeds, not exact
counts. Both scripts timed so far seem to be able to answer the
question in under a second on semi-current hardware. Good enough for
me.

Maybe we could also write a ruby script that runs all the entry
scripts and time them, and that could be another ruby quiz which
will also be voted on speed and then we could write a ruby script
to time those entries an then we could … Just ignore this paragraph

Thank you for volunteering…

James Edward G. II

bbazzarrakk · September 25, 2007, 11:10pm

Wow, I’m impressed. Can’t wait to see that code!

Thanks.

Because startup time startet to take over the benchmark:

$ time ruby quiz139.rb 68.97.89.187 84.191.4.10 80.79.64.128
210.185.128.123
202.10.4.222 192.189.119.1
US
DE
RU
JP
AU
EU

real 0m0.078s
user 0m0.046s
sys 0m0.031s

and by the way: thanks for telling me such a database exists!

cheers

Simon

IP to Country (#139)

real 0m0.046s user 0m0.046s sys 0m0.015s

load without gems

Because startup time startet to take over the benchmark:

real 0m0.078s user 0m0.046s sys 0m0.031s

real 0m0.046s
user 0m0.046s
sys 0m0.015s

real 0m0.078s
user 0m0.046s
sys 0m0.031s