For performance, write it in C - Part 3, Source code now ava

On 8/3/06, Charles O Nutter [email protected] wrote:

On 8/2/06, Jon H. [email protected] wrote:

[SNIP]

Oh, and maybe Java 6 too? It’s like 20-30% faster than Java 5…I don’t
know
how Sun does it.

Because they held back with prior versions, I do not claim that they
did
this on purpose, but it is my experience in the closed source world time
to
market is more important than quality.
Now please let me explain. Java is an excellent product [ I do not like
it
at all, but so what :frowning: ] and had a good VM and excellent JIT all the
time so
why would management focus on peformance the focused on features and I
would
have done the same.
Apparently all the new features of 1.5 were optimizable, which they did.
After all a natural approach and a good job.
If only Java were open source and a programming language for that matter
;).

BTW Ruby does the same, it is the most feature rich language of the
world
and slow.
Rite will be as fast as lightning, am I rite?

Cheers
Robert


Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

  • Albert Einstein

Peter H. wrote:

I will take the revised Ocaml version and time it and check the output
tonight.

Yeah. I didn’t bother doing that. :wink:

Yup, got that. Will time and test it tonight.

Charles O Nutter wrote:

I’d be interested in seeing all of these run without IO as well…that
was
the big bottleneck for Java.

Can we just clear something up here. The Latin squares mini project was
not plucked out of the air just to be a poster child for my first post.
It was part of a project that I am working on and as such the program
has to produce some output! It does not make any sense to optimise the
program in any language if we are not going to produce any output.
Otherwise I would go with this heavily optimised C version:

[peterhickman]$ cat latin.c
int
main(int argc, char *argv[])
{
return 0;
}

This is an application, it has to produce output. Cease this willy
waving otherwise you are saying “For performance, remove the output”.

Jon H. wrote:

Peter H. wrote:

I will take the revised Ocaml version and time it and check the output
tonight.

Yeah. I didn’t bother doing that. :wink:

Not wanting to name and shame but there was one submission (off list)
that was faster for completely the wrong reasons. Besides I get a warm
fuzzy feeling when the results match because the pain of checking which
program is actually wrong is too great. As my Perl, Java and C versions
all use the same Perl to pre compute their values they all produced the
same results, consistent yes but are they correct? I’m pretty confident
now, especially now that I have independently written programs produce
the same results.

The next stage, which is to stop it reporting rotational and flipped
versions of a grid, is going to be quite a grind to test.

Any views on Practical Ocaml by Joshua B. Smith, does it look like it
might be a good book to learn from?

I will add it to the page.

On Thu, 03 Aug 2006 01:08:21 +0900, Peter H. wrote:

Here’s how it goes on my box:

[Latin]$ time ruby latin2.rb > r5

real 0m16.283s
user 0m15.380s
sys 0m0.498s

Which means that your computer is about as crap as mine :slight_smile:

Yes. I was really suprised to see that your timings where longer than
mine :slight_smile: But I am happy with the speed of my computer, though emacs
startup is a bit slow… :slight_smile:

I will add it
to the pages and make it a download. I think I need a table summarising
the timings.

I just realised that it is better to move the permutations on the
row string (the ones that use String#tr) outside the row permutations,
to avoid recalculating the rows each time. The only method I changed is
print_solution. The program runs almost twice as fast now! (0m7.8s on
my
computer).

Regards,
Kristof

----------------- latin.rb, version 2 -----------------------

require ‘permutation’

$size = (ARGV.shift || 5).to_i
MaxVal = $size-1

RowPermutations = Permutation.new($size).map{|p| p.value}
BaseStr = (2…$size).to_a.join
StrPermutations = Permutation.new($size-1).map{|p| p.project(BaseStr)}

StartColumns = (1…MaxVal).to_a
def init_columns(el)
a = StartColumns.dup
a.delete_at(el-1)
return a
end

def insert(sqr, num, row, columns)
insert(sqr, num, row+1, columns) if (row == num)
columns.each do |col|
next if sqr[row][col] != ?.
sqr[row][col] = num + ?1
if (row == MaxVal)
insert(sqr, num+1, 1, init_columns(num+1))
elsif (num == MaxVal && row == MaxVal - 1)
print_solution(sqr)
else
insert(sqr, num, row+1, columns - [col])
end
sqr[row][col] = ?.
end
end

def print_solution(sqr)
StrPermutations.each do |sp|
newsqr = sqr.map { |r| r.tr(BaseStr, sp) }
RowPermutations.each do |rp|
rp.each do |r|
print newsqr[r]
print “:”
end
puts
end
end
end

$square = [(“1” + BaseStr)] +
Array.new($size-1) { |i| (i+2).to_s + “.” * ($size - 1) }

insert($square, 0, 1, StartColumns)

On Thu, 03 Aug 2006 21:59:40 +0900, Peter H. wrote:

Not wanting to name and shame but there was one submission (off list)
that was faster for completely the wrong reasons. Besides I get a warm
fuzzy feeling when the results match because the pain of checking which
program is actually wrong is too great. As my Perl, Java and C versions
all use the same Perl to pre compute their values they all produced the
same results, consistent yes but are they correct? I’m pretty confident
now, especially now that I have independently written programs produce
the same results.

I used the following tests:

“ruby latin.rb | wc -l” to test if the number of squares matches the
number that is written in the wikipedia article.
Then “ruby latin.rb | sort | uniq | wc -l” shows if all the squares that
are output by the program are different.
All that remains to be done is to check that each latin square is
actually a latin square. That wouldn’t be so much work, but I just
looked
at some of the output, and trusted that the rest of the squares are
correct too :slight_smile:

Regards,
Kristof

Chad P. wrote:

I find that I pick up a language faster if
there’s a good community for it whose means of communication suits the
way I think, though. What sort of online OCaml community resources are
available?

Primarily, the online resources come from one of the two mailing lists,
OCaml-Beginners[1] for introductory to intermediate questions, and the
main OCaml list[2] for intermediate to advanced questions, or questions
on the OCaml internals and whatnot. There are a few other places where
info moves, like Richard J.’ cocan wiki [3], but at the moment
things are primarily handled through the lists (though it might be
about time for something like perlmonks to arrive as the community is
growing).

[1] Yahoo | Mail, Weather, Search, Politics, News, Finance, Sports & Videos
[2] http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
[3] http://wiki.cocan.org/

Peter H. wrote:

Any views on Practical Ocaml by Joshua B. Smith, does it look like it
might be a good book to learn from?

Hard to say, as it won’t be available for another three months, and I
can’t even find a sample table of contents…

However, in the meantime, some of the best resources are freely
available. In addition to parts I, III, and IV of the OCaml manual, I
recommend anyone starting with OCaml to begin with Jason Hickey’s book
[1], and then look at the translation of the older French O’Reilly book
[2] (note however, that the book is a bit out of date, but this only
crops up as a real issue a few times – feel free to hit the
ocaml-beginners mailing list [3] with questions). Richard J. also
has a pretty nice tutorial available online [4]. Once you get going,
there are more specific resources for tools like OCamllex, ocamlyacc,
and camlp4.

[1] http://files.metaprl.org/doc/ocaml-book.pdf
[2] Developing applications with Objective Caml
[3] Yahoo | Mail, Weather, Search, Politics, News, Finance, Sports & Videos
[4] http://www.ocaml-tutorial.org/

Peter H. wrote:

Otherwise I would go with this heavily optimised C version:

[peterhickman]$ cat latin.c
int
main(int argc, char *argv[])
{
return 0;
}

This is an application, it has to produce output. Cease this willy
waving otherwise you are saying “For performance, remove the output”.

  1. Over the past 3 or 4 days, people have said what’s wrong with the
    way you’re producing output in Java, and suggested what you should do
    instead. Is it that you don’t know enough Java to follow their
    suggestions and fix the program?

A couple of days ago single measurements on my machine gave
0.820s user+sys, gcc -pipe -Wall -O3 -fomit-frame-pointer
-funroll-loops -march=pentium4
4.444s user+sys, sun-jdk-1.5.0.07

Simply writing Java that converts the double-byte Unicode Java
outputStrings to single-byte /once/ instead of everytime there’s a
print statement, and following the usual Java pattern of wrapping
output with a BufferedOutputStream, reduces that to

1.396s user+sys, sun-jdk-1.5.0.07

Yesterday I emailed you another version of your Java generating Perl
script that fixes those print problems, the output matches the C
program output.

  1. If this is an application, and it has to produce output, and it has
    to produce output for something larger than 6x6, then I think you need
    a better algorithm.

You’ve mentioned the 5x5 C program took 2.473s, and the 6x6 took 9900s

  • at that rate might we expect 7x7 in 15 months?

The thing that was bothering me was “have I missed any?” Sure the
results I had were correct but I was unsure if they were complete. For
example the Perl permutation module and the Ruby one could perhaps share
some libpermute.so that I was unaware of and therefore harbour the same
error - this is not the case however. Or even be both developed from the
same source. Paranoia I know but many libraries and modules are based on
existing code rather than developed from scratch so bring subtle bugs
with them. Like the one recently found in the binary search, see
Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken – Google Research Blog.

But given that there is no relationship between the Perl, Ruby and Ocaml
versions I am sure that there will be no surprises.

Isaac G. wrote:

  1. Over the past 3 or 4 days, people have said what’s wrong with the
    way you’re producing output in Java, and suggested what you should do
    instead. Is it that you don’t know enough Java to follow their
    suggestions and fix the program?

So far the only language that has come close to running as fast as the C
is the Ocaml version and since then I have found other ways to speed up
the C version. Just as I have posted the Ruby and Ocaml version if any
certified Java expert cares to show me how it should be done (cough
Charles O Nutter cough) then I will be happy to time it and check the
output and report the results. That way we can get away from all this
stupidity like the fact that I didn’t use the correct naming convention.
Which we all know hugely affects performance. But remember that the
timings will be based on a program that produces output.

1.396s user+sys, sun-jdk-1.5.0.07

Yesterday I emailed you another version of your Java generating Perl
script that fixes those print problems, the output matches the C
program output.

If it was the email I replied to yesterday then reason that it got close
to the speed of the C version was because it didn’t do any output. Once
I put the output back in it was only around 5 seconds faster than my own
Java version and around 15 seconds slower than the C version. Hardly
encouragement to drop C and pursue Java. Remember I am running the
program to get the output.

  1. If this is an application, and it has to produce output, and it has
    to produce output for something larger than 6x6, then I think you need
    a better algorithm.

You’ve mentioned the 5x5 C program took 2.473s, and the 6x6 took 9900s

  • at that rate might we expect 7x7 in 15 months?

Actually the real problem is the amount of disk space I need to store
the output. 6 x 6 resulted in 32Gb of output. The 7 x 7 will probably
eat my hard disk well before it completes. I’m not sure that the 1Tb I
have will be enough so I am looking at removing the rotational and
mirrored versions of the grids from the output.

Jon H. wrote:

OCaml 0.161s 0.626s ocamlopt -inline 100 -unsafe

let n = fact size
if !i = 0 then 1 else begin
let perms = Array.init n (fun _ -> ignore(gen_perm()); Array.copy p)
list))
print_string op;
print_string “\n”

A small point.
print_endline op
is shorter. Is it faster?

Peter H. wrote:

Charles O Nutter cough) then I will be happy to time it and check the
output and report the results.

So you don’t know enough Java?

outputStrings to single-byte /once/ instead of everytime there’s a
to the speed of the C version was because it didn’t do any output. Once
I put the output back in it was only around 5 seconds faster than my own
Java version and around 15 seconds slower than the C version. Hardly
encouragement to drop C and pursue Java. Remember I am running the
program to get the output.

If it didn’t produce any output I would have mentioned that, as I have
before.

Check your email - the subject line reads " For performance use ascii
output"

On 8/3/06, Charles O Nutter [email protected] wrote:

“The numbers [image: N(n,n)] of Latin squares of order [image: n==1], 2,
… are 1, 2, 12, 576, 161280, … (Sloane’s A002860http://www.research.att.com/~njas/sequences/A002860
).”

That didn’t come through right…too-smart copying and pasting on
Firefox’s
part.

“The number N(n, n) of Latin squares of order n = 1, 2, … are 1, 2,
12,
576, 161280, …”

On 8/3/06, Peter H. [email protected] wrote:

certified Java expert cares to show me how it should be done (cough
Charles O Nutter cough) then I will be happy to time it and check the
output and report the results. That way we can get away from all this
stupidity like the fact that I didn’t use the correct naming convention.
Which we all know hugely affects performance. But remember that the
timings will be based on a program that produces output.

My original two emails about the Java version described in detail what
changes were needed to fix it. I also offered to post my version of the
code. I never got any response to that email, and you’ve still got the
old
29-second numbers posted on your page. I find it hard to believe you
actually want to fix the Java version when numerous people have told
you
exactly how to do so, and you seem to have missed all of them.

Yesterday I emailed you another version of your Java generating Perl

script that fixes those print problems, the output matches the C
program output.

If it was the email I replied to yesterday then reason that it got close
to the speed of the C version was because it didn’t do any output. Once
I put the output back in it was only around 5 seconds faster than my own
Java version and around 15 seconds slower than the C version. Hardly
encouragement to drop C and pursue Java. Remember I am running the
program to get the output.

You may have to accept the fact that the G4 version of Java is not as
fast
as current versions. I doubt there’s a Java 6 for the G4, but it would
be
worth trying if there is. Everyone except you seems to have no trouble
getting the Java version to run fast, even with output.

Ultimately, however, I don’t think you really care…if you’re trying to
get
the job done and this is the best Java code you could write, perhaps you
should not use Java. Sorry to put it bluntly, but benchmarking and Java
programming do not appear to be your strong areas.

Actually the real problem is the amount of disk space I need to store

the output. 6 x 6 resulted in 32Gb of output. The 7 x 7 will probably
eat my hard disk well before it completes. I’m not sure that the 1Tb I
have will be enough so I am looking at removing the rotational and
mirrored versions of the grids from the output.

I think you’re goint to have some trouble regardless of the algorithm.
The
data set for latin squares grows extremely fast, and you’re unlikely to
ever
get above 6x6 calculating in realtime.

See Latin Square -- from Wolfram MathWorld

“The numbers [image: N(n,n)] of Latin squares of order [image: n==1], 2,

are 1, 2, 12, 576, 161280, … (Sloane’s
A002860http://www.research.att.com/~njas/sequences/A002860
).”

I think you’re going to run out of space pretty quick, not to mention
the
time it’s going to take to calculate.

On Fri, Aug 04, 2006 at 12:40:22AM +0900, [email protected] wrote:

on the OCaml internals and whatnot. There are a few other places where
info moves, like Richard J.’ cocan wiki [3], but at the moment
things are primarily handled through the lists (though it might be
about time for something like perlmonks to arrive as the community is
growing).

[1] Yahoo | Mail, Weather, Search, Politics, News, Finance, Sports & Videos
[2] http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
[3] http://wiki.cocan.org/

Thanks. I’ll look into it.

You mention it needs something like PerlMonks – which is interesting,
considering I’ve noticed there’s nothing like that for Ruby, either.
I’ve found myself wishing a couple of times that there was something
like a RubyMonks kinda thing (though in a more Rubyish idiom than *Monks
would be), but every time I consider the notion I find myself wishing
that PerlMonks were just a bit more multi-language, focusing primarily
on languages I personally find interesting. Bah.

I’ve considered putting something vaguely like that together myself,
actually, but I’m lazy and too much of a perfectionist. Since I’d be
working toward my own project spec to create such a website, I’d
probably just end up spending months working on what I want the site to
do and never get around to writing a line of code.

Jon H. wrote:

let incompat =
let aux x y =
Array.iteri (fun i px -> if px = perms.(y).(i) then raise Exit) perms
(x);

Perhaps google has mangled you code; at this point ocaml reports

File “latin-squares-H.ml”, line 30, characters 4-15:
This function is applied to too many arguments,
maybe you forgot a `;’

Jon H. wrote:

Here’s my OCaml2 (based upon William’s):

Just a small note here:
Depending on your goals, you can do even better by stuffing the output
in a buffer then dumping the buffer at the end of computation (see
example code below). Of course this consumes more memory, and isn’t
possible on a larger board, but it’s yet another tweak to exploit,
since we’re benchmarking output in addition to benchmarking work (note
that on my G5 output to stdout is a bit slower than your version, but
this isn’t the case on the AMD machines I have access to – redirecting
to a file is always faster).

let rec add_a_row row =
let buf = Buffer.create 1_000_000 in
let rec addh row =
match row with
| x when x=size ->
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
Buffer.add_string buf op; Buffer.add_char buf ‘\n’
| _ ->
for latest = 0 to n - 1 do
let prev_row = ref 0 in
let incompat = incompat.(latest) in
while !prev_row < row && not incompat.(board.(!prev_row)) do
incr prev_row
done;
if !prev_row = row then (board.(row) <- latest; addh (row + 1))
done in
addh row;
Buffer.output_buffer stdout buf

let () =
add_a_row 0;