For performance, write it in C - Part 3, Source code now ava

On Fri, Aug 04, 2006 at 11:30:11AM +0900, William J. wrote:

Jon H. wrote:

but your timings are probably more accurate because your computer is so
slow. :wink:

It’s 800MHz. I ran each program 4 times and took the average.

I guess that means we should be testing everything on the laptop I’m
currently using as a remote terminal for my server. It’s a 366MHz
dinosaur.

. . . except I’m doing real work right now, and don’t need to be bogging
stuff down.

Peter H. wrote:

version is now the fastest.

real 0m3.660s
user 0m1.958s
sys 0m1.421s

The source code for both are available on the web site for you to examine.

I stole code and ideas from Jon H., so this should be called
the James/Harrop version.

Mmmm. Your code convinces me, you have a sale.

When I get home I’ll time it and add it.

I’m beginning to wonder if my computer is slow enough to give meaningful
timings as the Ocaml versions seem to be pushing for sub second timings
:slight_smile:

William J. wrote:

But the big news is that William J.’ revision of his previous Ocaml
version is now the fastest.

real 0m3.660s
user 0m1.958s
sys 0m1.421s

The source code for both are available on the web site for you to examine.

I stole code and ideas from Jon H., so this should be called
the James/Harrop version.

I just noticed that your site doesn’t have my last version (the 4th).
Here it is again:

(* Thanks to Jon Harrup for code and ideas. )
(
compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)

(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
(hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ]

let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list

let perms = Array.of_list (permute list)
let n = Array.length perms

(* Boolean array used to determine if one row is
compatible with another. *)
let compatible = Array.make_matrix n n true ;;
Array.iteri (fun x ex ->
Array.iteri (fun y ey ->
compatible.(x).(y) <- List.for_all2 (<>) ex ey) perms ) perms

let join list = String.concat “” (List.map string_of_int list)
let output_strings = Array.map join perms

(* For speed, create a string that’s the length of the lines
that we’ll print; the :'s that aren’t needed as separators
will later be overwritten. )
let output_line = String.make (size
(size+1)-1) ‘:’ ^ “\n”
let board = Array.make size 0

(* A recursive function. )
let rec add_a_row row =
if row = size then
( for i=0 to size-1 do
String.blit
output_strings.(board.(i)) 0 (
source )
output_line (i
(size+1)) (* dest )
size
done;
print_string output_line
)
else
for latest = 0 to n - 1 do
let compatible_slice = compatible.(latest) in
(
Create a changeable thing (variable). )
let prev_row = ref 0 in
(
The ! below fetches the variable’s value. *)
while !prev_row < row &&
compatible_slice.(board.(!prev_row)) do
incr prev_row
done;
if !prev_row = row then
( board.(row) <- latest ; add_a_row (row + 1) )
done
;;

add_a_row 0

Peter H. wrote:

When I get home I’ll time it and add it.

I’m beginning to wonder if my computer is slow enough to give meaningful
timings as the Ocaml versions seem to be pushing for sub second timings :slight_smile:

In order to be fair to gcc, I’d like to point out one thing.
Although I used your algorithm, I tweaked it in an attempt to make
it faster. So I think that you should take my add_a_row function
and translate it back to C.

I changed the use of the bool data type (which is really an int) to char
and shrunk the size of the program and sped it up.

real 0m1.885s
user 0m1.649s
sys 0m0.168s

I think this is because the top most loop, when processing the top row,
basically just zooms down a given row of the compared array. With four
chars being packed in where one bool would be more of the tests will be
using data that is available in the CPUs cache.

Peter H. wrote:

Time for another update.

Isaac G. provided a Java implementation based on mine (ie still pre
computes the tables in Perl) that brought the times down to sub 9 seconds.

real 0m8.966s
user 0m5.815s
sys 0m1.488s

sub 9 seconds?
7.3s
“real” is elapsed time, which includes all those other processes that
grabbed CPU after you gave the time command.

Peter H. wrote:

Time for another update.

The source code for both are available on the web site for you to examine.

That URL again? -Tim

Peter H. wrote:

will be using data that is available in the CPUs cache.

Sorry but I have to retract these timings. They were from the tests to
find the best gcc compiler settings and do not include the Perl pre
calculation for each run so they cannot be compared to any of the other
timings. Sorry about that.

Isaac G. wrote:

sub 9 seconds?
7.3s
“real” is elapsed time, which includes all those other processes that
grabbed CPU after you gave the time command.

This is a good point but in all the posts where I have mentioned the
time taken I have used the real time (which is the best case from 10
runs - except for the first Perl version) all the way back to the first
and fourth versions in Perl (473 and 12 minutes). However it does not
affect the ordering except to push my improved C version ahead of
William J.’ revised Ocaml version by just 0.029 of a second.

This late in the proceedings it would only confuse the issue to start
saying that the first Perl version ran in 251 minutes when 473 minutes
has been mentioned several times. Besides now that we are getting into
such short times for the Ocaml and C versions the difference between
real and user + sys is sub second and it is becoming harder to say that
one is significantly faster then the other. One more run of William
James’ program and it might make up that 0.029 of a second difference.

Perhaps stepping up to a 6 x 6 grid would allow more meaningful timings?

http://peterhi.dyndns.org/write_it_in_c/index.html

Peter H. wrote:

four chars being packed in where one bool would be more of the tests
will be using data that is available in the CPUs cache.

Sorry but I have to retract these timings. They were from the tests to
find the best gcc compiler settings and do not include the Perl pre
calculation for each run so they cannot be compared to any of the other
timings. Sorry about that.

I wish that you would dump Perl and do everything in C.

I’d be curious to see a python version, as long as we’re so far off the
ML
topic. This thread is about solving scripting language speed issues by
resorting to a different language, but python seems to benchmark
considerably better than ruby.

Anyone willing to take a crack at it?

Peter H. wrote:
-snip-

if any certified Java expert cares to show me how it should be done (cough
Charles O Nutter cough) then I will be happy to time it and check the
output and report the results. That way we can get away from all this
stupidity …
-snip

Now do you understand that when you wrote “C is still faster by an
order of magnitude” it had nothing to do with Java and had everything
to do with your “bogus” program.

In other words - Charles O Nutter was right, and you were wrong.

Depending on the machine hardware, we should expect a vanilla Java
implementation to take 1.5x - 2.5x longer than C for this problem.

On Sat, 05 Aug 2006, Jon H. defenestrated me:

Isaac G. wrote:

Depending on the machine hardware, we should expect a vanilla Java
implementation to take 1.5x - 2.5x longer than C for this problem.

Only if by “vanilla” you mean optimised Java.

Did you look at Isaac’s implementation? It is almost exactly the same
it
was before, but he is not printing out the latin squares as unicode
(which
none of the other impls are doing to my knowledge) and it is writing to
a BufferedReader (which any Java programmer with a couple months of
experience would/should do).

I think of things like manual loop unrolling and other similiar
techniques
as optimization techniques. The stuff he did was common sense and I
think
closer to the C impl in action (these languages are not the same – some
consideration should be taken in this).

My only question on this thread about Java is what Java is being used
for the test (used on the webpage – a good addition to the page)? I
ran
Java 1.4, Java 5 and Java 6 (beta but what the hell). The speedup was
surprising and exciting (obviously I do Java programming). I went from
1.1s to 0.84s between the versions.

-Tom

William J. wrote:

I wish that you would dump Perl and do everything in C.

I posted an OCaml implementation that used an imperative, array-based
permuter. We could just port that back to C (or find the site with the C
code that I translated).

Jon H. wrote:

Isaac G. wrote:

Depending on the machine hardware, we should expect a vanilla Java
implementation to take 1.5x - 2.5x longer than C for this problem.

Only if by “vanilla” you mean optimised Java.


Dr Jon D Harrop, Flying Frog Consultancy
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists

What things in that program do you consider optimizations?

I also provided a Java program with the same inner loop code as the
OCaml program but it hasn’t appeared on Peter H.'s website yet.

Thomas E Enebo wrote:

Only if by “vanilla” you mean optimised Java.

Did you look at Isaac’s implementation? It is almost exactly the same
it
was before, but he is not printing out the latin squares as unicode (which
none of the other impls are doing to my knowledge) and it is writing to
a BufferedReader (which any Java programmer with a couple months of
experience would/should do).

Yes.

I think of things like manual loop unrolling and other similiar
techniques
as optimization techniques.

They are all optimisations.

The stuff he did was common sense and I think
closer to the C impl in action (these languages are not the same – some
consideration should be taken in this).

These languages are not the same. However, using “ordinary” IO via
System.out.println or cout or print_endline is the “vanilla” way of
outputting in all of these languages. Of these languages, only Java
gives
appalling performance unless you switch to buffers.

Even when you do use buffers in Java, it is still slower than unbuffered
C++, OCaml etc. on my machine.

My only question on this thread about Java is what Java is being used
for the test (used on the webpage – a good addition to the page)?

I was using Sun’s J2SE 1.5. I also tried GNU’s GIJ, which is even slower
than OCaml’s bytecode.

Peter H. wrote:

sys 0m1.488s
This is a good point but in all the posts where I have mentioned the
time taken I have used the real time (which is the best case from 10
runs - except for the first Perl version) all the way back to the first
and fourth versions in Perl (473 and 12 minutes). However it does not
affect the ordering except to push my improved C version ahead of
William J.’ revised Ocaml version by just 0.029 of a second.

This late in the proceedings it would only confuse the issue to start
saying that the first Perl version ran in 251 minutes when 473 minutes
has been mentioned several times.

Having repeatedly made a mistake in the past is no reason to continue
making the same mistake in the future!

(You’re timing for Perl includes 3 or 4 hours of you surfing the web or
installing OCaml or whatever!)