[ANN] Multidimensional Array - MDArray 0.5.3

Sadaf_N · June 24, 2013, 7:42pm

Announcement

MDArray version 0.5.3 has just been released. MDArray is a multidimensional
array implemented for JRuby, inspired by NumPy (www.numpy.org) and Masahiro
Tanaka's NArray (narray.rubyforge.org). MDArray stands on the shoulders of
NetCDF-Java and Parallel Colt. At this point MDArray has libraries for
mathematical, trigonometric and descriptive statistics methods.

NetCDF-Java Library is a Java interface to NetCDF files
(http://www.unidata.ucar.edu/software/netcdf/index.html), as well as to many
other types of scientific data formats. It is developed and distributed by
Unidata (http://www.unidata.ucar.edu).

Parallel Colt
(http://grepcode.com/snapshot/repo1.maven.org/maven2/net.sourceforge.parallelcolt/parallelcolt/0.10.0/)
is a multithreaded (http://en.wikipedia.org/wiki/Thread_(computer_science))
version of Colt (http://dsd.lbl.gov/~hoschek/colt/). Colt provides a set of
Open Source Libraries for High Performance Scientific and Technical Computing
in Java. Scientific and technical computing is characterized by demanding
problem sizes and a need for high performance at reasonably small memory
footprint.

What's new:

Performance Improvement

In previous versions, array operations were done by passing a Ruby Proc to a
loop over all elements of the given arrays. For instance, adding two MDArrays
was done by passing Proc.new { |a, b| a + b } and looping through all elements
of the arrays. Procs are very flexible in Ruby; however, from my experience
with MDArray, they are also very slow.

In this version, when available, instead of passing a Proc to the loop, we
pass a native Java method. Available Java methods are those extracted from
Parallel Colt and listed below. Note that Parallel Colt has native methods for
the following types only: double, float, long and int. With this change, there
was a performance improvement of over 90%, and MDArray operations are now
close to native Java operations. We expect (but do not yet have benchmarking
data) that this brings MDArray performance close to similar solutions such as
NArray, NMatrix and NumPy (please try it, and if this assertion is false, I'll
be glad to change it in future announcements).

Methods not available in Parallel Colt but supported by Ruby, such as sinh,
tanh, and add for the byte type, are still supported by MDArray. Again, to
improve performance, instead of passing a Proc we now create a class as
follows:

            class Add
              def self.apply(a, b)
                a + b
              end
            end

This change brought a performance improvement of over 60% for MDArray
operations with Ruby methods.
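
The two dispatch styles can be sketched in plain Ruby. This is only an illustration: plain Arrays stand in for MDArrays, and ADD_PROC, elementwise_proc and elementwise_class are hypothetical names, not MDArray internals.

```ruby
# Hypothetical sketch: plain Arrays stand in for MDArrays.

# Old style: the operation is a Proc invoked once per element pair.
ADD_PROC = Proc.new { |a, b| a + b }

# New style: the operation is a class with a class-level apply method.
class Add
  def self.apply(a, b)
    a + b
  end
end

# Loop that calls a Proc for every pair of elements.
def elementwise_proc(x, y, op)
  x.each_index.map { |i| op.call(x[i], y[i]) }
end

# Loop that calls klass.apply for every pair of elements.
def elementwise_class(x, y, klass)
  x.each_index.map { |i| klass.apply(x[i], y[i]) }
end

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

p elementwise_proc(a, b, ADD_PROC)   # both calls give the same result
p elementwise_class(a, b, Add)
```

Both loops compute the same values; only the way the operation is invoked differs. The relative speed of the two styles depends on the runtime, so the figures quoted above should be read as measurements on JRuby only.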

Experimental Lazy Evaluation

Usual MDArray operations are done eagerly, i.e., if @a, @b, @c are three
MDArrays then the following:

            @d = @a + @b + @c

will be evaluated as follows: first @a + @b is performed and stored in a
temporary variable, then this temporary variable is added to @c. For large
expressions, temporary variables can have a significant performance impact.

This version of MDArray introduces lazy evaluation of expressions. Thus, when
in lazy mode:

            @lazy_d = @a + @b + @c

will not be evaluated immediately. Rather, the expression is preprocessed and
only executed when required. Since at execution time the whole expression is
known, there is no need for temporary variables, as the whole expression is
executed at once. To put MDArray in lazy mode we only need to set its mode to
lazy with the following command: MDArray.lazy = true. All expressions after
that are lazy by default. In lazy mode, MDArray resembles Numexpr; however,
there is no need to write the expression as a string, and there is no
compilation involved.
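
To make the idea of evaluating a whole expression without temporaries concrete, here is a toy model in plain Ruby. It is a sketch under assumptions, not MDArray's implementation: LazyExpr is a hypothetical class, plain Arrays stand in for MDArrays, only + is supported, and shapes are not checked.

```ruby
# Hypothetical toy model of lazy evaluation; not MDArray's implementation.
class LazyExpr
  def initialize(left, right)
    @left = left
    @right = right
  end

  # Building a bigger expression only records the operands.
  def +(other)
    LazyExpr.new(self, other)
  end

  # Value of the whole expression at one index, computed without
  # materializing any intermediate array.
  def at(i)
    value(@left, i) + value(@right, i)
  end

  # Evaluate: one pass over the indices, one result array.
  def []
    (0...size).map { |i| at(i) }
  end

  # Length of the leftmost plain Array in the expression.
  def size
    @left.is_a?(LazyExpr) ? @left.size : @left.length
  end

  private

  def value(operand, i)
    operand.is_a?(LazyExpr) ? operand.at(i) : operand[i]
  end
end

a = [1, 2, 3]
b = [10, 20, 30]
c = [100, 200, 300]

lazy_d = LazyExpr.new(a, b) + c   # nothing is computed yet
d = lazy_d[]                      # single pass over all three operands
p d
```

In this sketch the only array allocated during evaluation is the result itself, which is the property lazy mode is after.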

MDArray does not implement broadcasting rules as NumPy does. As a result,
trying to operate on arrays of different shapes raises an exception. In lazy
mode, this exception is raised only at evaluation time, so it is possible to
have an invalid lazy array. To evaluate a lazy array one should use the []
method as follows:

            @d = @lazy_d[]

@d is now a normal MDArray.

Lazy MDArrays are really lazy, so let's assume that @a = [1, 2, 3, 4] and
@b = [5, 6, 7, 8]. Let's also have @l_c = @a + @b. Now doing @c = @l_c[] will
evaluate @c to [6, 8, 10, 12]. Now, let's do @a[0] = 20 and then @d = @l_c[].
Now @d evaluates to [25, 8, 10, 12], as the new value of @a is used.
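
The re-evaluation behavior can be mimicked in plain Ruby with a lambda standing in for the lazy array. This is only an analogy, not MDArray code; note that changing the first element (index 0, since indices start at 0) is what produces the [25, 8, 10, 12] result.

```ruby
# Hypothetical illustration: a lambda stands in for a lazy MDArray.
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

# The "lazy array" keeps references to a and b, not copies of their values.
l_c = -> { a.zip(b).map { |x, y| x + y } }

c = l_c.call      # first evaluation
a[0] = 20         # change an operand after the lazy array was built
d = l_c.call      # second evaluation sees the new value

p c
p d
```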

Lazy arrays can be evaluated inside expressions:

            @l_c = (@a + @b)[] + @c

In this example, @l_c is a lazy array, but (@a + @b) is evaluated when the []
method is called and then added to @c. If the value of @a or @b is changed
now, the evaluation of @l_c will not change, unlike in the previous example.

Finally, laziness is contagious. So, let's assume that we have @l_c as above,
a lazy array, and we do MDArray.lazy = false. From this point on in the code,
operations will be done eagerly. Now doing @e = @d + @l_c, @e is a lazy array,
as its construction involves a lazy array. One should be careful when mixing
lazy and eager arrays in eager mode:

            @c = @l_a + (@b + @c)

then, with parentheses, first (@b + @c) is evaluated eagerly and then added
lazily to @l_a, giving a lazy array.

In this version, lazy evaluation ranges from around 40% less efficient than
eager evaluation on one machine I tested to approximately the same performance
on another, when only native Java methods (the Parallel Colt methods described
below) are used in the expression. If the expression involves any Ruby method,
evaluation of lazy expressions becomes much slower than eager evaluation. In
order to improve performance, I believe that compilation of expressions will
be necessary.

MDArray and SciRuby:

MDArray subscribes fully to the SciRuby Manifesto (http://sciruby.com/).

Ruby (http://www.ruby-lang.org/) has for some time had no equivalent to the
beautifully constructed NumPy (http://numpy.scipy.org/), SciPy
(http://www.scipy.org/), and matplotlib (http://matplotlib.sourceforge.net/)
libraries for Python (http://www.python.org/).

We believe that the time for a Ruby science and visualization package
has
come. Sometimes when a solution of sugar and water becomes
super-saturated,
from it precipitates a pure, delicious, and diabetes-inducing crystal of
sweetness, induced by no more than the tap of a finger. So is occurring
now, we believe, with numeric and visualization libraries for Ruby.

MDArray main properties are:

     Homogeneous multidimensional array, a table of elements (usually
     numbers), all of the same type, indexed by a tuple of positive integers;

     Easy calculation for large numerical multidimensional arrays;

     Basic types are: boolean, byte, short, int, long, float, double,
     string, structure;

     Based on JRuby, which allows importing Java libraries;

     Operators: +, -, *, /, %, **, >, >=, etc.;

     Functions: abs, ceil, floor, truncate, is_zero, square, cube, fourth;

     Binary Operators: &, |, ^, ~ (binary_ones_complement), <<, >>;

     Ruby Math functions: acos, acosh, asin, asinh, atan, atan2, atanh,
     cbrt, cos, erf, exp, gamma, hypot, ldexp, log, log10, log2, sin, sinh,
     sqrt, tan, tanh, neg;

     Boolean operations on boolean arrays: and, or, not;

     Fast descriptive statistics from Parallel Colt (complete list found
     below);

     Easy manipulation of arrays: reshape, reduce dimension, permute,
     section, slice, etc.;

     Reading of two dimensional arrays from CSV files (mainly for debugging
     and simple testing purposes);

     StatList: a list that can grow/shrink and that can compute Parallel
     Colt descriptive statistics;

     Experimental lazy evaluation (still slower than eager evaluation).

Descriptive statistics methods imported from Parallel Colt:

auto_correlation, correlation, covariance, durbin_watson, frequencies,
geometric_mean, harmonic_mean, kurtosis, lag1, max, mean, mean_deviation,
median, min, moment, moment3, moment4, pooled_mean, pooled_variance,
product, quantile, quantile_inverse, rank_interpolated, rms,
sample_covariance, sample_kurtosis, sample_kurtosis_standard_error,
sample_skew, sample_skew_standard_error, sample_standard_deviation,
sample_variance, sample_weighted_variance, skew, split, standard_deviation,
standard_error, sum, sum_of_inversions, sum_of_logarithms, sum_of_powers,
sum_of_power_deviations, sum_of_squares, sum_of_squared_deviations,
trimmed_mean, variance, weighted_mean, weighted_rms, weighted_sums,
winsorized_mean.
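
For readers unfamiliar with some of these names, the textbook definitions of a few of them can be written in plain Ruby as below. This is an illustration of the formulas only: MDArray delegates the real work to Parallel Colt, and the free-standing functions here are hypothetical helpers, not part of the MDArray API.

```ruby
# Textbook definitions of a few of the listed statistics, in plain Ruby.
# Illustration only: these helpers are not part of MDArray.

# Arithmetic mean: sum of the values divided by their count.
def mean(xs)
  xs.sum.to_f / xs.size
end

# Geometric mean: exp of the mean of the logarithms (positive values only).
def geometric_mean(xs)
  Math.exp(xs.sum { |x| Math.log(x) } / xs.size)
end

# Harmonic mean: count divided by the sum of reciprocals (nonzero values).
def harmonic_mean(xs)
  xs.size / xs.sum { |x| 1.0 / x }
end

# Root mean square: square root of the mean of the squares.
def rms(xs)
  Math.sqrt(xs.sum { |x| x * x }.to_f / xs.size)
end

data = [1.0, 2.0, 4.0]
puts mean(data)
puts geometric_mean(data)
puts harmonic_mean(data)
puts rms(data)
```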

Double and Float methods from Parallel Colt:

acos, asin, atan, atan2, ceil, cos, exp, floor, greater, IEEEremainder,
inv, less, lg, log, log2, rint, sin, sqrt, tan.

Double, Float, Long and Int methods from Parallel Colt:

abs, compare, div, divNeg, equals, isEqual (is_equal), isGreater
(is_greater), isLess (is_less), max, min, minus, mod, mult, multNeg
(mult_neg), multSquare (mult_square), neg, plus (add), plusAbs (plus_abs),
pow (power), sign, square.

Long and Int methods from Parallel Colt:

and, dec, factorial, inc, not, or, shiftLeft (shift_left), shiftRightSigned
(shift_right_signed), shiftRightUnsigned (shift_right_unsigned), xor.

MDArray installation and download:

     Install JRuby

     jruby -S gem install mdarray

MDArray Homepages:

     http://rubygems.org/gems/mdarray

     https://github.com/rbotafogo/mdarray/wiki

Contributors:

Contributors are welcome.

MDArray History:

     24/05/2013: Version 0.5.0 - Over 90% performance improvements for
     methods imported from Parallel Colt and over 40% performance
     improvements for all other methods (implemented in Ruby);

     16/05/2013: Version 0.5.0 - All loops transferred to Java with over 50%
     performance improvements. Descriptive statistics from Parallel Colt;

     19/04/2013: Version 0.4.3 - Fixes a simple, but fatal bug in 0.4.2. No
     new features;

     17/04/2013: Version 0.4.2 - Adds simple statistics and boolean
     operators;

     05/04/2013: Version 0.4.0 - Initial release.

rbermejo · June 24, 2013, 8:07pm

Wow, this is really excellent Rodrigo! Thank you for the extensive
post…I just tweeted about MDArray to spread the word.

The perf improvements sound excellent, and there’s even room to grow;
if we turned a few of your wrapper classes into JRuby extensions, we
could eliminate almost all of the Ruby-to-Java overhead and bump the
speed up even more. It’s nice to have a path forward to go even
faster, but MDArray is also an excellent demonstration of the power of
JRuby’s Java integration.

I am looking forward to seeing what else comes out of SciRuby this
summer. Does MDArray fall under the same family of projects, or is it
mostly independent?

Charlie


rbermejo · June 24, 2013, 9:41pm

Charles,

Thanks! I'm really surprised at how fast MDArray can be and I'm thrilled
with JRuby. JRuby really makes Ruby-to-Java integration easy and fast.
I look forward to possible extensions to JRuby to eliminate overhead.
Moving forward, I think I'll try to compile lazy expressions in order to
reap the benefits of eliminating temporaries.

This is an independent project, mostly done in my free time.

Keep up the good work with JRuby, it's a great project!

Cheers,

Rodrigo


rbermejo · August 3, 2013, 10:19pm

Hi, this looks pretty interesting for image/graphical processing and all
sorts of other geometry stuff (numpy is quite sophisticated for this). I
had a bit of a go at using MDArray with ruby-processing here:
Ruby Processing: Using MDArray in ruby-processing,
but I’m sure I can use it for more exciting stuff.

rbermejo · August 5, 2013, 6:10pm

Hi Martin,

Thanks for your post.

I think that an interesting feature of MDArray is the ability to get a
section from it. A section gets a subarray with the same backing store. Since
there is no data copying, the cost is very low. So, I'm thinking that you
could do some animation with it. For instance, in your example, you could have
another dimension, time = 10. Here is a small example that fills all frames of
the animation. My example data will probably not show anything nice though!

require 'mdarray'

def animation
  width = 5
  height = 6
  time = 10

  animation = MDArray.float([time, width, height])

  (0...time).each do |t|
    # Use section to get only one time frame. The last argument to section
    # is 'true', which gets a section with reduction, i.e., eliminates
    # dimensions of size 1. frame is a 2 dimensional MDArray.
    frame = animation.section([t, 0, 0], [1, width, height], true)

    # Fill frame with values for each time frame of the animation.
    (0...width).each do |w|
      (0...height).each do |h|
        frame[w, h] = t * w * h
      end
    end
  end

  # Return the filled array rather than the range from the loop.
  animation
end

More dimensions could be used, for instance, if you wanted to divide the
screen in quarters.

You could also have an animation with many characters.

animation = MDArray.float([time, characters, width, height])

Every character would be in its own dimension. You could work on every
character independently by getting their section:

char1 = animation.section([t, 1, 0, 0], [1, 1, width, height], true)

or getting all characters for a given time frame:

frame1 = animation.section([t, 0, 0, 0], [1, characters, width, height])
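
To show why a section is cheap, here is a toy model in plain Ruby of a view that shares its backing store with the full array. It is a sketch under assumptions, not how MDArray or NetCDF-Java is implemented: Grid and ToyAnimation are hypothetical classes, and the row-major flat buffer is an assumed layout.

```ruby
# Hypothetical toy model of a "section": a view over a shared flat buffer.
# Not MDArray's implementation; names and layout are assumptions.
class Grid
  attr_reader :data

  def initialize(rows, cols, data = Array.new(rows * cols, 0.0), offset = 0)
    @rows = rows
    @cols = cols
    @data = data        # backing store, possibly shared with a parent
    @offset = offset    # where this view starts inside the backing store
  end

  def [](r, c)
    @data[@offset + r * @cols + c]
  end

  def []=(r, c, value)
    @data[@offset + r * @cols + c] = value
  end
end

# A 3D "animation" stored as one flat buffer of time * width * height floats.
class ToyAnimation
  def initialize(time, width, height)
    @width = width
    @height = height
    @data = Array.new(time * width * height, 0.0)
  end

  # Frame t as a 2D view: no copying, just an offset into @data.
  def frame(t)
    Grid.new(@width, @height, @data, t * @width * @height)
  end
end

anim = ToyAnimation.new(10, 5, 6)
f3 = anim.frame(3)
f3[2, 4] = 3 * 2 * 4        # write through the view
p anim.frame(3)[2, 4]       # a fresh view of the same frame sees the write
p anim.frame(2)[2, 4]       # other frames are untouched
```

Creating a frame here costs one small object regardless of the frame size, because only the offset changes and the data is never copied.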

I would love to see this with processing!

Cheers,

Rodrigo

rbermejo · August 6, 2013, 9:18pm

Before I saw your reply I came up with a revised version of my Conway's
Game of Life using MDArray:
Ruby Processing: Conways Game of Life in ruby-processing (featuring MDArray).
I guess in this case instead of having two MDArray instances I could
have used just one. Unfortunately I’ve come unstuck with my idea of image
processing, since to use gem libraries with ruby-processing I need to
use an external jruby, whereas to get sketches with PImage (loads images
as pixel arrays), I seem to need to use the jruby-complete (that is
included with ruby-processing). I have had a bit of experience using numpy
and pyprocessing, which I thought could be interesting.

rbermejo · August 7, 2013, 7:21am

I think I might have solved the jruby-complete.jar vs external jruby
issue, which is completely our fault. The jruby-complete.jar is in our
classpath, whether using it directly by starting from java, or calling
from jruby (not a good idea). If I remove the jruby-complete.jar I can
now run sketches that require runtime libraries from that classpath. I
will need a workaround for non-jruby installs (which can’t access gems
anyway) and for exported apps.

rbermejo · August 6, 2013, 10:00pm

Martin,

Yes, you could use just one MDArray with one extra dimension for the
buffer, although the code is quite clear the way it is. In this case I
don't see much gain in it.

I don't really understand the problem with having to load jruby-complete.
If you cannot install gems, you could get the .rb files and .jar from
MDArray and work as if they were your own Ruby and jar files by just
configuring the path properly. All files are available in the gem
directory or can be directly downloaded from GitHub.

Rodrigo
