Calculations on lists of numbers

[email protected] wrote:

/ …

the main reason that i use cat in real life in situations like this is because
hitting up arrow and changing mean -> sum, for instance, is easier! ;-)

Yes, same here. Also I think there may be an aesthetic factor at work – to
many, the pipe format just looks “nicer” than the redirection symbol.
Purely subjective, of course.

On Sun, 3 Dec 2006, Paul L. wrote:

Why drag in the cat when it’s utterly superfluous?

mean <list
sum <list
minmax <list

Yes, true, but in a simple example like this, ‘cat’ is just a stand-in for
some other application that would stream the numbers. In such a case, the
pipe seems more appropriate.

-a

Olivier wrote:

It is very fast, since it uses essentially no memory: the values are not
stored internally, just the sub-results (so a list of 2 values will use the
same amount of memory as a list of a billion values), and also because the
method that adds a value is generated depending on what stats you want to
compute.

What’s the secret to computing stdev in bounded space? The formulas I
know (I am not much of a statistician) require you to know the mean in
advance.

Do you do it in two passes through the data, first getting the mean and
then the stdev? (But this would not work if you are reading data from
stdin and don’t want to cache the data in memory.)

On 02.12.2006 05:02, [email protected] wrote:

and have never found one. right now i’m building a ruby version - before i
continue, does anyone know a standard unix or ruby version of this?

I am by no means an expert in numerical processing, but maybe bc or dc
can be made to function that way. Other than that, my first line of
defense would probably be awk - if you want to avoid using Ruby. :-)

Kind regards

robert

[email protected] wrote:

cat list | mean

[ara's quoted message and listc code arrived scrambled in the archive and
cannot be reconstructed here]
This is elegance?
To me it seems pompous, pretentious prolixity, but it probably appeals
to the bureaucratic mind which seeks a pretext for squandering
money taken by force from the citizens.

The simple way:

ops = {
  :sum    => proc{|a| a.inject{|x,y| x+y}},
  :mean   => proc{|a| a.inject{|x,y| x+y}/a.size.to_f},
  :min    => proc{|a| a.min},
  :max    => proc{|a| a.max},
  :minmax => proc{|a| "#{ a.min }:#{ a.max }"}
}
ops[:add] = ops[:sum]
ops[:avg] = ops[:mean]

op = ops[ ARGV.shift.to_sym ] or
  abort "op not in #{ops.map{|k,v| k}.join(',')}"

data = ARGF.to_a.map{|line| line.split.
  map{|s| Integer(s) rescue Float(s) rescue nil}.compact}
max_n = data.map{|a| a.size}.max
data = data.map{|a| a + [nil] * (max_n - a.size) }.transpose

puts data.map{|a| op.call( a.compact )}.join(' ')

Joel VanderWerf [email protected] writes:

What’s the secret to computing stdev in bounded space? The formulas
I know (I am not much of a statistician) require you to know the
mean in advance.

That would be strange.

Do you do it in two passes through the data, first getting the mean
and then the stdev? (But this would not work if you are reading data
from stdin and don’t want to cache the data in memory.)

No. There are several ways to do this. Basically you want to have
sqrt( sum( (x - sum(x)/n)**2 ) / (n-1) )
which is
sqrt( ( sum(x**2) - (sum(x))**2/n ) / (n-1) )
Now while you can sum this as you go, it is not numerically stable if
you are unlucky.
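For the record, the standard trick for keeping this stable in one pass is Welford's online algorithm, which carries the running mean and the running sum of squared deviations instead of the raw sum of squares. A minimal Ruby sketch (class and method names are mine, not from any code in this thread):

```ruby
# Welford's online algorithm: one pass, constant memory, numerically
# stable running mean / variance / standard deviation.
class RunningStats
  attr_reader :n, :mean

  def initialize
    @n = 0
    @mean = 0.0
    @m2 = 0.0  # running sum of squared deviations from the current mean
  end

  def add(x)
    @n += 1
    delta = x - @mean
    @mean += delta / @n
    @m2 += delta * (x - @mean)  # uses the mean both before and after update
    self
  end

  def variance  # sample variance (n-1 denominator), as in the formula above
    @n > 1 ? @m2 / (@n - 1) : 0.0
  end

  def stdev
    Math.sqrt(variance)
  end
end

s = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each{|x| s.add(x)}
# s.mean is 5.0; s.variance is 32/7 (sample variance)
```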

Paul L. wrote:

cat list | minmax
It’s a bit too late to disagree, in the face of the evidence that I said it,
then I did it.

I would like to know if a unix version exists, since it will
certainly be faster than ruby, run in less memory, and probably exist in
environments where ruby doesn’t. So I think the search is worthwhile.

Yes, all true, but that isn’t what you disagreed with.

I read your statement as

(It is so easy to create in Ruby, a matter of minutes) THEREFORE (it is
not terribly important to do the search you are suggesting)

I was in fact disagreeing with all three parts of your statement,
premise, deduction, and conclusion:

  • A matter of minutes is not enough. These things take a little more
    thinking, even in a programming language as elegant as ruby. For
    example, it took a little thought to notice the #to_f vs. #Float
    distinction. Also, some time should be spent on unit tests. I stumbled
    around with these sorts of things for close to an hour.

  • Even if the premise were true, it would not follow from this premise
    that a search is not important. What does the length of time it takes
    you to write a piece of code have to do with being aware of prior art?

  • In any case, a search is important. One should always be aware of
    prior art, for many reasons.

There’s nothing wrong with posting code without exhaustive analysis,
unit testing, and search of prior art. I do it all the time, but I try
to have a little humility about it.

Joel VanderWerf wrote:

What’s the secret to computing stdev in bounded space? The formulas I
know (I am not much of a statistician) require you to know the mean in
advance.
Standard deviation - Wikipedia

The right side (which comes from applying FOIL to the left side) can be
computed in one pass in bounded space.
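The identity Devin is pointing at, sum((x - mean)**2) == sum(x**2) - (sum(x))**2 / n, is easy to check numerically (throwaway example; the values are mine):

```ruby
# Check that the expanded ("one-pass") form of the sum of squared
# deviations matches the two-pass definition.
xs   = [1.0, 2.0, 4.0, 7.0]
n    = xs.size
sum  = xs.inject{|a,b| a + b}
mean = sum / n

two_pass = xs.map{|x| (x - mean)**2}.inject{|a,b| a + b}
one_pass = xs.map{|x| x * x}.inject{|a,b| a + b} - sum**2 / n
# both come out to 21.0 for this data
```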

Devin

On Saturday 2 December 2006 at 21:07, Joel VanderWerf wrote:

What’s the secret to computing stdev in bounded space? The formulas I
know (I am not much of a statistician) require you to know the mean in
advance.

Do you do it in two passes through the data, first getting the mean and
then the stdev? (But this would not work if you are reading data from
stdin and don’t want to cache the data in memory.)

Yes, to compute the standard deviation I need the mean. In fact, there
are dependencies between the data to compute:
- stddev needs the variance
- variance needs the mean, the number of values, and the sum of the
squared values
- mean needs the sum of the values and the number of values.

here is the hash I use for this (the number of values is always computed):

DEPENDANCIES = {
  :histogram => [:table],
  :mean      => [:sum],
  :variance  => [:sum, :square],
  :deviation => [:sum, :square],
  :median    => [:table],
  :skewness  => [:sum, :square, :cube],
  :kurtosis  => [:sum, :square, :cube, :quad],
} # :nodoc:

Then the method which adds a value is generated to compute these data
each time one is added. In this case, the generated method would be:

def add_pixel(value)
  @nb_px += 1
  @lum_sum += value
  @lum_square_sum += value**2
  return self
end

and the mean, variance and deviation methods are available at any time:

def mean
  return 0 if @lum_sum.zero?
  return @lum_sum / @nb_px.to_f
end

def variance
  return 0 if @lum_sum.zero?
  return @lum_square_sum / @nb_px.to_f - self.mean**2
end

def deviation
  return Math.sqrt(variance)
end

for each value v, we can compute any stat from the running sums of v,
v**2, v**3 and v**4 (kurtosis needs all of them, for example)

Et voilà :-)
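Olivier's power-sum scheme generalizes directly: keep the running sums of v, v**2, v**3 and v**4 and derive the higher moments from them. A hedged sketch of that idea (my own names and expansions, not Olivier's actual generated code; the same cancellation caveat David K. raises applies here too):

```ruby
# Skewness and kurtosis from raw power sums: one pass, constant memory.
class MomentStats
  def initialize
    @n = 0
    @s = [0.0, 0.0, 0.0, 0.0]  # running sums of v, v**2, v**3, v**4
  end

  def add(v)
    4.times{|i| @s[i] += v**(i + 1)}
    @n += 1
    self
  end

  def mean;     @s[0] / @n           end
  def variance; @s[1] / @n - mean**2 end  # population variance

  # population skewness: E[(v-m)**3] / sigma**3, expanded into raw moments
  def skewness
    m  = mean
    m3 = @s[2]/@n - 3*m*@s[1]/@n + 2*m**3
    m3 / variance**1.5
  end

  # excess kurtosis: E[(v-m)**4] / sigma**4 - 3
  def kurtosis
    m  = mean
    m4 = @s[3]/@n - 4*m*@s[2]/@n + 6*m**2*@s[1]/@n - 3*m**4
    m4 / variance**2 - 3
  end
end

ms = MomentStats.new
[1, 2, 3].each{|v| ms.add(v)}
# symmetric data, so the skewness comes out (numerically) zero
```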

David K. wrote:

Now while you can sum this as you go, it is not numerically stable if
you are unlucky.
Mind you, I skirted through numerical analysis in undergrad, so listen
to the man who sounds like he knows what he’s talking about. :-)

Devin

Joel VanderWerf wrote:

/ …

I was in fact disagreeing with all three parts of your statement,
premise, deduction, and conclusion:

  • A matter of minutes is not enough.

Certainly not for any kind of decent, robust application, I agree. I was
just making a point that something serviceable could be created in a
short time. Not something for the ages, but something that would meet
the original request.

You aimed higher. I aimed to show how easy it could be. We both met our
respective goals.

There’s nothing wrong with posting code without exhaustive analysis,
unit testing, and search of prior art. I do it all the time, but I try
to have a little humility about it.

I think I have just established my humility credentials, and I should
have done this before.

Devin M. [email protected] writes:

David K. wrote:

Now while you can sum this as you go, it is not numerically stable if
you are unlucky.
Mind you, I skirted through numerical analysis in undergrad, so listen
to the man who sounds like he knows what he’s talking about. :-)

A good on-the-fly method for calculating mean and variance is in “The
Art of Computer Programming, Vol. 2: Seminumerical Algorithms” by Donald
Knuth. And that guy indeed knows what he is talking about.

[email protected] writes:

cat list | minmax

the main reason that i use cat in real life in situations like this is because
hitting up arrow and changing mean -> sum, for instance, is easier! ;-)

-a

$ <list mean

Steve

[email protected] writes:

continue, does anyone know a standard unix or ruby version of this?

cheers.

-a

if you want others to be happy, practice compassion.
if you want to be happy, practice compassion. – the dalai lama

Sorry this doesn’t really answer your question, but…

When doing science with piles of columns of numbers in text files, I
have got endless mileage out of JDB[1].

It’s not only the collection of commands, it’s how they painlessly
cause a group to standardize on a single format for all program
output.

% cat db
#h x y
1 4
2 5
3 6
% cat db | dbstats -q 4 x | dblistize
#L mean stddev pct_rsd conf_range conf_low conf_high conf_pct sum sum_squared min max n q1 q2 q3
mean: 2
stddev: 1
pct_rsd: 50
conf_range: 2.4843
conf_low: -0.48434
conf_high: 4.4843
conf_pct: 0.95
sum: 6
sum_squared: 14
min: 1
max: 3
n: 3
q1: 1
q2: 2
q3: 3

| dbstats -q 4 x
    0.95 confidence intervals assume normal distribution and small n.
| dblistize

Steve

[1] JDB

William J. wrote:

This is elegance?

I can trace ara’s code in my head, and could probably find out what it’s
doing from reading the code and usage examples without knowing what it’s
supposed to do from the context of this thread or the identifiers. That
IS elegance with regards to coding.

To me it seems pompous, pretentious prolixity, but it probably appeals
to the bureaucratic mind which seeks a pretext for squandering
money taken by force from the citizens.

Right, +1 for making most anyone look up “prolixity” in a dictionary, but
the non sequitur, unfounded straw-man accusation that follows rather
nullifies the effect. Unless it’s a quote from a translation of some
marxist text Google can’t come up with.

The simple way:

The -terse- way. If I wanted functional style, I’d have used at least a
module with the methods for the different operations and #send - procs
are too convenient an excuse for one-liners. It also makes anything in
the toplevel binding unreachable by GC, although that’s probably not a
problem often - it could be depending on where the closures are created.
And it takes more effort to work out, which is a maintainability
problem.

David V.

On 12/2/06, William J. [email protected] wrote:

This is elegance?
To me it seems pompous, pretentious prolixity, but it probably appeals
to the bureaucratic mind which seeks a pretext for squandering
money taken by force from the citizens.

William, are you intentionally being confrontational and rude?

If so, please stop.

[email protected] wrote:

cat list | mean

Here’s another short version. This can handle very large files.

ops = {
  :sum    => proc{|cum,cur| (cum||0) + cur},
  :mean   => [ proc{|cum,cur| (cum||0) + cur},
               proc{|cum,count| Float(cum)/count} ],
  :min    => proc{|cum,cur| cum ? [cum,cur].min : cur},
  :max    => proc{|cum,cur| cum ? [cum,cur].max : cur},
  :minmax => proc{|cum,cur|
    cum ? [[cum[0],cur].min, [cum[1],cur].max] : [cur,cur] }
}
ops[:add] = ops[:sum]
ops[:avg] = ops[:mean]

op = ops[ ARGV.shift.to_sym ] or
  abort "op not in #{ops.map{|k,v| k}.join(',')}"

count = 0; cumulative = nil; values = []
ARGF.each_line{|line| count += 1
  values = line.split.map{|s|
    Integer(s) rescue Float(s) rescue nil}.compact
  cumulative ||= [nil] * values.size
  values.each_with_index{|val,i|
    cumulative[i] = Array(op)[0].call( cumulative[i], val ) }
}
puts cumulative.map{|x| op.class==Array ? op[1].call(x,count) : x}.
  map{|x| Array(x).join(":")}.join(" ")

[email protected] writes:

continue, does anyone know a standard unix or ruby version of this?
What about hacking http://rubyforge.org/projects/sss/ ?