Statistician II (#168)

I don’t know if it was the metaprogramming that scared people away
this week, or perhaps folks are away on summer vacations. In any case,
I’m going to summarize this week’s quiz by looking at the submission
from Matthias R. The solution is, as Matthias indicates,
unexpectedly concise. “I guess that’s just the way Ruby works.”

Matthias’ code implements the Statistician module in three parts,
each a class. Here is the first class, Rule:

class Rule
  def initialize(pattern)
    @fields = []
    pattern = Regexp.escape(pattern).
      gsub(/\\\[(.+?)\\\]/, '(?:\1)?').
      gsub(/<(.+?)>/) { @fields << $1; '(.+?)' }
    @regexp = Regexp.new('^' + pattern + '$')
  end

  def match(line)
    @result = if md = @regexp.match(line)
      Hash[*@fields.zip(md.captures).flatten]
    end
  end

  def result
    @result
  end
end

Rule makes use of regular expressions built up as discussed in the
previous quiz, so I’m not going to discuss that here. I will point
out, though, the initialization of the @fields member in the
initializer. Note the last gsub call: it uses the block form of
gsub.

gsub(/<(.+?)>/) { @fields << $1; '(.+?)' }

As the '(.+?)' string is the last expression evaluated in the block,
it provides the replacement text. Before that, however, the block
uses the just-matched group ($1) to record the field name. This avoids
making a second pass over the source string to get those field names,
and is arguably simpler.
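
To see the block form in isolation, here is a minimal sketch. The sample pattern string and the local variable names are made up for illustration; only the gsub call mirrors the solution.

```ruby
# Collect field names while replacing each <field> placeholder, in one pass.
fields = []
pattern = "<name> hits <target>".gsub(/<(.+?)>/) do
  fields << $1   # side effect: remember the field name just matched
  '(.+?)'        # the block's last value becomes the replacement text
end

p fields   # ["name", "target"]
p pattern  # "(.+?) hits (.+?)"
```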

The match method matches input lines against the regular expression,
returning nil if the input didn’t match, or a hash if it did. Field
names (@fields) are first paired (zip) with the matched values
(md.captures), then flatten-ed into a single array, finally
expanded (*) and passed to a Hash initializer that treats
alternate items as keys and values. The end result of Rule#match,
when the input matches, is a hash that looks like this:

{ 'amount' => '108', 'name' => 'Tempest Warg' }

That hash is returned, but also stored internally into member
@result for future reference, accessed by the last method, result.
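
The zip/flatten/splat chain can be tried on its own. This is a small sketch with hand-written stand-ins for @fields and md.captures, not output captured from the real solution.

```ruby
fields   = ['name', 'amount']          # stand-in for @fields
captures = ['Tempest Warg', '108']     # stand-in for md.captures

pairs  = fields.zip(captures)          # [["name", "Tempest Warg"], ["amount", "108"]]
record = Hash[*pairs.flatten]          # alternating items become keys and values

p record
```

On Ruby 2.1 and later, fields.zip(captures).to_h should build the same hash without the flatten and splat.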

The next class is Reportable:

class Reportable < OpenStruct
  class << self
    attr_reader :records

    def inherited(klass)
      klass.class_eval do
        @rules, @records = [], []
      end
      super
    end

    def rule(pattern)
      @rules << Rule.new(pattern)
    end

    def match(line)
      if rule = @rules.find { |rule| rule.match(line) }
        @records << self.new(rule.result)
      end
    end
  end
end

This small class is the extent of the metaprogramming going on in the
solution, and it’s not much, though perhaps unfamiliar to some. Let’s
get into some of it. We’ll ignore the OpenStruct inheritance for the
moment, coming back to it later.

Everything inside the Reportable class is surrounded by a block that
opens with class << self. There is a good summary on the Ruby T.
mailing list, but its use here can be summed up in two words:
class methods. The class << self mechanism is not strictly about
class methods, but in this context it has much the same effect.
Alternatively, these methods could have been defined in this manner:

class Reportable < OpenStruct
  def Reportable.rule(pattern)
    # etc.
  end

  def Reportable.match(line)
    # etc.
  end

  # etc.
end

In the end, the class << self mechanism is cleaner looking, and also
allows for use of attr_reader in a natural way.
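
As a rough illustration of that last point, the sketch below uses a hypothetical Counter class (not part of the quiz) to show attr_reader inside class << self exposing a class-level instance variable.

```ruby
class Counter
  class << self
    attr_reader :count        # reader for the class-level @count

    def increment
      @count = (@count || 0) + 1
    end
  end
end

Counter.increment
Counter.increment
p Counter.count  # 2
```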

The next interesting bit is the inherited method. This is a class
method, here implemented on Reportable, that is called whenever
Reportable is subclassed (which happens repeatedly in the client
code). It’s a convenient hook that allows the other bit of
metaprogramming to happen.

klass.class_eval do
  @rules, @records = [], []
end

klass is the class derived from Reportable (i.e. our client’s
classes for future statistical analysis). Here, Matthias initializes
two members, both to empty arrays, in the scope of class klass. This
serves to ensure that every class derived from Reportable gets its
own, separate members, not shared with other Reportable subclasses.
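
A stripped-down sketch of the hook shows the effect. Base, add, and the two subclasses here are illustrative stand-ins rather than the solution's code; only the inherited and class_eval combination is the same.

```ruby
class Base
  class << self
    attr_reader :records

    def inherited(klass)
      klass.class_eval { @records = [] }  # self is klass inside the block
      super
    end

    def add(item)
      @records << item
    end
  end
end

class Offense < Base; end
class Defense < Base; end

Offense.add(:hit)
p Offense.records                          # [:hit]
p Defense.records                          # []
p Offense.records.equal?(Defense.records)  # false
```

Each subclass receives its own array at the moment it is defined, before its class body runs.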

This could be done without metaprogramming, but would require effort
from the user.

class Reportable
  # class methods here
end

class Offense < Reportable
  @rules, @records = [], []
  # rules, etc.
end

class Defense < Reportable
  @rules, @records = [], []
  # rules, etc.
end

If the client forgot to initialize those two members, or got the names
wrong, the class wouldn’t work, exceptions would be thrown, cats and
dogs living together… you get the idea.

You might consider defining those data members in the Reportable
class itself, like so:

class Reportable
  @rules, @records = [], []

  # class methods, without inherited
end

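The problem with this is that class-level instance variables belong
only to the class object that defines them: a Reportable subclass
would find its own @rules and @records unset (nil). Switching to class
variables (@@rules and @@records) would instead make every subclass
share the same arrays. Neither is the desired outcome.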
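
A small experiment, using hypothetical Parent and Child classes, shows how state defined only in the parent behaves, both as a class-level instance variable and as a class variable.

```ruby
class Parent
  @rules   = []   # class-level instance variable: belongs to Parent alone
  @@shared = []   # class variable: one array seen by every subclass

  class << self
    attr_reader :rules
  end

  def self.shared
    @@shared
  end
end

class ChildA < Parent; end
class ChildB < Parent; end

p Parent.rules   # []
p ChildA.rules   # nil (ChildA never got its own @rules)

ChildA.shared << :a
p ChildB.shared  # [:a] (the class variable is shared)
```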

In the end, the class_eval used here, called from inherited, is
the right way to do things. It provides a way for the superclass to
inject functionality into the subclass.

Getting back to functionality, Reportable.match is straightforward,
but let me highlight one line:

@records << self.new(rule.result)

If you recall, result returns a hash of field names to values. And
Reportable is attempting to pass that hash to its own initializer,
of which none is defined. This is where OpenStruct comes in.

OpenStruct “allows you to create data objects and set arbitrary
attributes.” And OpenStruct provides an initializer that takes the
hash Matthias provides, and does the expected.

require 'ostruct'

data = OpenStruct.new({'amount' => '108', 'name' => 'Tempest Warg'})
p data.amount # → "108"
p data.name # → "Tempest Warg"

By subclassing Reportable from OpenStruct, all of the client’s
classes will inherit the same behavior, which fulfills many of the
requirements provided in the class specification.
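
The same holds one level down. In this sketch, Offense inherits directly from OpenStruct for brevity (in the solution it would go through Reportable), and the kind attribute is made up.

```ruby
require 'ostruct'

class Offense < OpenStruct; end

hit = Offense.new({ 'name' => 'Tempest Warg', 'amount' => '108' })
p hit.name         # "Tempest Warg"
p hit.amount.to_i  # 108

hit.kind = 'melee' # arbitrary attributes can still be added later
p hit.kind         # "melee"
```

Note that the captured values stay strings; any numeric work needs a conversion such as to_i.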

The final class, Reporter, is pretty trivial. It reads through a
data source a line at a time, finding a matching rule (and creating
the appropriate record in the process) or adding the input line to
@unmatched, which the client can query later.
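
Matthias' Reporter is not reproduced here, so the following is only a guess at its general shape. The constructor signature, the way record classes are handed in, and the Digits stand-in class are all assumptions made for this sketch.

```ruby
require 'stringio'

# Hypothetical sketch, not the submitted Reporter.
class Reporter
  attr_reader :unmatched

  # record_classes: anything responding to match(line) and returning
  # a truthy value when the line was recognized (assumed interface).
  def initialize(source, record_classes)
    @unmatched = []
    source.each_line do |line|
      line = line.chomp
      matched = record_classes.any? { |klass| klass.match(line) }
      @unmatched << line unless matched
    end
  end
end

# Stand-in for a Reportable subclass: accepts lines made only of digits.
class Digits
  @records = []
  class << self
    attr_reader :records

    def match(line)
      @records << line if line =~ /\A\d+\z/
    end
  end
end

reporter = Reporter.new(StringIO.new("123\nabc\n"), [Digits])
p Digits.records      # ["123"]
p reporter.unmatched  # ["abc"]
```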

Next week we’ll take a short break from the Statistician for some
simple stuff. (Part III of Statistician will return in the not-distant
future.)

I wanted to add one more note…

      klass.class_eval do
        @rules, @records = [], []
      end

Considering that this bit of code injects @rules and @records into
klass, my preference is that they be given less generic names. My
own, similar solution used @reportable_rules and
@reportable_records.

The reason? There is nothing preventing a client from further
extending their own subclasses of Reportable. In fact, I will lightly
encourage that in Part III. To avoid potential name conflicts with
client-side extensions, I’d go with names more complex than the simple
@rules and @records.
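
To make the risk concrete, here is a sketch of such a collision. This Reportable is a cut-down stand-in, and the client's max_damage setting is invented for the example.

```ruby
# Cut-down stand-in for Reportable: only the parts needed to show the clash.
class Reportable
  class << self
    def inherited(klass)
      klass.class_eval { @rules = [] }
      super
    end

    def rule(pattern)
      @rules << pattern
    end
  end
end

class Offense < Reportable
  # The client's own, unrelated use of the same name replaces the injected array.
  @rules = { max_damage: 500 }
end

error = nil
begin
  Offense.rule('<name> hits')
rescue NoMethodError => e
  error = e            # Hash has no << method, so rule now fails
end

puts "collision: #{error.class}"  # collision: NoMethodError
```

Had the injected variable been called @reportable_rules, the client's @rules would have been left alone.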