Vehicle Counters (#146)

bbazzarrakk · November 15, 2007, 1:57pm

This is a cool little data mining problem. Unfortunately, it is a fair
bit of
work to get a good solution. Justin Either sent in the only full
solution and
even he chose to cut a few corners. His code does produce almost 200
lines of
interesting feedback when run on the sample data set though, so it’s fun
to play
around with. The downside is that it’s close to 300 lines of code.
Because of
that, I’m going to try giving a pretty high-level overview of how it
works.

Let’s work backwards this time and just follow the code:

…

if ARGV.size < 1
puts “Usage: vehicle_counter.rb datafile [-avg]”
else
vc = VehicleCounter.new
vc.process(ARGV[0], ARGV[1] != nil)
end

This is the code that actually gets executed when you run the program.
I always
call that the application code.

Here we see a simple branch that either presents a usage message or
builds a
VehicleCounter and calls process() on it. This could actually be just a
module,
since it doesn’t maintain any state. It may help you to think of the
following
code that way:

…

class VehicleCounter
# …

def process(file, report_averages)
  raw_data = parse(file)
  for sensor in 0..1
    sensor_report = DirectionDataReport.new(raw_data[sensor])
    sensor_report.report(sensor, report_averages)
  end
end

# ...

end

…

This defines the overall process of this program which is just two
steps: read
in the data and build a report. The parse() method manages the reading
step
while reporting is encapsulated in DirectionDataReport objects. We will
begin
with the reading:

# ...

Dir_A = 0
Dir_B = 1

def parse(file)
  times = []
  dirs = [[], []]

  f = File.open(file)
  f.each_line do |line|
    sensor, time = parse_record(line)
    times << time

    if (times.size % 2) == 0
      if (times.size == 2 and sensor == Dir_A) or
         (times.size == 4 and sensor == Dir_B)

         # Remove "B" records from second direction
         times = [times[0], times[2]] if sensor == Dir_B

         dirs[sensor] << times
         times = []
       elsif (times.size == 4 and sensor != Dir_B)
        puts "Parse error"
        times = []
       end
    elsif (times.size % 2) == 1 and sensor == Dir_B
      puts "Parse error - Unexpected B record"
    end
  end
  f.close

  dirs
end

def parse_record(data)
  unpacked = data.unpack("a1a*")
  return unpacked[0] == "A" ? Dir_A : Dir_B, unpacked[1].to_i
end

# ...

This parser is what I meant by cutting corners earlier. The data
provided is
quite simple and you don’t have to worry about cars going both
directions at the
same time when working with it. Given that, a pair of A times is a
northbound
car and a set of A, B, A, B times is a southbound car. This parse just
hunts
for those sets.

Both sets of times are maintained in the Array of Arrays called dirs.
The
constants are used to select the right group based on the indicator
found in
unpack()ing the last record of the group. These two Arrays are the
ultimate
results of this method.

Note that the file reading loop could be reduced to a File.foreach()
call.

The rest of the code is reporting. We saw a DirectionDataReport
initialized in
the process() method and then witnessed a call to report() on that
object.
These two steps are the primary interface:

MSecPerMin = 1000 * 60
MSecPerHour = MSecPerMin * 60
InchesPerMile = 63360

class DirectionDataReport
Verbose = false
Sensors = %w(Northbound Southbound)
Days = %w(Mon Tue Wed Thu Fri Sat Sun)

def initialize(raw_data)
  @raw_data = raw_data
end

def report(sensor, report_averages)
  puts "Direction: #{Sensors[sensor]}"
  puts "Total Cars: #{self.total_count}"

  report_time_periods(sensor, report_averages, MSecPerHour * 12)
  report_time_periods(sensor, report_averages, MSecPerHour)
  report_time_periods(sensor, report_averages, MSecPerHour / 2)
  report_time_periods(sensor, report_averages, MSecPerHour / 4)
  puts ""
end

def total_count()
  @raw_data.size
end

# ...

You can see here that the constructor just squirrels away the data from
the
parse. The report() method, on the other, hand drives the output
process.
After reporting the direction and a total car count, it displays time
oriented
breakdowns for various scales: 12 hours, one hour, 30 minutes, and 15
minutes.

Here is the report_time_periods() method and three of the methods it
relies on:

# ...

def report_time_periods(sensor, report_averages, time_period_length)
  days = create_time_periods(time_period_length)
  num_time_periods = (MSecPerHour * 24) / time_period_length

  counts = count_per_time_period(days)
  avg_speeds = speed_per_time_period(days)
  avg_dists = dist_per_time_period(days)

  puts("\nTime Interval: #{time_period_length/MSecPerMin} Minutes")
  if (num_time_periods > 2)
    peaks = find_peak_times(days)
    puts("Peak Times")
    for i in 0...peaks.size
      printf("#{Days[i]}:")
      peaks[i].size.times {|p|
        printf(format_time_interval_index(peaks[i][p][1],
                                          time_period_length))
      }
      puts ""
    end
  end

  puts("Statistics")
  printf("Data    ")
  printf("\tDay") if not report_averages

  num_time_periods.times{|i|
    printf(format_time_interval_index(i, time_period_length))
  }
  report_column_data(days, num_time_periods, report_averages,

counts,
report_averages ? “Avg Count” : "Count ", “%
5d”)
report_column_data(days, num_time_periods, report_averages,
avg_speeds,
“Avg Speed”, “%02.02f”)
report_column_data(days, num_time_periods, report_averages,
avg_dists,
"Avg Dist ", “%02.02f”)
puts “”
end

def create_time_periods(time_period_length = MSecPerHour)
  days = []
  time_periods = nil
  prev_start = @raw_data[0][0] + 1

  for data in @raw_data
    if prev_start > data[0]
      days << time_periods if time_periods != nil

      cur_time_period = 0
      time_periods = [[]]

      puts "New day: data=#{data[0]}" if Verbose
    elsif data[0] > ((cur_time_period + 1) * time_period_length)
      while data[0] > ((cur_time_period + 1) * time_period_length)
        cur_time_period += 1
        time_periods[cur_time_period] = []
      end

      puts "New time period: data=#{data[0]}" if Verbose
    end

    time_periods[cur_time_period] << data
    prev_start = data[0]
  end

  days << time_periods
  days
end

def format_time_interval_index(index, time_period_length)
  sprintf("\t%02d:%02d",
      index * time_period_length / MSecPerHour,
      (index * time_period_length / MSecPerMin) % 60)
end

def report_column_data(days, num_time_periods, report_averages,

data,
data_label, format_string)
if report_averages
printf("\n#{data_label}")
for time in 0…num_time_periods
avg = 0
days.size.times {|day| avg += data[day][time] }
printf("\t#{format_string}", avg / days.size)
end
else
for day in 0…days.size
printf("\n#{data_label}\t#{Days[day]}")
for time in 0…num_time_periods
printf("\t#{format_string}", data[day][time])
end
end
end
end

# ...

There is a lot of code here obviously, but none of it is very complex.
First,
report_time_periods() breaks the data down into time intervals using
create_time_periods(). That method is just a partition tool for the
data. Once
it’s divided out, a few utility method we will examine next are used to
separate
the interesting information out.

After that, the entire rest of the of the method is printing code for
what it
found. (The peak data section actually calculates, again using another
helper,
and prints for appropriate data ranges.) Most of the print work is
delegated to
the bottom two methods and much or their magic is for printing columnar
data.

Here are the helpers I glossed over and the helpers they rely on:

# ...

def find_peak_times(days, num_peaks=4)
  days.map do |day|
    find_daily_peak_times(day, num_peaks)
  end
end

def find_daily_peak_times(daily_time_periods, num_peaks)
  peaks = []
  daily_time_periods.size.times {|i|
    peaks << [daily_time_periods[i].size, i]
  }
  peaks.sort.reverse.slice(0, num_peaks).sort {|a,b| a[1]<=>b[1]}
end

def count_per_time_period(days)
  days.map do |day|
    day.map {|time_period| time_period.size}
  end
end

def speed_per_time_period(days)
  days.map do |day|
    day.map {|time_period| calc_average_speed(time_period) }
  end
end

def dist_per_time_period(days)
  days.map do |day|
    day.map {|time_period| calc_average_distance(time_period) }
  end
end

def calc_average_speed(time_period)
  return 0 if time_period.size == 0

  sum = 0
  for time in time_period
    sum += calc_speed(time[0], time[1])
  end
  sum / (time_period.size)
end

def calc_speed(start_time, end_time)
  return (100.0 / (end_time - start_time)) *
         MSecPerHour / (InchesPerMile * 1.0)
end

def calc_average_distance(time_period)
  return 0 if time_period.size <= 1 # Need at least 2 cars
  sum = 0
  for i in 0...(time_period.size - 1)
    sum += calc_distance(time_period[0], time_period[1])
  end
  return sum / (time_period.size - 1)
end

def calc_distance(leader_time, follower_time)
  follower_speed = calc_speed(follower_time[0], follower_time[1])

  dist = follower_speed * ((follower_time[0] - leader_time[0]) /
         (MSecPerHour * 1.0))

  return dist
end

end

…

There are three kinds of helpers here. The peak methods just sort the
data and
take so many entries off the top. The per_time_period() method are
wrappers
that apply the calc() methods over periods of data. Which means the
calc_*()
methods are where the action is. They figure out car speeds, distances
between
cars, and the averages for both using simple math. This is the primary
data
that we see in the report output.

Again it’s a lot of code to cover, but Justin has produced a great end
result.
I’m not going to include the output here because it too is large, but go
run the
program and take a look at what it builds.

My thanks to Justin for the heroic effort and Gavin for the interesting
problem.

The Ruby Q. began with an encoding problem that came out of a book and
tomorrow we have another one for you…