After a few optimization passes, I’ve finally figured out how to get
the optimal POSIX pangrams in about 30 minutes on a fast, modern
machine. I am still relatively new to Ruby so this application is
surely not as concise as it could be. To run it, save the two files,
create a posix-utilities.txt with the utilities, one per line, and run
the unit test.
– pangrams.rb
A set of routines to find minimal pangrams from an input set of
words. A pangram is a set of
words that contain all the letters of the alphabet. Unless there are
some properties of
pangrams that I am unaware of, finding the minimal pangrammatic
subset of a set of words
is in a family of constraint-satisfaction problems that is
NP-complete, as in, there is
not a way to find the minimal subset that is guaranteed to run in
polynomial time with respect
to the size of the dataset.
The most common ways of finding excellent, but not provably perfect
solutions are to either
use a non-deterministic algorithm, or to try to prune the number of
subsets that you search
as much as possible, and backtrack as often as possible, quitting
after a pre-determined
quality threshold has been reached.
The Pangrams class below implements both of these algorithms,
providing methods to yield a
specified number of random pangrams, or to do a backtracking search
to yield successively
better pangrams.
In order not to waste time trying non-pangrams, the strategy for
building pangrams is to
build a list by finding the least common letter not already there,
and then searching for pangrams
using the utilities containing the least-common missing letter. If at
any point the string
is longer or has more repeated characters than the known minimums
then we stop searching and
backtrack. So we abort early and try to limit the number of decisions
to try at each step.
The search algorithm does not run in polynomial time, but for 160
elements, it can search the
problem space in about 30 minutes on a fast, modern machine
(dual-core xeon). For larger datasets
(like aspell) the non-deterministic approach would be more practical.
Two minimal POSIX pangrams are:
Minimal number of utilities: "zcat stty jobs cxref newgrp iconv
cksum qhold"
Minimal repeated letters: “zcat tty jobs lex awk mv uniq chgrp df”
(4 repeats)
(This is one of the ones that Christian found, in a different
order)
Takes a list of words and creates a histogram of letters. The
histogram can then
be queried for pangramness and repeated letter counts.
class LetterHistogram
Initializes with a set of words
def initialize(words=nil)
@hist = Hash.new(0)
@total_letters = 0
words.each {|word| add word} unless words.nil?
end
Adds a word to the pangram
def add(word)
word.scan(/./) {|l| @hist[l] = @hist[l] + 1 if ('a'..'z') === l}
@total_letters += word.size
end
Returns true if this list of words has a pangram
def pangram?
return @hist.size == 26
end
Returns the number of repeated letter
def repeats
@total_letters - @hist.size
end
Returns a list of missing letters. Used in conjunction with the
WordLetterMap below to
find a good word to add to the list to make a pangram.
def missing_letters
missing = Array.new
('a'..'z').each {|l| missing << l if @hist[l] == 0}
missing
end
end
Contans a map of Letters => Words containing that letter
class WordLetterMap
def initialize(words)
@map = Hash.new
words.each {|w| add w}
end
Adds a new word
def add(word)
word.scan(/./) do |l|
if ('a'..'z') === l
@map[l] = Array.new unless @map.has_key? l
@map[l] = @map[l] << word
end
end
end
Return a list of words containing the least common missing letter
Used to limit the number of choices as we search the minimal
pangram space
def least_common(words=nil, histogram=nil)
histogram = LetterHistogram.new words if histogram.nil?
min_words = nil
histogram.missing_letters.each do |l|
new_words = @map[l] - words
if min_words.nil? || min_words.size > words.size
min_words = new_words
end
end
return min_words
end
end
Holds a list of words and generates minimal pangrams, both
non-deterministically,
and by searching, backtracking to avoid non-minimal branches
class Pangrams
Exception to throw when we have returned the maximum number of
pangrams
class AllDone < Exception
end
Initialize with a word list
def initialize(words=nil)
if words.nil?
@words = Array.new
else
@words = words
end
end
Adds a word to the set of words to produce pangrams
def add(word)
@words[@words.size] = word
end
The number of words in the set of words
def size
@words.size
end
Loads a list of words from a file
def self.from_file(filename)
p = Pangrams.new
File.new(filename).each_line {|line| p.add line.strip}
return p
end
Yields randomly generated minimal pangrams to the passed block
def random(count,&block)
@word_letters = WordLetterMap.new @words
(0..count).each {|i| random_pangram([],&block)}
end
Searches for good minimal pangrams, yielding the ones it finds to
the block
def search(max_count=0,&block)
@min_size = size
@min_repeats = 1000
@max_count = max_count
@count = 0
@word_letters = WordLetterMap.new @words
begin
pangram_search([],&block)
rescue AllDone
# Quit gracefully
end
end
private
searches the pangram space by finding all words containing the
least common letter until
a pangram has been built, passing each found pangram to the block.
The algorithm tries to
be smart searching the word space by choosing the least common
letters first, making the
overall space smaller. It also tries make smart use of backtracking
to avoid searching
non-lucrative branches.
def pangram_search(words, &block)
# Bust out if we've found enough pangrams
raise AllDone.new if @max_count != 0 && @count > @max_count
h = LetterHistogram.new words
# If we already have more words or more repeats, then no need to
look any
# further, we should backtrack and try something else.
return if words.size >= @min_size && h.repeats >= @min_repeats
# This pangram is somehow minimal, so pass to the block
if h.pangram?
@min_size = words.size if words.size < @min_size
@min_repeats = h.repeats if h.repeats < @min_repeats
@count += 1
yield words,h
return
end
# No pangram yet, find children and descend
new_words = @word_letters.least_common words,h
new_words.each {|w| pangram_search words + [w], &block}
end
Builds a pangram by finding the least common letter that is not
represented in the
passed-in array, and recursing until a pangram is built. This
algorithm should build
minimal pangrams almost all of the time (i.e. removing a word from
the set will make
the word list non-pangrammatic and will yield very good pangrams,
but probably not the
most optimal ones.
def random_pangram(words, &block)
# Do we already have a pangram? If so, pass to block and quit
h = LetterHistogram.new words
if h.pangram?
yield words,h
return
end
# Be non-deterministic, and descend on a random word
new_words = @word_letters.least_common words,h
new_word = new_words[rand(new_words.size)]
random_pangram words + [new_word], &block
end
end
– test_pangrams.rb
require ‘test/unit’
require ‘pangrams’
Unit test harness to test the histogram and the frequency map code,
and also to exercise the pangram
generation code. The tests assume a list of posix words, one per
line, in ‘posix-words.txt’
class TestPangrams < Test::Unit::TestCase
def test_histogram
l = LetterHistogram.new ['moon']
assert_equal false, l.pangram?
assert_equal 1, l.repeats
pangram = %w{the quick brown fox jumps over the lazy dog}
l = LetterHistogram.new pangram
assert_equal true, l.pangram?
l = LetterHistogram.new ['qwertyuiopasdfghjklzxcvbn'] # m is
missing
m = l.missing_letters
assert_equal 1, m.size
assert_equal 'm', m.first
end
def test_random_search
puts "-- Testing Random Pangram Generation --"
p = Pangrams.from_file 'posix-words.txt'
min_repeats_length = p.size
min_repeats = 1000
min_repeats_pangram = nil
min_size = 160
min_size_repeats = 1000
min_size_pangram = nil
count = 0
p.random(1000) do |p, hist|
repeats = hist.repeats
size = p.size
if repeats < min_repeats || repeats == min_repeats && size <
min_repeats_length
min_repeats_pangram = p
min_repeats = repeats
min_repeats_length = p.size
puts "New min-repeats pangram: #{min_repeats_pangram.join ' '}
with #{min_repeats} repeats"
$stdout.flush
end
if size < min_size || size == min_size && repeats <
min_size_repeats
min_size = size
min_size_repeats = repeats
min_size_pangram = p
puts "New min-size pangram: #{min_size_pangram.join ' '} with
#{min_size} words"
$stdout.flush
end
end
end
def test_backtracking_search
puts "--Testing backtracking search--"
p = Pangrams.from_file 'posix-words.txt'
min_repeats_length = p.size
min_repeats = 1000
min_repeats_pangram = nil
min_size = 160
min_size_repeats = 1000
min_size_pangram = nil
p.search(50) do |p, hist|
repeats = hist.repeats
size = p.size
if repeats < min_repeats || repeats == min_repeats && size <
min_repeats_length
min_repeats_pangram = p
min_repeats = repeats
min_repeats_length = p.size
puts "New min-repeats pangram: #{min_repeats_pangram.join ' '}
with #{min_repeats} repeats"
$stdout.flush
end
if size < min_size || size == min_size && repeats <
min_size_repeats
min_size = size
min_size_repeats = repeats
min_size_pangram = p
puts "New min-size pangram: #{min_size_pangram.join ' '} with
#{min_size} words"
$stdout.flush
end
end
end
end