> have fun putting that together. to do it you need to render, not
> just parse, html!

It looks pretty easy to me. You've conveniently put all the noise
characters in a different colour.

Here's my two-minute solution:

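$ cat reader.rb
# Read the captcha page.
src = File.read("test.html")
# Blank out the noise characters (the #ccc spans), keeping their width.
src.gsub!(/<span [^>]*#ccc[^>]*>([^<]*)<\/span>/i) { " " * $1.size }
# Turn the remaining HTML spacing and line breaks back into plain text.
src.gsub!(/&nbsp;/, ' ')
src.gsub!(/<br\s*\/?>/i, "\n")
# Drop the enclosing <pre> tags.
src.gsub!(/<\/?pre[^>]*>/, '')
puts src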
$ ruby reader.rb
/ | | |
| | ___ ___ |__ __ _ _ __
| | / _ \ / _ \ | ’ \ / _ | | '__|
| | | () | | () | | |) | | (| | | |
|| / _ |. / _,| |_|

Of course you can keep changing your code, and I can keep changing mine.
But someone who took more than two minutes over this could come up with
a much more robust solution (e.g. dynamically working out the contrast
between foreground and background).

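To give a rough idea of what I mean by working out the contrast, here is a sketch. It assumes the colours are given inline as `color: #rgb` or `color: #rrggbb`, that the background colour is known, and that a brightness difference of 80 is a sensible cut-off. The method names and the threshold are my own inventions for illustration, not anything from your page.

```ruby
# Sketch: blank out any <span> whose text colour is too close to the
# background, instead of matching one hard-coded colour.
# Assumptions: inline "color: #rgb" or "color: #rrggbb" styles, a known
# background colour, and an arbitrary (untuned) threshold.

# Parse "#rgb" or "#rrggbb" into [r, g, b] integers in 0..255.
def parse_hex(colour)
  hex = colour.delete("#")
  hex = hex.chars.map { |c| c * 2 }.join if hex.size == 3
  hex.scan(/../).map { |pair| pair.to_i(16) }
end

# Rough perceived brightness (0..255) using the common 299/587/114 weights.
def brightness(colour)
  r, g, b = parse_hex(colour)
  (299 * r + 587 * g + 114 * b) / 1000.0
end

# Replace each coloured span by its text, or by spaces of the same
# width if the colour is within `threshold` of the background brightness.
def strip_low_contrast(src, background: "#ffffff", threshold: 80)
  bg = brightness(background)
  src.gsub(/<span [^>]*color:\s*(\#\h{3}(?:\h{3})?)[^>]*>([^<]*)<\/span>/i) do
    colour, text = $1, $2
    (brightness(colour) - bg).abs < threshold ? " " * text.size : text
  end
end
```

With that, changing the noise colour from #ccc to some other pale shade no longer helps; you would have to make the noise as dark as the real characters, at which point a human can't tell them apart either.
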
Anyway, once your code is deployed on a real live site, by someone other
than you, it becomes much harder to change. And the source is going to
be available to the attacker too.

> now, where i'm heading now, is using css and javascript so to
> position the image and characters within the image.

Hmm - this risks making the captcha viewable in fewer and fewer
browsers. OK, so lynx wouldn't be able to view a PNG captcha either; but
you risk locking out a lot of mobile devices, set-top boxes and other
embedded web browsers (which could otherwise display a PNG quite
happily).

However, perhaps ASCII-art generation (as a form of unusual and
disjointed character set) combined with server-side rendering to a PNG
would get around that issue, save you a lot of work in obfuscating the
HTML itself, and also be harder to parse.

> two other factors in favour of ascii art
> - there are tons of ocr programs out there available for free.
>   there are no ascii art recognition programs that i am aware of.

That's not because it's hard - it's because it's been totally pointless,
until now that is. If spammers start using ASCII art text, then there's
an incentive to make a reader. On the other hand, any E-mail which
contains something that looks like ASCII art could probably be
classified as spam on that basis alone.

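As a sketch of how crude such a classifier could be: flag any message containing a few consecutive lines made up mostly of "drawing" characters. The character set, the 60% ratio and the run length of three are all guesses of mine for illustration, not tuned values, and the method names are made up.

```ruby
# Heuristic sketch: does this text contain a run of consecutive lines
# that consist mostly of ASCII-art "drawing" characters?
# The character class and both thresholds are illustrative guesses.
ART_CHARS = %r{[|/\\_()'`.,=\#<>-]}

# True if at least min_ratio of the line's non-blank characters are
# drawing characters. Very short lines are ignored.
def art_line?(line, min_ratio: 0.6)
  visible = line.delete(" \t")
  return false if visible.size < 4
  visible.scan(ART_CHARS).size.to_f / visible.size >= min_ratio
end

# True if min_run or more consecutive lines look like ASCII art.
def looks_like_ascii_art?(text, min_run: 3)
  run = 0
  text.each_line do |line|
    run = art_line?(line.chomp) ? run + 1 : 0
    return true if run >= min_run
  end
  false
end
```

A real filter would feed this in as one signal among many rather than rejecting on it alone, since signatures and tables would trip it too.
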
ASCII art is, I believe, much more suited to machine reading than a
scanned printout. Most importantly, the characters will be on an exact
horizontal/vertical grid alignment, not rotated by a few degrees. And
also I suspect there will probably only be a handful of legible ASCII
art character sets to choose from.

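To show why the grid alignment matters, here is a toy reader. It assumes a known font in which glyphs are separated by a blank column; the three-letter FONT table below is made up for the example and is not any real figlet font. A real attack would load the handful of real fonts and also cope with glyphs that touch.

```ruby
# Toy sketch of reading grid-aligned ASCII art: cut the banner at blank
# columns and look each piece up in a table of known glyphs.
# FONT is a made-up three-row font, used here for illustration only.
FONT = {
  "o" => [" _ ", "| |", "|_|"],
  "l" => ["|", "|", "|"],
  "c" => [" _", "| ", "|_"]
}.freeze
GLYPHS = FONT.invert.freeze # glyph rows => letter

# Render a word as banner lines, with one blank column between glyphs.
def render(word)
  word.chars.map { |ch| FONT.fetch(ch) }.transpose.map { |row| row.join(" ") }
end

def blank_column?(column)
  column.all? { |ch| ch == " " }
end

# Decode banner lines back to text; unknown glyphs become "?".
def decode(lines)
  return "" if lines.empty?
  width = lines.map(&:size).max
  # Pad every line to the same width, then view the banner column by column.
  columns = lines.map { |line| line.ljust(width).chars }.transpose
  # Group consecutive columns, starting a new group wherever blank and
  # non-blank columns meet, then keep only the non-blank groups.
  groups = columns.slice_when { |a, b| blank_column?(a) != blank_column?(b) }
                  .reject { |group| blank_column?(group.first) }
  groups.map { |cols| GLYPHS.fetch(cols.transpose.map(&:join), "?") }.join
end
```

For example, decode(render("cool")) gives back "cool". There is no rotation, scaling or blur to undo, which is exactly what makes this so much easier than OCR on a scanned page.
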
Anyway, time will tell. If your captcha isn't widely used, then it may
remain strong enough for a reasonable time. (That's apart from the usual
attacks on captchas, such as redirecting them to other humans who are in
search of porn.)

Regards,

Brian.