scRUBYt! 0.3.4

Hey all,

I am pleased to announce that the long-awaited new release of scRUBYt!,
0.3.4, is available for download. A lot of bugs have been fixed and some
cool features scrubbed in, so be sure to check it out!

==========
scrubWHAT?

scRUBYt! is a very easy to learn and use, yet powerful Web scraping
framework based on Hpricot and mechanize (and from the next version, on
FireWatir!). Its purpose is to free you from the drudgery of web page
crawling, looking up HTML tags, attributes, XPaths, form names and other
typical low-level web scraping woes by figuring these out from your
examples copy’n’pasted from the Web page.

=========
Changelog

  • [NEW] Script pattern; possibility to evaluate a custom function on
    the input of the pattern
  • [NEW] Constant pattern; constant patterns can be added with the
    syntax: pattern 'Hello world', :type => :constant
  • [NEW] Text pattern; structure agnostic scraping based on labels and
    other textual clues
  • [NEW] new output method: to_flat_xml for creating feed-like flat XMLs
    instead of hierarchical ones
  • [NEW] to_flat_xml with spec delimiters splits up the concatenated hash
    results
  • [MOD] Change in the semantics of the “div[stuff]” style examples
    • divs which contain “stuff” (rather than only divs whose whole
      text is “stuff”) are matched
    • generalization is false by default
  • [NEW] Possibility to define an arbitrary delimiter for to_hash (used
    when the result contains commas)
  • [NEW/MOD] Changes in the logging module: (Credit: Tim F.)
    • Extract the logging into a class to allow for filtering

    • Allow the logger to be set to nil (to disable logging), and have
      this as the default.
      Logging now has to be explicitly enabled, as follows:

      Scrubyt.logger = Scrubyt::Logger.new

    • Allow loggers to point to streams other than STDERR.

  • [NEW/MOD] Changes in the download pattern:
    • Possibility to specify an array of files that should be ignored
      during downloading (e.g. ‘nopicture.gif’)
    • Handling timeout during downloads instead of crashing
    • Fixed downloading in case the filename contains no ‘.’
    • Fixed downloading for more URL types that were not working before
  • [NEW] New option: example_type. Possibility to force example type
    (instead of leaving it to scRUBYt! to guess)
  • [NEW] Entirely new test suite using rcov; tests are added
    continuously; the goal is to achieve full coverage
  • [FIX] Fixed the infamous regexp bug which caused the pricegrabber
    scenario (among other things) to fail
  • [FIX] Do not evaluate the detail pattern twice
  • [FIX] Fixed dependencies (namely parse_tree_reloaded) and corrected
    their versions
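
To make the [MOD] entry on “div[stuff]” style examples more concrete,
here is a rough plain-Ruby sketch of the difference between the old
"whole text equals the example" matching and the new "text contains the
example" matching. It is only an illustration: DIVS, exact_matches and
containing_matches are made-up names for this post and are not part of
scRUBYt!, and the hashes merely stand in for divs on a page.

```ruby
# Plain-Ruby illustration only; none of these names come from scRUBYt!.
# Each hash stands in for a <div> and the text it contains.
DIVS = [
  { :id => 'a', :text => 'stuff' },
  { :id => 'b', :text => 'some stuff and more' },
  { :id => 'c', :text => 'unrelated' }
]

# Old semantics: a div matches only when its whole text equals the example.
def exact_matches(divs, example)
  divs.select { |div| div[:text] == example }
end

# New semantics: a div matches when its text contains the example.
def containing_matches(divs, example)
  divs.select { |div| div[:text].include?(example) }
end

exact_ids = exact_matches(DIVS, 'stuff').map { |div| div[:id] }
containing_ids = containing_matches(DIVS, 'stuff').map { |div| div[:id] }

puts "exact:      #{exact_ids.inspect}"      # ["a"]
puts "containing: #{containing_ids.inspect}" # ["a", "b"]
```

In other words, an example that used to select only div 'a' would now
also pick up div 'b', so extractors relying on the old behaviour may
need a more specific example.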

=========
Read more

Some additional explanation about the new release can be found here:
http://scrubyt.org/a-hot-new-release-034-is-out-whats-new

============
In the works

Paul Nikitochkin created jscRUBYt!, which should solve the win32
problems by using the J-versions of the dependencies. I have been very
swamped recently, so I haven’t had much time to look into his code,
but I am sure this will be very helpful to a lot of you, so it’s on the
short-term TODO list.

Glenn G. has almost finished firescRUBYt! - scRUBYt! on FireWatir,
which uses FireWatir as the agent (rather than mechanize) to navigate
and extract data from the web page. I think this is the coolest
addition in scRUBYt!'s history, since it enables scraping of pages
containing AJAX/JavaScript and/or other tricks which were not possible
to work around with mechanize, and makes it easy to parse pages which
caused Hpricot to choke and gag…

=============================
Would you like to contribute?

  • If you are a coder and would like to be part of the development
    team, contact us at scrubyt[‘maps-on’.reverse]@scrubyt.org
  • If you’d like to contribute to the documentation/how-tos/tutorials,
    check out the wiki at http://wiki.scrubyt.org.
  • If you found a bug, have suggestions or feature requests, please use
    scRUBYt!'s lighthouse tracker at http://scrubyt.lighthouseapp.com
  • If you’d like to discuss or propose features, get some help or would
    like to check out and learn from the problems of others, visit the forum
    at http://agora.scrubyt.org
  • If none of the above applies, but you would still like to tell us
    something, bring us champagne/chocolates, poke Glenn to finish
    FireWatir faster or whatever else, contact us at
    scrubyt[‘maps-on’.reverse]@scrubyt.org

H4ppy scrubbing,
Peter
__
http://www.rubyrailways.com
http://scrubyt.org