RDig document processing error

Hi all,

I'm having problems using RDig:

With this rdig config…

cfg.crawler.start_urls = ['http://www.defensetech.org']
cfg.crawler.include_hosts = ['www.defensetech.org']
cfg.index.path = '/my/path/to/index'
cfg.verbose = true

…I get this output:

$ rdig -c config/rdig_config.rb
/usr/local/lib/site_ruby/1.8/ferret/index/term.rb:45: warning: method
redefined; discarding old text=
/usr/local/lib/site_ruby/1.8/ferret/search/sort_field.rb:69: warning:
instance variable @name not initialized
/usr/local/lib/site_ruby/1.8/ferret/search/sort_field.rb:69: warning:
instance variable @name not initialized
lib/ferret/query_parser/query_parser.y:128: warning: method redefined;
discarding old initialize
lib/ferret/query_parser/query_parser.y:157: warning: method redefined;
discarding old parse
lib/ferret/query_parser/query_parser.y:216: warning: method redefined;
discarding old clean_string
/usr/lib/ruby/gems/1.8/gems/rubyful_soup-1.0.4/lib/rubyful_soup.rb:230:
warning: method redefined; discarding old attrs
discovered content extractor class:
RDig::ContentExtractors::PdfContentExtractor
discovered content extractor class:
RDig::ContentExtractors::WordContentExtractor
discovered content extractor class:
RDig::ContentExtractors::HtmlContentExtractor
using Ferret 0.9.0
/usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:116: warning: instance
variable @patterns not initialized
/usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:105: warning: instance
variable @patterns not initialized
added url http://www.defensetech.org
fetching http://www.defensetech.org
waiting for threads to finish…
/usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:116: warning: instance
variable @patterns not initialized
/usr/local/lib/site_ruby/1.8/rdig/url_filters.rb:105: warning: instance
variable @patterns not initialized
added url http://www.defensetech.org
error processing document http://www.defensetech.org/: undefined local
variable or method `url' for #<RDig::HttpDocument:0xb7a7fbb4>
Trace:
/usr/local/lib/site_ruby/1.8/rdig/documents.rb:35:in `initialize'
/usr/local/lib/site_ruby/1.8/rdig/documents.rb:107:in `initialize'
/usr/local/lib/site_ruby/1.8/rdig/documents.rb:15:in `create'
/usr/local/lib/site_ruby/1.8/rdig/crawler.rb:68:in `add_url'
/usr/local/lib/site_ruby/1.8/rdig/crawler.rb:51:in `process_document'
/usr/local/lib/site_ruby/1.8/rdig/crawler.rb:50:in `process_document'
/usr/local/lib/site_ruby/1.8/rdig/crawler.rb:28:in `run'
/usr/local/lib/site_ruby/1.8/rdig/crawler.rb:25:in `run'
/usr/local/lib/site_ruby/1.8/rdig/crawler.rb:24:in `run'
/usr/local/lib/site_ruby/1.8/rdig.rb:258:in `run'
/usr/bin/rdig:14

If anyone could tell me why @patterns and url aren’t being set, I’d
really appreciate it.

I'm on Ubuntu 6.06, ruby 1.8.4, gems: rdig 0.3.0, rubyful_soup 1.0.4,
ferret 0.9.4

Many Thanks,
Steven

Hi Steven,

Sorry for replying so late, I'm quite busy at the moment.

The error you received was because of an invalid mailto: link
which rdig failed to handle correctly.

I just uploaded RDig 0.3.1, fixing this bug.
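
To illustrate the kind of guard that avoids this class of failure, here is a
minimal sketch in plain Ruby. It is not the actual change made in RDig 0.3.1;
the helper name crawlable_uri is hypothetical, and only the standard uri
library is assumed. It keeps absolute http(s) links and skips everything else,
including mailto: links that fail to parse.

```ruby
require 'uri'

# Hypothetical helper, not part of RDig: returns a parsed URI for absolute
# http(s) links and nil for anything a crawler should not try to fetch.
def crawlable_uri(href)
  uri = URI.parse(href.to_s.strip)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  # mailto: links parse as URI::MailTo and relative links as URI::Generic,
  # so both are rejected here (a real crawler would resolve relative links
  # against the page URL first).
  return nil unless uri.is_a?(URI::HTTP)
  return nil if uri.host.nil? || uri.host.empty?
  uri
rescue URI::Error
  # Malformed links (for example an invalid mailto:) are skipped
  # instead of aborting the processing of the whole document.
  nil
end
```

With this sketch, crawlable_uri('mailto:someone@example.com') returns nil, and
a link that cannot be parsed at all is treated the same way rather than
raising.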

In testing with your site I noticed that it takes quite a long time to parse
the index page, so you might have to set
cfg.crawler.wait_before_leave
to a higher value (20 worked for me) to prevent rdig from exiting before
the parser has finished with the index page.
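
As a sketch, the config from the original post with this setting added would
look roughly like the fragment below. The value 20 is simply the one that
worked in the test above; the right value for another site or machine is
something to tune.

```ruby
# Config fragment (same settings as in the original post, plus the wait).
cfg.crawler.start_urls        = ['http://www.defensetech.org']
cfg.crawler.include_hosts     = ['www.defensetech.org']
cfg.crawler.wait_before_leave = 20   # raised so the crawler does not exit early
cfg.index.path                = '/my/path/to/index'
cfg.verbose                   = true
```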

The parsing speed of RDig is really bad for big pages (your
index page weighs around 62kB). I'd happily accept a patch adding a
faster html content extraction mechanism for RDig users to choose from
;-)

Maybe even a special Ferret analyzer just stripping out any html tags
would do.
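
A rough sketch of the tag-stripping idea in plain Ruby follows. It is not a
Ferret analyzer and not part of RDig; the helper name strip_html is
hypothetical, and a regex-based approach like this is only an approximation
(it does not decode entities or cope with badly broken markup).

```ruby
# Hypothetical helper: crude regex-based HTML-to-text conversion.
def strip_html(html)
  # Drop script and style elements together with their contents.
  text = html.gsub(/<(script|style)\b[^>]*>.*?<\/\1>/mi, ' ')
  # Drop HTML comments.
  text = text.gsub(/<!--.*?-->/m, ' ')
  # Replace every remaining tag with a space so adjacent words stay apart.
  text = text.gsub(/<[^>]+>/, ' ')
  # Collapse runs of whitespace and trim the ends.
  text.gsub(/\s+/, ' ').strip
end
```

For example, strip_html('<p>Hello <b>world</b></p>') returns "Hello world".
Whether this is fast enough to help with a 62kB page would need measuring.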

Regards,
Jens

On Tue, Jul 25, 2006 at 11:28:13AM +0200, Steven S. wrote:

[...]



