Hi everyone,
I’ve been exploring using Ferret to index large amounts of production log files. Right now we have a homemade system for searching the logs: you specify a date/time range and then grep through the relevant files, which can take a long time.
My initial tests (on 2 GB of log files) have been promising. I’ve taken two separate approaches:
The first is loading each line of each log file as a “document”. The plus side is that a search returns individual log lines as the results, which is what I want. The downside is that indexing takes a very long time and the index is very large even when not storing the contents of the lines, so this approach is not viable for indexing all of our logs.
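For reference, the per-line version is roughly the following sketch (the paths, field names and query are made up for illustration, not my actual setup):

require 'rubygems'
require 'ferret'

# One Ferret document per log line. (Simplified: in my real test the
# line contents are indexed but not stored, to keep the index smaller.)
index = Ferret::Index::Index.new(:path => '/tmp/log_index_by_line')

Dir['/var/log/myapp/*.log'].each do |path|
  File.open(path) do |f|
    f.each_with_index do |line, i|
      index << {:file => path, :line_no => i + 1, :content => line.chomp}
    end
  end
end

# A search comes back as individual lines:
index.search_each('content:"connection timed out"') do |doc_id, score|
  doc = index[doc_id]
  puts "#{doc[:file]}:#{doc[:line_no]}"
end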
The second approach is indexing whole log files as documents. This is relatively fast (211 seconds for 2 GB of logs) and the index size is a nice 12% of the sample size. The downside is that after figuring out which files match your search terms, you still have to crawl through each “hit” document to find the relevant lines.
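That version looks more or less like this (again, paths, field names and the query are just for illustration, and the FieldInfos setup is from memory):

require 'rubygems'
require 'ferret'

# One Ferret document per log file. Index the contents but only store
# the path, so the index stays small relative to the raw logs.
field_infos = Ferret::Index::FieldInfos.new
field_infos.add_field(:path, :store => :yes, :index => :untokenized)
field_infos.add_field(:content, :store => :no, :index => :yes)

index = Ferret::Index::Index.new(:path => '/tmp/log_index_by_file',
                                 :field_infos => field_infos)

Dir['/var/log/myapp/*.log'].each do |path|
  index << {:path => path, :content => File.read(path)}
end

term = 'connection timed out'
index.search_each(%{content:"#{term}"}) do |doc_id, score|
  hit = index[doc_id][:path]
  # Second pass: crawl the hit file to find the actual matching lines.
  File.open(hit) do |f|
    f.each_with_index do |line, i|
      puts "#{hit}:#{i + 1}: #{line}" if line.include?(term)
    end
  end
end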
For the sake of full disclosure: at any given time we keep roughly 30 days of logs, which comes to about 800 GB of log files. Each file is roughly 15 MB in size before it gets rotated.
Has anyone else tackled a problem like this and can offer any ideas on how to go about searching those logs? The best idea I can come up with (which I haven’t implemented yet, so I don’t have real numbers) is to index a certain number of log files by line, say the last 2 days, and then do another set by file (say the last week). This would give fast results for the more recent logs, and you would just have to be patient for the slightly older ones.
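Something along these lines, where the 2-day cutoff and everything else (paths, field names, using mtime to judge a file's age) is only there to show the shape of the idea:

require 'rubygems'
require 'ferret'

# Hybrid sketch: route each file to one of two indexes based on its age,
# then query both. Recent hits come back as individual lines; older hits
# only tell you which files still need a grep pass.
TWO_DAYS = 2 * 24 * 60 * 60
line_index = Ferret::Index::Index.new(:path => '/tmp/log_index_by_line')
file_index = Ferret::Index::Index.new(:path => '/tmp/log_index_by_file')

Dir['/var/log/myapp/*.log'].each do |path|
  if Time.now - File.mtime(path) <= TWO_DAYS
    File.open(path) do |f|
      f.each_with_index do |line, i|
        line_index << {:file => path, :line_no => i + 1, :content => line.chomp}
      end
    end
  else
    file_index << {:path => path, :content => File.read(path)}
  end
end

term = 'connection timed out'
line_index.search_each(%{content:"#{term}"}) do |id, _|
  doc = line_index[id]
  puts "#{doc[:file]}:#{doc[:line_no]}"               # immediate answer
end
file_index.search_each(%{content:"#{term}"}) do |id, _|
  puts "grep still needed: #{file_index[id][:path]}"  # slower second pass
end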
Any ideas/help?
Thanks,
Chris