I have a MySQL table with over 18 million records in it. We are
indexing about 10 fields in this table with Ferret.
I am having problems with the initial building of the index. I created
a rake task to run the “Model.rebuild_index” command in the background.
That process ran fine for about 2.5 days before it just suddenly
stopped. The log/ferret_index.log file says it got to about 28% before
ending. I’m not sure whether the process died because of something on
my server or because of something related to Ferret.
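For what it’s worth, the rake task is nothing special; it’s roughly the
following sketch (the task and model names here are just illustrative):

# lib/tasks/ferret.rake
namespace :ferret do
  desc 'Rebuild the Ferret index from scratch'
  task :rebuild => :environment do
    Model.rebuild_index   # class method added by acts_as_ferret
  end
end

I start it in the background with something like “nohup rake
ferret:rebuild &” so it keeps running after I log out.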
At the current rate it looks like it will take close to 10 days for the
full index to be built with rebuild_index. Is this normal for a table
of this size?
Also, is there a way to resume from where the index left off instead of
having to rebuild the entire index from scratch? I got about 28% of the
way through, so I’d rather not waste another 2.5 days rebuilding that
part just to get the full index 100% built.
Also, is there a way I can non-destructively rebuild the index, since
it didn’t complete 100%? Meaning, can I rebuild it without overwriting
what is already there? That way I could keep searching what I have
while the rebuild takes place, and then move the new index over the old
one (see the sketch below). I’m not running Ferret as a DRb server, so
I don’t know if I can.
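Something like the following is what I have in mind: a sketch that
builds a fresh index in a separate directory, walking the table in
batches, and then swaps it over the live one. The paths, batch size,
and field names are all illustrative, and I’m assuming :id and
:class_name are the fields acts_as_ferret needs to map documents back
to records:

# Run inside the Rails environment, e.g. via script/runner.
require 'fileutils'

build_dir = "#{RAILS_ROOT}/index/rebuild/model"
live_dir  = "#{RAILS_ROOT}/index/#{RAILS_ENV}/model"

index = Ferret::Index::Index.new(:path => build_dir, :create => true)

offset = 0
batch  = 1000
loop do
  rows = Model.find(:all, :order => 'id',
                    :limit => batch, :offset => offset)
  break if rows.empty?
  rows.each do |r|
    # Store the same fields acts_as_ferret would; :id (plus :class_name
    # for shared indexes) is how it finds the records again.
    index << { :id => r.id.to_s, :class_name => r.class.name,
               :title => r.title, :body => r.body }
  end
  offset += batch
end

index.optimize
index.close

# Swap the finished index over the old one.
FileUtils.mv(live_dir, "#{live_dir}.old")
FileUtils.mv(build_dir, live_dir)

Since it walks the table by offset, it could also be restarted at the
offset where a previous run died instead of starting from zero.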
Also, is there a faster or better way I can or should be building the
index? And will index file size be an issue with a DB this size?
We have a 1 million record index that is about 6GB in size. We build
it in parallel without AAF, so it’s hard to comment on the speed of
your index build. However, I will say that I did need to manually
patch Ferret to better handle large indexes.
Here is the diff:
--- /usr/local/lib/ruby/gems/1.8/gems/ferret-0.11.4/ext/index.c
+++ index.c
@@ -1375,7 +1375,7 @@
     lazy_doc = lazy_doc_new(stored_cnt, fdt_in);
     for (i = 0; i < stored_cnt; i++) {
@@ -1449,7 +1449,7 @@
     if (store_offsets) {
         int num_positions = tv->offset_cnt = is_read_vint(fdt_in);
         Offset *offsets = tv->offsets = ALLOC_N(Offset, num_positions);
@@ -1683,8 +1683,8 @@
         int last_end = 0;
         os_write_vint(fdt_out, offset_count); /* write shared prefix length */
         for (i = 0; i < offset_count; i++) {
-            int start = offsets[i].start;
-            int end = offsets[i].end;
+            off_t start = offsets[i].start;
+            off_t end = offsets[i].end;
             os_write_vint(fdt_out, start - last_end);
             os_write_vint(fdt_out, end - start);
             last_end = end;
@@ -4799,7 +4799,7 @@
 *
 ****/
-Offset *offset_new(int start, int end)
+Offset *offset_new(off_t start, off_t end)
 {
     Offset *offset = ALLOC(Offset);
     offset->start = start;
Erik M. wrote:
> We have a 1 million record index that is about 6GB in size. We build
> it in parallel without AAF, so it’s hard to comment on the speed of
> your index build. However, I will say that I did need to manually
> patch Ferret to better handle large indexes.
Erik,
What issues did you find that caused you to patch the Ferret code?
Also, you say you build the index in parallel without AAF; how do you
do that? I’m not sure I follow, so if you can explain, I’d appreciate
it.
We had to patch it because we were getting seemingly random errors
while searching a 2GB+ index. This is the trac ticket:
http://ferret.davebalmain.com/trac/ticket/215. The patch I included
changes some ints to off_t’s, which solved the problem. As far as I
know this patch was never applied to the trunk.
We build our index using a modified version of RDig. We basically run
up to 80 EC2 servers in parallel to create 80 separate indexes, which
we later combine into a single index (see the sketch below). You could
follow a similar route and still have AAF manage the index after it is
built. You’d need to make sure that the documents created by
RDig/whatever have the same fields that AAF expects.
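The combine step is just Ferret’s normal index-merging API. Roughly
(the paths are illustrative; check your Ferret version’s IndexWriter
rdoc for add_readers):

require 'rubygems'
require 'ferret'

part_dirs = Dir.glob('/mnt/index_parts/*')   # one sub-index per worker

writer = Ferret::Index::IndexWriter.new(:path   => '/mnt/index_merged',
                                        :create => true)
readers = part_dirs.map { |dir| Ferret::Index::IndexReader.new(dir) }
writer.add_readers(readers)   # fold every sub-index into the new index
readers.each { |reader| reader.close }

writer.optimize               # merge down to a single segment
writer.close

The optimize call can take a while on an index this size, but it leaves
you with one clean segment to search.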
Erik