I have a data-mining task which loads data as a big XML tree (10+ MB)
and then reorganizes it. Even loading it with Hpricot takes 10-20
seconds. I don’t want to do it for every manipulation I want to try,
especially for sequences of transformations.
Thus I wonder what’s a good way to keep the huge object in memory
between the runs of querying scripts. Can Rails be used for that?
I’d rather avoid writing a client-server platform, or using it per se,
unless there’s already an existing one. A vague intuition is, it
should be something like threads – one thread parses XML and keeps it
in memory, another starts up later, somehow joins the memory space of
the first one, queries/transforms it, and ends. Then other queries/
transformations can all be run. Is there anything like it?
That’s just plain serialization, isn’t it? I’ve seen that and
Madeleine; but my wish is to keep the objects in memory without the
need to dump/reload it, however fast. (That would be a last resort.)
The question is, can we keep an object in memory in one thread, and
explore/change it from another? In the worst case, we can probably
quickly dump an object into a memory region and reload it back via
Marshal – I guess a crude solution is forming here, using shared
memory or RAM disk – have to see what’s there for macs… But still
I wonder what folks think in terms of all kinds of RAM persistence in
ruby solutions.
Aren’t you overengineering a little? You want to amortize a ten-second
startup cost over a (presumably) large number of operations against some
dataset. But you keep talking about threads. That tells me that your
process will run for a long time and will know all the operations it has
to execute upfront. In that case, forget about threads and just
serialize your operations. Your life will be much simpler.
But on the other hand, you talk about shared memory and about not
wanting to write a client/server application. That suggests that you’re
thinking of keeping this dataset around and having other PROCESSES send
requests to it at arbitrary times. In that case, don’t use threads
either, or shared memory for that matter. Life is too short to debug all
that stuff. Write yourself a little client-server application and be
done with it. If you don’t want to deal with the network programming,
use EventMachine.
That’s just plain serialization, isn’t it? I’ve seen that and
Madeleine; but my wish is to keep the objects in memory without the
need to dump/reload it, however fast. (That would be a last resort.)
I find that odd. Keeping something in memory is usually a solution
for some kind of business requirement (e.g. to make things fast). Why
would you want to keep something in mem if it can be persisted on disk
really fast? I don’t know the volume of what you need to handle but did
you actually try out how fast it is?
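One way to answer that question is to time a Marshal round trip
directly. The following is only a rough sketch: the sample data built by
build_sample_tree is a made-up stand-in (plain hashes and arrays), the
helper names are hypothetical, and it assumes the real parsed structure
can be marshalled at all, which may not hold for objects backed by a C
extension.

```ruby
require 'benchmark'
require 'tmpdir'

# Hypothetical stand-in for the parsed data: an array of small hashes.
# Replace this with whatever structure the real parse produces.
def build_sample_tree(n)
  (1..n).map { |i| { 'id' => i, 'name' => "node#{i}", 'children' => [i, i + 1] } }
end

# Write the object graph to a file in Marshal's binary format.
def dump_tree(tree, path)
  File.open(path, 'wb') { |f| Marshal.dump(tree, f) }
end

# Read the object graph back from the file.
def load_tree(path)
  File.open(path, 'rb') { |f| Marshal.load(f) }
end

# Return [dump_seconds, load_seconds] for one round trip.
def time_round_trip(tree, path)
  dump_time = Benchmark.realtime { dump_tree(tree, path) }
  load_time = Benchmark.realtime { load_tree(path) }
  [dump_time, load_time]
end

# Example use (the path is arbitrary; a RAM disk mount would also work):
#   path = File.join(Dir.tmpdir, 'tree_cache.bin')
#   dump_time, load_time = time_round_trip(build_sample_tree(100_000), path)
#   puts "dump: #{dump_time}s, load: #{load_time}s"
```

If the load time comes out far below the 10-20 second parse, caching the
dump on disk may already be good enough.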
The question is, can we keep an object in memory in one thread, and
explore/change it from another?
Yes, of course. Easily sharing memory is one (if not the) major
aspect of multithreaded applications. But reading your other posting I
am not sure whether you have the proper idea of MT programming. If you
only want to do one set of manipulations at a time you do not need
multiple threads because there is no concurrency involved.
In the worst case, we can probably
quickly dump an object into a memory region and reload it back via
Marshal – I guess a crude solution is forming here, using shared
memory or RAM disk – have to see what’s there for macs… But still
I wonder what folks think in terms of all kinds of RAM persistence in
ruby solutions.
As James suggested, using DRb is one option. Then you can decide whether
to manipulate the object graph in the server process or send it off to
the client (and probably send it back after doing your changes). It’s
probably the best solution in your case because you can start arbitrary
client processes and manipulate state in the server. But you should
make sure that access is properly synchronized to cope with multiple
clients that connect concurrently.
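A minimal sketch of what that could look like, assuming the data can be
wrapped in a small front object. TreeServer, its fetch/store methods,
the serve_tree helper and the URI are all invented for illustration, and
a Hash stands in for the parsed XML tree. A Mutex serializes access so
that concurrent clients do not interleave their changes.

```ruby
require 'drb/drb'

# Hypothetical front object that keeps the data in the server process.
class TreeServer
  def initialize(tree)
    @tree = tree
    @lock = Mutex.new
  end

  # Read one entry while holding the lock.
  def fetch(key)
    @lock.synchronize { @tree[key] }
  end

  # Change one entry while holding the lock.
  def store(key, value)
    @lock.synchronize { @tree[key] = value }
  end
end

# Start serving the given tree and return the URI clients should use.
# The caller decides whether to block afterwards with DRb.thread.join.
def serve_tree(tree, uri)
  DRb.start_service(uri, TreeServer.new(tree))
  DRb.uri
end

# Server script (run once, pays the parse cost once):
#   serve_tree(slow_parse_result, 'druby://localhost:8787')
#   DRb.thread.join
#
# Client script (run as often as needed):
#   require 'drb/drb'
#   tree = DRbObject.new_with_uri('druby://localhost:8787')
#   tree.store('title', 'changed')
#   puts tree.fetch('title')
```

Because only method calls and their arguments cross the process
boundary, the large structure itself stays in the server and is never
dumped or reloaded.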