Should *most* memory be released back to the system?

On 10/18/07, Yohanes S. wrote:

I don’t favour the long-running process model for server. I prefer to
fork() for each request. So I’m rarely bothered by whatever ruby’s GC
quirks that I may have triggered. I understand that this approach
is not trendy anymore and RoR does not support this model, but I’m
just throwing it out in the open for an alternative work-around where
possible.

Hi Yohanes: most Rails deployments use a process pool but with longer
lifetime than a single request. You can let the process exit after
max_child_requests and rely on the parent to fork or the process
supervisor to respawn. Perhaps it isn’t trendy, but this approach is
common and works well for both Ruby and Rails.
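
A minimal Ruby sketch of the recycling idea described above, assuming a Unix-like system where fork is available. The method name serve_with_recycling and the block-based request handler are hypothetical and not part of Rails or Mongrel. A real pool keeps several children alive at once; this sketch runs them one after another only to keep the lifecycle visible, and it assumes each response fits on a single line.

```ruby
# Hypothetical sketch: each child serves at most max_child_requests requests,
# then exits so that all memory it grew is handed back to the OS.
def serve_with_recycling(requests, max_child_requests: 3)
  responses = []
  pids = []
  requests.each_slice(max_child_requests) do |batch|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      batch.each { |req| writer.puts(yield(req)) }  # handle this child's share
      writer.close                                  # flush before leaving
      exit!(0)                                      # child is done; skip at_exit hooks
    end
    writer.close
    responses.concat(reader.read.split("\n"))       # read until the child closes its end
    reader.close
    Process.wait(pid)                               # reap the child, then "respawn"
    pids << pid
  end
  [responses, pids]
end

responses, pids = serve_with_recycling(%w[a b c d e], max_child_requests: 2) { |r| r.upcase }
puts "#{pids.size} children served #{responses.size} requests"
```

The parent here plays the role of the supervisor: it never runs request code itself, so whatever the handler allocates disappears with the child.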

Also of interest, Hongli L. has recently done some work to make Ruby
GC copy-on-write friendly and thus more attractive to fork:
http://izumi.plan99.net/blog/

Best,
jeremy

On Fri, 19 Oct 2007 04:13:58 +0900, Joel VanderWerf wrote:

Yohanes S. wrote:

mallopt(M_MMAP_THRESHOLD, 0); /* declared in malloc.h */

Very interesting. I’d like to use this as a diagnostic.

I patched ruby’s main.c to call mallopt() before anything else. It seems
to be using a huge amount of memory, though. Is this normal?

Using mmap for individual allocations means that each allocation
gets rounded up to the nearest multiple of the page size
(typically 4k).
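
To put rough numbers on the rounding described above, here is a small arithmetic sketch in Ruby. It assumes a 4096-byte page and ignores any per-chunk bookkeeping the allocator adds, so the real figure would be somewhat higher; mmap_footprint is a hypothetical helper name.

```ruby
PAGE_SIZE = 4096  # assumption: "typically 4k", as stated in the thread

# Bytes mapped when an allocation gets its own mmap region:
# the request rounded up to a whole number of pages.
def mmap_footprint(bytes, page_size = PAGE_SIZE)
  ((bytes + page_size - 1) / page_size) * page_size
end

requested = 1000 * 24                  # a thousand 24-byte allocations
mapped    = 1000 * mmap_footprint(24)  # each one rounded up to a full page
puts "requested #{requested} bytes, mapped #{mapped} bytes"
```

Under these assumptions, small allocations dominate the cost, which would be consistent with the "huge amount of memory" seen after setting the threshold to 0.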

That aside, I’m not sure it’s that useful as a diagnostic; note
that only M_MMAP_MAX chunks will ever be allocated at a time with
mmap; further allocations beyond that are allocated the normal way
by advancing sbrk.

-mental

MenTaLguY writes:

gets rounded up to the nearest multiple of the page size
(typically 4k).

That aside, I’m not sure it’s that useful as a diagnostic; note
that only M_MMAP_MAX chunks will ever be allocated at a time with
mmap; further allocations beyond that are allocated the normal way
by advancing sbrk.

-mental

I suppose all these could be offered as an extension module. At the
very least allow: Malloc.m_mmap_threshold=(),m_mmap_max=(), and
stat() which returns wrapped struct mallinfo.[1] This is so you don’t
have to pay for allocations not done by your app like allocations
internal to ruby and/or RoR.

From struct mallinfo, it seems you can get the actual memory use with
hblkhd + uordblks which would provide yet another way to diagnose the
rising VSZ problem. Perhaps mental can confirm this?

YS.

Footnotes:
[1]

ara.t.howard wrote:

fork() (or clone() in Linux) is cheap … it’s actually instantiating
the thread or process that costs! Depending how smart your kernel is,
you could be doing it one page fault at a time. And no matter how
smart your kernel is, above a certain ratio of virtual process size over
real process size, it’s going to start thrashing. Pay me now or pay me
later, etc.

That’s what’s so attractive about lightweight communicating processes –
emphasis on lightweight. It doesn’t cost much to start them up, move
them around, kill them, etc.

On Fri, 19 Oct 2007, M. Edward (Ed) Borasky wrote:

fork() (or clone() in Linux) is cheap … it’s actually instantiating the
thread or process that costs! Depending how smart your kernel is, you could

This is just a side note, but the sentence above reminded me of it.

Some time ago I wrote a Mongrel variation that used fork() on incoming
requests instead of spawning a thread. Throughput on it was lousy,
comparatively. Somewhere around an order of magnitude worse than using
Ruby threads.

It would work for modest volume sites, but there was a large response time
tax imposed by forking versus using the Ruby threads. This was tested on
a Linux box, though it was an older (2.4.x kernel).

Kirk H.

2007/10/19, [email protected] [email protected]:

Ruby threads.

It would work for modest volume sites, but there was a large response time
tax imposed by forking versus using the Ruby threads. This was tested on
a Linux box, though it was an older (2.4.x kernel).

Did you fork for every request? If so then it seems there might be a
more optimal solution (starting worker processes and then sending off
requests to them via DRb for example).

Cheers

robert

[email protected] wrote:

comparatively. Somewhere around an order of magnitude worse than using

Yeah … I should have been more explicit. When you do a fork/clone in
Linux, an “empty” process/thread is created. You get a task control
block and an empty memory map and that’s about it. That doesn’t take
very much time or space.

But when you actually want that process/thread to do something, its code
(text) pages have to be given page frames and loaded into RAM, which I
called “instantiating”. And those code pages refer to data pages and
those have to be given page frames in RAM, they read data from disk
and those pages have to be given page frames in RAM, etc. That’s
“demand paging” – nothing happens until an instruction gets a page
fault, unless you count the kernel’s lookahead mechanisms.

In the high-level view, most “modern” operating systems – Solaris,
Windows, Linux and BSD/MacOS – work the same way. There are minor
variations on what things are called and various tuning knobs, but
essentially you have pages on disk, page frames in RAM,
page-fault-driven on-demand movement of code and data into RAM and some
background processes/daemons/kernel threads that try to maintain a
balance of all the many demands for page frames.

When it works, it works well, and when it doesn’t work, it fails
spectacularly – disk thrashing, out-of-memory process killers, response
times on the order of minutes for one-second tasks, freezing screens,
etc. And the solution is to add more RAM or have the software use less
RAM.

Now the killer is this: the platform (hardware and OS) designers make a
bunch of compromises so that you can get “acceptable” performance for a
lot of different languages – compiled or interpreted, static memory
allocation or dynamic memory allocation, explicit memory
allocation/deallocation or garbage collection, etc. And the language
designers make a bunch of compromises so that you can get “acceptable”
performance on modern operating systems. It’s almost as if the two types
of designers communicate with each other only every fifteen years or so.

What’s even more interesting is that proposals to change this – to
integrate language design and platform design – almost always fall back
to an experiment that was tried and failed (commercially, not
technically): Lisp machines. :)

“M. Edward (Ed) Borasky” writes:

that’s quite interesting because, while i’m not the memory expert
you are, i’ve settled on exactly that model for the many many server
process i’ve written for 24x7 systems: the robustness simply cannot
be beaten.

Ara, my knowledge is limited to whatever few ad-hoc experimentations
I’ve done.

Ed,

fork() (or clone() in Linux) is cheap … it’s actually
instantiating the thread or process that costs!

What do you mean by ‘instantiating’? When you fork() a new process is
created and scheduled. That seems instantiated enough for me.

Depending how smart your kernel is, you could be doing it one page
fault at a time.

And no matter how smart your kernel is, above a certain ratio of
virtual process size over real process size, it’s going to start
thrashing.

Do you have an example? I don’t quite get what you meant. I’m not sure
why the ratio value of VSZ over real process size (I assume it’s not
RSZ which is the resident size) matters. Can the approximate value of
this ratio be determined?

My understanding is a process thrashes because its working set during
the thrashing period cannot be paged in in its entirety. This could be
because of limited resources (pressure from other processes, etc.) or
bug in the kernel.
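
The working-set point above can be illustrated with a toy page-replacement model in Ruby. This is only a sketch: it assumes strict LRU eviction (real kernels use approximations) and a process that cycles through its pages in order; lru_page_faults is a hypothetical name.

```ruby
# Count page faults for a sequence of page accesses, given a fixed number
# of page frames and least-recently-used eviction.
def lru_page_faults(accesses, frames)
  resident = []  # pages currently in RAM, least recently used first
  faults = 0
  accesses.each do |page|
    unless resident.delete(page)            # hit: page removed, re-added below
      faults += 1                           # miss: the page must be brought in
      resident.shift if resident.size >= frames  # evict the LRU page if full
    end
    resident << page                        # mark as most recently used
  end
  faults
end

working_set = (0...8).to_a * 10   # cycle through 8 pages, 10 times

puts lru_page_faults(working_set, 8)  # working set fits in RAM
puts lru_page_faults(working_set, 7)  # one frame short
```

In this model the process takes 8 faults when all 8 pages fit and 80 faults (one per access) when it is a single frame short, which is the cliff-edge behaviour being described.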

That’s what’s so attractive about lightweight communicating
processes – emphasis on lightweight. It doesn’t cost much to
start them up, move them around, kill them, etc.

Regards,
YS.

On 19/10/2007, M. Edward (Ed) Borasky wrote:

times on the order of minutes for one-second tasks, freezing screens,
etc. And the solution is to add more RAM or have the software use less RAM.

Well, the memory subsystem is quite underdeveloped on the “general
purpose” OSes. You normally do not get resource accounting unless you
do realtime or some specialized OS but you at least get priorities for
cpu time. Nothing like that for memory. It is all just best effort,
distributed more or less proportionally to the amount of pages the
process has touched recently, and when it runs out something randomly
breaks.

Now the killer is this: the platform (hardware and OS) designers make a
bunch of compromises so that you can get “acceptable” performance for a
lot of different languages – compiled or interpreted, static memory
allocation or dynamic memory allocation, explicit memory
allocation/deallocation or garbage collection, etc. And the language
designers make a bunch of compromises so that you can get “acceptable”
performance on modern operating systems. It’s almost as if the two types
of designers communicate with each other only every fifteen years or so.

I cannot imagine what else you can do when you want an OS that runs
pretty much all languages. All that the OS can do is hand out pages,
and only the language runtime can manage the data inside those pages.
Unless you tailor the OS to one specific language or virtual machine
you cannot get anything more.

The POSIX interface might make it easier to allocate through growing
the heap rather than allocating individual pages. But still mapping
individual pages only helps in the situation when you have one huge
hole (which can be swapped out anyway), and data at the end of the
heap. This is just a special case of fragmentation. Clever allocators
can make fragmentation less likely and less severe but in the end you
cannot completely fix it unless you have a means of condensing your
data on your heap. And that you must do yourself, the OS cannot do
that. A VM may do it for you if you use an interpreted language. You
could even modify your C compiler and runtime to use indirect pointers
but then you would lose the single benefit of C - binary
compatibility.

What’s even more interesting is that proposals to change this – to
integrate language design and platform design – almost always fall back
to an experiment that was tried and failed (commercially, not
technically): Lisp machines. :)

Well, that’s where you get if you manage the language objects in the
OS (assuming that a lisp machine is the thing where you basically run
lisp runtime on the bare metal). It’s perfectly integrated but you
lose the ability to run other languages easily because you have to map
them somehow to your chosen language. For some that are similar enough
it might be easy, for others difficult, and for some (near)
impossible.

It’s been done for several languages already. You get a nice toy and
perhaps an environment for embedded or specialized systems. But not a
general purpose desktop system because you want the ability to run any
language in which a piece of software happens to be written.

Thanks

Michal

On Sat, 20 Oct 2007, Robert K. wrote:

Some time ago I wrote a Mongrel variation that used fork() on incoming
requests instead of spawning a thread. Throughput on it was lousy,
comparatively. Somewhere around an order of magnitude worse than using
Ruby threads.

Did you fork for every request? If so then it seems there might be a
more optimal solution (starting worker processes and then sending off
requests to them via DRb for example).

Yeah. I was just exploring the idea of memory management through
fork(). Starting worker processes and distributing requests to them is
essentially what is happening when one makes a cluster of mongrels
through one of the available clustering solutions.

Kirk H.

On Sat, 20 Oct 2007 04:55:18 +0900, Michal S. wrote:

I cannot imagine what else you can do when you want an OS that runs
pretty much all languages. All that the OS can do is hand out pages,
and only the language runtime can manage the data inside those pages.
Unless you tailor the OS to one specific language or virtual machine
you cannot get anything more.

There was a paper a few years ago on the idea of a GC that could
communicate with the virtual memory manager, so that it could sweep pages
when they happened to be already swapped in, rather than intentionally
pulling them in purely to do the sweep. Or something along those lines;
the key is that right now, you have no way to know if accessing an address
will produce a page fault.

It makes a lot of sense. I always assume that big VSZs (at least in
garbage-collected apps) are bad because they lead to eventual paging, but I
don’t know nearly enough to prove or even investigate that.

As for the forking, I concur with the crowd. It’s certainly much cheaper
than it used to be, but not-forking will always be cheaper than forking,
and that’s why it’s gone out of favor - though if the downside leads to
catastrophic* edge cases, that’s not so hot either.

*I can’t remember the word I’m looking for, and neither can thesaurus.com,
and it’s the second time in two days I’ve needed it. You know the word I
mean. Non-linear, extreme, unpleasantly surprising behavior at the edges,
knee-of-the-curve. Can anyone? New thread or e-mail, please; let’s not
hijack this!

Michal S. wrote:

Well, the memory subsystem is quite underdeveloped on the “general
purpose” OSes. You normally do not get resource accounting unless you
do realtime or some specialized OS but you at least get priorities for
cpu time. Nothing like that for memory. It is all just best effort,
distributed more or less proportionally to the amount of pages the
process has touched recently, and when it runs out something randomly
breaks.

You’re right … memory management technology (hardware or OS) hasn’t
improved substantially since the days of Peter Denning and System/360.
:) Part of that is due to the fact that the equations necessary to come
to some reasonable conclusions about alternatives are ghastly. They’re
much more difficult to deal with than those that govern networking, for
example, which is why routers are so smart these days and memory
management is stuck in a time warp.

I cannot imagine what else you can do when you want an OS that runs
pretty much all languages. All that the OS can do is hand out pages,
and only the language runtime can manage the data inside those pages.
Unless you tailor the OS to one specific language or virtual machine
you cannot get anything more.

But that’s pretty much what we have now, that one specific language
being C. There was a time when operating systems and compilers were
written either in assembler or other “system programming” languages like
Bliss. But now most operating systems are written in C, most compilers
and interpreters are written in C, and it’s only end-user applications
that tend to be written in all the other languages. So really, you get
“pretty much all languages” by writing their compilers or interpreters
in C.

So you could tailor the OS (and hardware) to C. (That’s actually where
the “RISC revolution” was headed, until Intel found a way to
out-manufacture the RISC chip vendors.) But that’s not what has
happened. Instead, the OS acts as a kind of “middleware” between
compilers and interpreters and the hardware, and there’s another layer
of middleware inside the chip between the OS and a “RISC core” that
actually does the arithmetic and string operations. The Intel Mac was
only the last nail in the coffin of RISC. :)

Well, that’s where you get if you manage the language objects in the
OS (assuming that a lisp machine is the thing where you basically run
lisp runtime on the bare metal). It’s perfectly integrated but you
lose the ability to run other languages easily because you have to map
them somehow to your chosen language. For some that are similar enough
it might be easy, for others difficult, and for some (near)
impossible.

Actually, you can write compilers for other languages on Lisp machines,
and you can write an operating system in Lisp too. I don’t know how well
suited Lisp is to running an OS, but it’s an excellent language for
writing compilers and interpreters. But we don’t have Lisp machines
today for the same reason we don’t have many RISC machines today – the
alternatives had more powerful marketing and manufacturing.

[email protected] wrote:

On Sat, 20 Oct 2007, Robert K. wrote:

Yeah. I was just exploring the idea of memory management through
fork(). Starting worker processes and distributing requests to them is
essentially what is happening when one makes a cluster of mongrels
through one of the available clustering solutions.

Is that what’s known as the “Mongrel Hordes?”

On Sun, 21 Oct 2007 02:31:17 +0900, Michal S. wrote:

possibly optimize the order a bit but the pages will be probably
interlinked in weird ways and unless you do something very clever you
can easily get lost.

Right - I think the idea was, in fact, to do something clever.

I’m also not sure that there’s a good UI available on most (or any?) OSes to
see your memory map, but I admit I haven’t dug into it. On the only OS I know
cold, there’s no way to see which pages of yours are currently swapped in,
but that OS is ancient and irrelevant and doesn’t run VMs anyway.

I also imagine that if you’re truly clever, you could store data in such a
way that you might be able to know that “nothing in that swapped-out page
is referenced anymore”, and thus be able to free it without paging it in.
Certainly if the system page size is 4K and I’m freeing a 32K object, I
know that the inner 24-28K doesn’t need swapping in; I don’t know if it
ends up being swapped in today’s VMs. As for smaller objects, I suppose
you’d need to store the metadata in its own page, so that you could sweep
through object references without sweeping through objects. Again, dunno
if that’s already being done.
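
The 24-28K figure above can be checked with a little page arithmetic. This Ruby sketch is one way to read it, under the assumption that the first and last pages an object touches may be shared with neighbours (or hold a header) and so might have to be paged in, while the pages strictly between them never need to be. The helper name interior_bytes is hypothetical.

```ruby
# Bytes of an object that lie in pages strictly between the first and the
# last page it touches. Only these pages can be released without being read.
def interior_bytes(addr, size, page_size = 4096)
  first_page = addr / page_size
  last_page  = (addr + size - 1) / page_size
  interior_pages = [last_page - first_page - 1, 0].max
  interior_pages * page_size
end

puts interior_bytes(0, 32 * 1024)    # page-aligned 32K object
puts interior_bytes(100, 32 * 1024)  # same object starting 100 bytes into a page
```

With a 4K page this gives 24576 bytes for the aligned case and 28672 for the unaligned one, matching the 24-28K range quoted.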

In general, the impression I got from the paper was that today’s VMs don’t
spend a lot of time thinking about how their memory usage is physically
mapped, and that’s partly because there’s no good way to communicate with
the OS about that. An API that would allow a VM to hint the OS, and find
out if pages are swapped in, would seem to help there.

Here’s the paper I think I was looking at; it’s dated 2004, and I don’t
know if changes in Linux and/or HotSpot have surpassed it:

On 20/10/2007, M. Edward (Ed) Borasky wrote:

improved substantially since the days of Peter Denning and System/360.
:) Part of that is due to the fact that the equations necessary to come
to some reasonable conclusions about alternatives are ghastly. They’re
much more difficult to deal with than those that govern networking, for
example, which is why routers are so smart these days and memory
management is stuck in a time warp.

Actually routers aren’t that smart either. IP succeeded because it
does not need smart routers. Everybody figures that just adding more
bandwidth is easier than to make more efficient use of the current
bandwidth. The state of the art router does (beyond the bare minimum
needed to function as a router) classify the traffic into a few
priority classes and implements some logic that makes higher priority
traffic somewhat more likely to come through. Again, the only thing it
gets over memory management are some crude priorities.

Bliss. But now most operating systems are written in C, most compilers
of middleware inside the chip between the OS and a “RISC core” that
actually does the arithmetic and string operations. The Intel Mac was
only the last nail in the coffin of RISC. :)

C is akin to assembly and languages like Pascal that also use pointers
and raw memory access. The evolution went from machine code to machine
specific assembly and then to C and other languages that try to hide
cpu and platform differences. Also from running on bare metal to
virtualisation (which strikes back today in the form of xen or vmware
that actually allow dividing memory between tasks - at some expense),
single task OSes, and multitasking OSes.
To make use of C and similar languages easier, current OSes provide
reusable and shareable services and support for libraries which you
would hardly find in machine code. However, there are very few means
for managing multiple processes actually running in parallel. The
security models of current systems are a joke, and resource management
is nearly non-existent. It feels like we are halfway towards
multitasking OSes currently.

The RISC cpus weren’t that big a win. The instruction set is simpler and
more symmetric (which is what improves over time even for intel, and
the 64-bit version is way better than 32-bit from what I have heard).
Theoretically the simpler instructions could give the programmer more
control to do better optimization. But in practice the optimization
performed by the compilers is lousy on any architecture you pick, and
the simpler instructions require more memory to record the program.
Add some interesting features that expose more of the internal working
of the cpu like delayed branching or imperfect interrupts, and you get
a big mess most of the time. Yes, the compilers might improve over
time. But writing even a working compiler becomes more difficult with
these instruction sets that expose too much.

and you can write an operating system in Lisp too. I don’t know how well
suited Lisp is to running an OS, but it’s an excellent language for
writing compilers and interpreters. But we don’t have Lisp machines
today for the same reason we don’t have many RISC machines today – the
alternatives had more powerful marketing and manufacturing.

I do not see any problem with using Lisp for the system interface. You
will have to write some kernel in a lower level language to provide
the lisp runtime but then you can make the syscall interface in Lisp.
You could probably do lots of stuff that is currently in Linux also in
Lisp.

However, Lisp is a functional language. While there is well known art
of interpreting procedural languages in procedural languages,
functional languages in functional languages, and even functional
languages in procedural languages, I haven’t heard of an interpreter
of a procedural language written in a functional language (even an
experimental, let alone useful). Since I am not an expert in the field
there might be monographs piling on the topic without me noticing.
So far I have seen only one or two articles about unsolved
difficulties with writing such an interpreter.

Even if you wrote an interpreter for Ruby, Python, and whatnot in
Lisp there is still a fundamental problem that prevents general use. C
(and C++ and assembly) code is used to get the speed nearing that of
running on the bare metal for some specialized tasks. Once you turn
everything into Lisp objects you give up that, and you cannot get it
back.

Thanks

Michal

On 20/10/2007, Jay L. wrote:

the key is that right now, you have no way to know if accessing an address
will produce a page fault.

You could probably see your memory map and your pagefaults, after all
it’s your memory. I am not sure it would buy you anything, though. You
have to visit all pages to perform the garbage collection. You could
possibly optimize the order a bit but the pages will be probably
interlinked in weird ways and unless you do something very clever you
can easily get lost.

It makes a lot of sense. I always assume that big VSZs (at least in
garbage-collected apps) are bad because they lead to eventual paging, but I
don’t know nearly enough to prove or even investigate that.

It’s inevitable if the garbage collector accesses all the memory, and
most do. You could have reference counting that only discards stuff on
currently mapped pages (which is what reference counting does most of
the time anyway - it only decreases the count for stuff that is
accessed). However, reference counting is not sufficient to get rid of
garbage so you have to sift through the whole heap eventually, and
that pages it in in its entirety.

Some applications might get away with large VSZ and no paging but it
proves that the memory is in fact useless and probably was not
discarded only because of memory management errors - leaks.

Thanks

Michal

On Oct 21, 2007, at 10:41 PM, Michal S. wrote:

I haven’t heard of an interpreter
of a procedural language written in a functional language (even an
experimental, let alone useful).

There’s Pugs:

http://www.pugscode.org

– fxn

On 10/21/07, Michal S. wrote:

However, Lisp is a functional language. While there is well known art
of interpreting procedural languages in procedural languages,
functional languages in functional languages, and even functional
languages in procedural languages, I haven’t heard of an interpreter
of a procedural language written in a functional language (even an
experimental, let alone useful). Since I am not an expert in the field
there might be monographs piling on the topic without me noticing.
So far I have seen only one or two articles about unsolved
difficulties with writing such an interpreter.

One of the examples in the OCaml book is a small Basic interpreter.
http://caml.inria.fr/pub/docs/oreilly-book/html/book-ora058.html

Also, Scheme is mostly functional, but Common Lisp is multiparadigm.

martin

On 22/10/2007, Xavier N. wrote:

On Oct 21, 2007, at 10:41 PM, Michal S. wrote:

I haven’t heard of an interpreter
of a procedural language written in a functional language (even an
experimental, let alone useful).

There’s Pugs:

http://www.pugscode.org

Thanks for the replies, looks like I have really missed some recent
stuff here.

Michal

Martin DeMello wrote:

Also, Scheme is mostly functional, but Common Lisp is multiparadigm.

Both are really Lisp 1.5 with some simple core semantics changes and
different libraries. :) But seriously, both are “mostly functional” but
contain imperative features. The core semantic difference between Scheme
and Common Lisp is how the language treats “foo” in the following:

(foo arg1 arg2 arg3)