Threads, Tasks, Coroutines, Processes, and Events

Stanislav Shalunov, 2005-07-12

Summary: The common fallacy that concurrent tasks require threads is debunked. I am not an expert on threads and feel awkward writing this essay, but I have not found any concise formulation of this argument anywhere and had to write it down. Feel free to correct me.

An event queue is even cute.

Definitions

The following are some of the terms used below. I don't insist that these are the correct definitions; I simply need them to explain the differences.
Kernel Scheduling Entity (KSE)
The unit of kernel scheduling. Something that can receive timeslices from the kernel, be preempted, and block on kernel events.
Task
A sequence of actions to be performed by a program while keeping the same context.
Coroutines
Concurrently executing pieces of code within the same program that use cooperative multitasking to ensure each makes progress.
Thread
The term is very much overloaded. I use it here to refer to POSIX threads, Microsoft Windows threads, Solaris threads, etc. Details of implementations can differ (e.g., FreeBSD has several different POSIX thread implementations that work differently).
Process
I'm afraid to try to give a definition of a process after all the giants who failed. You know what a process is. An operating system process, a KSE with its own address space.
Event loop
A common technique of organizing a program that obtains external events of some kind and processes them as they come.

An example

Suppose we need to write a program that: How would you solve this toy problem? Stop reading until you formulate the answer.

Programmers that come from different traditions can suggest different ways of implementing it:

  1. With processes: a process enters a blocking accept() loop; when a new connection is established, a child process is spawned with a fork() call; any given child process handles its own session: it reads requests, keeps a data structure for the context (perhaps a hash with variable names as keys), and writes responses. There's no communication between the children, and very little communication between a child and the parent (the parent needs to spawn the child, then read and discard or record its exit status).
  2. With threads: a similar architecture with threads replacing processes.
  3. With an event loop: only a single routine executes; a data structure records the state of the system; the data structure could be, e.g., hash of hashes (client identifiers such as sockets on first level of hashing and variable names on the second level of hashing); another data structure is used to record partial inputs (e.g., a hash with client identifiers as keys and strings as values); an analogous data structure holds outputs to be sent to different clients; all sockets are put into non-blocking mode; a select() loop governs program execution.
Do you fit into any of these categories? If you've selected the solution with processes, you're probably a UNIX programmer who hasn't been exposed to Solaris for too long; you might enjoy reading this. If you've selected an event loop, you already know what I'm going to say; don't waste your time---go program something. If you've selected the threads solution, I wrote this for you. If you didn't get the event loop business or think that parallelism (in the sense of multiple tasks) necessarily requires threads, I would like to point out that it does not.

(If your solution doesn't fit into this trichotomy, you can go back to writing your program that constructively extends Rice's theorem by taking two programs, performer and detector, on input, and always, as long as the detector can print ``no'', always stops and prints ``yes'' or ``no'', and prints ``yes'' with performer on input, outputs a program that represents the same function as the performer and that elicits ``no'' from the detector. I know, junior-level nonsense. Well then, try it in Lisp. Sorry about the digression; I do need to deal with hecklers.)

Other solutions

Many other solutions for the problem exist [C10K]. They are, generally, either refinements of one of the approaches (e.g., a particular mechanism, such as select(), poll(), /dev/poll, or kqueue() to obtain notification of events), or combinations of the approaches (e.g., an event loop running within multiple threads). These details can be extremely important and difficult to get right [DARC04], but here only the basic approach is discussed.

The Polemics

The widely cited argument against threads [OUST95] actually advises to use threads for high-end applications (e.g., databases) and applications that require true CPU concurrency on multi-processor systems. It advises to mostly keep even multi-threaded code single-threaded, with just event-specific threads. Interestingly, the author both claims that threads can have (usually have? always have?) poor performance and advises to use them only in performance-critical contexts, reserving event programming for distributed systems, GUIs, and the like.

What is widely seen as an attempt to refute [BEHR03] the anti-thread thesis uses, as its core argument, the fact that threads and events are dual (this was known for a long time [LAUE78]). Thus, you can do the same things with the two mechanisms, with essentially the same performance. Therefore, one should use the mechanism that's more natural for the task at hand. The authors proceed to claim that for most tasks, threads are the more natural approach; two very valid points are (i) the difficulty of error handling with events and (ii) the greater probability of memory leaks with events. The authors admit that for fan-in and fan-out operations, such as multicasting, the event model is more natural.

But (yes, there is a ``but'' the size of a blue whale), the paper states that there is no well-performing thread implementation (other than what the authors rigged up for their paper). The authors propose a number of improvements to address performance and other limitations. Until these improvements are understood, implemented, deployed, and tested at least by a few of popular operating systems, the authors' argument that threads can work well is also an argument that they currently do not. Incidentally, the authors argue for a user-level thread implementation with cooperative multitasking, with the kernel knowing little about threads. Most of the recent movement in this area is towards more kernel involvement in threads, not less (e.g., Solaris 10 replaces an N:M implementation, where multiple threads, implementing multiple tasks, can exist within a KSE, with a 1:1 implementation, where one thread is exactly one KSE).

So, should I use threads or events?

In theory, a proper implementation of threads is as good as a proper implementation of events, so one can choose freely the mechanism that is more convenient. If it's convenient to use a thread for each task, do so; if fan-out makes it more convenient to write the application as an event loop, do so.

However, in practice, currently, since all threading libraries are so bad, you should seriously consider using events. Here are the possible reasons to use threads:

  1. You need concurrency for a performance-critical application, and a hardware source of concurrency exists;
  2. You don't need portability and are using a good threads library not available to the rest of the world (and threads are the more natural approach to your problem);
  3. You were taught thread-based programming, and cannot think off-hand of another way to implement tasks.
The first reason is a solid one. The second is dubious: you trade off portability for being able to write more natural code; if one always did that, almost all programs would have been written in domain-specific languages running on exotic platforms. The third, needless to say, is not a reason at all.

OK, I'm using threads. How many should I have?

What is your reason for using threads? If it's 3 (that you want to avoid using events), I can't really help you, as I don't consider this a valid reason. If it's 2 (that it's more natural and hurts nothing), then use as many threads as you have tasks or whatever feels natural. If it's 1 (that you need to exploit potential parallelism in hardware), then the number should be related to the number of things that can happen in parallel.

The number of parallel processes that can happen in a computer is related to the number of CPUs, disks (disk I/O events, unfortunately, cannot be efficiently used in event-based programs on UNIX: even if you set the file descriptor to nonblocking, you will still wait, as select() will always report that file descriptors for regular files are ready, even if it'll take 10ms for a disk seek before data comes), and various other I/O channels that behave like disks. The number of threads should be related to this degree of potential parallelism; not necessarily equal (performance might be better if the number of threads is the degree of parallelism plus or times some small constant), but related to it, rather than the number of tasks. Note that as the number of threads gets larger than the number of CPUs, the potential for unnecessary context switches increases, so the small constant mentioned above should actually be small.

References

[LAUE78]
Hugh C. Lauer, et al. On the duality of operating system structures. IR1A. October 1978.
[OUST95]
John Ousterhout. Why threads are a bad idea (for most purposes). September 1995.
[BEHR03]
Rob von Behren, et al. Why events are a bad idea (for high-concurrency servers). HotOS IX. May 2003.
[C10K]
Dan Kegel. The C10K problem. November 2003.
[DARC04]
Jeff Darcy. High-performance server architecture. March 2004.

Comments