Optimizations That Aren't (In a Multithreaded World)
This article appeared in C/C++ Users Journal, 17(6), June 1999.
Do you write multithreaded programs? Chances are that you do, or work in a group where others do. Multithreading is getting more popular because it can make our programs more responsive, more efficient, and more scalable on multiprocessor machines. But even if you only write single-threaded programs, keep reading... if you work in a company (or share libraries) with other projects that are multithreaded, you're probably using multithread-safe builds or versions of some shared libraries, and what I'm going to talk about probably affects you today.
Do you care about performance and optimization? Again, chances are that you do. No matter how fast our hardware gets, performance is important. Optimization and multithreading are frequently related, as we use multiple threads to achieve higher performance by letting operations that wait for different resources run in parallel.
What's the Fuss About?
The point of this article is that some widely-used library optimizations are effective in single-threaded mode, but can actually degrade performance if the library is built for possible multithreaded use. Here I'm talking about common and popular optimizations that you may be using already, even if you don't know it, because they're present deep inside many popular vendors' libraries.
Here's more bad news: Often the performance impact is severe, far heavier than it needs to be, even in some of today's most popular shipping commercial libraries. Why? Because there's often more than one way to make code thread-safe, and some library writers pick methods that work all the time but aren't the most efficient for their particular situation. In particular, certain popular libraries still use general-purpose constructs like critical sections when more efficient alternatives would do; it makes a big difference.
And here's the worst news of all: Above I wrote "if the library is built for possible multithreaded use." This means the degradation can affect you even if your program happens to be single-threaded.
In this article, I'll demonstrate one type of optimization that is often a false optimization but is commonly found in commercial libraries; I'll also show an easy way to detect it. Using similar techniques, you can determine whether you're being affected by similar hidden pitfalls in the libraries that you're using now. If you are affected, you'll also have the information you need to report the situation to your library vendor so they can correct the problem.
I'm going to illustrate the general problem with a specific example: A string class that uses the copy-on-write (COW) optimization. But please note:
o COW isn't the only culprit. Everything I'll discuss applies to any transparent optimization where two or more objects can share information or state behind the scenes, where client code can't see. This includes, but is not limited to, shared-implementation or lazy-copy optimizations like COW.
o Strings aren't the only culprits. Optimizations like these can and have been used for many expensive-to-copy data structures, including collections and containers.
My examples will be in C++, but this discussion applies to any language. One reason I'm writing about this topic is that standard C++ strings have acquired an undeserved "slow and inefficient" reputation in some circles. True, there certainly are needlessly inefficient implementations of C++'s string class out there on the market today; but I want to emphasize the "needlessly" part, point out that there are high-quality implementations too, and show you how to tell them apart so that you can avoid the former.
Introducing a Plain Old String
First, I'm going to show code examples that use a "plain old string," meaning a string class that doesn't perform any copy-on-write (or other shared-information) optimizations under the covers. Whenever you copy or assign this kind of string, the target string physically copies the source string's contents into its own internal buffer, growing its internal buffer if necessary. The idea is that, as soon as the copy or assignment operation is complete, you really do have two separate and independent string objects.
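In code, "plain old" copying might look like the following minimal sketch. The class and member names here are illustrative only (not from any particular library); the point is that the copy constructor and assignment operator always duplicate the characters immediately:

```cpp
#include <cstring>

// A "plain old string": every copy duplicates the characters immediately.
class String {
public:
    String( const char* s = "" )
        : len_( std::strlen( s ) ), buf_( new char[len_ + 1] )
    { std::strcpy( buf_, s ); }

    String( const String& other )                  // copying...
        : len_( other.len_ ), buf_( new char[other.len_ + 1] )
    { std::strcpy( buf_, other.buf_ ); }           // ...always physically copies

    String& operator=( const String& other ) {     // assignment too
        if( this != &other ) {
            char* newbuf = new char[other.len_ + 1];
            std::strcpy( newbuf, other.buf_ );
            delete[] buf_;                         // replace our own buffer
            buf_ = newbuf;
            len_ = other.len_;
        }
        return *this;                              // two independent strings
    }

    ~String() { delete[] buf_; }

    const char* c_str() const { return buf_; }
private:
    std::size_t len_;
    char*       buf_;
};
```

As soon as a copy or assignment completes, the two objects share nothing: each owns its own buffer.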
Given a "plain old string," what does client code have to do in order to use it safely in single- and multithreaded programs?
Using a Plain Old String: Single-Threaded
Say that we want our program to keep track of the last error message we encountered, and automatically decorate it with the time the message was generated. We might implement it using a global string value with helper functions to get and set the state:
// Example 1: A simple error recording subsystem
String SetError( const String& msg )
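A minimal sketch of such a subsystem might look like the following. Here String is just std::string, and TimeAsString is a hypothetical helper, stubbed out with a fixed value for illustration (a real version would format the current wall-clock time):

```cpp
#include <sstream>
#include <string>

typedef std::string String;   // stand-in for the article's String class

// Hypothetical helper, stubbed for illustration.
String TimeAsString() { return "12:01:09.65"; }

String err;                   // the last error message
int    count = 0;             // how many errors have been recorded

String GetError() { return err; }

String SetError( const String& msg ) {
    std::ostringstream n;
    n << ++count;
    err  = n.str() + ": ";    // a thread could be preempted right here...
    err += msg;               // ...leaving err only partially formatted
    err += " (" + TimeAsString() + ")";
    return err;
}
```

Note that SetError builds up err incrementally with several distinct String operations; that detail matters in the multithreaded discussion below.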
As long as our program is single-threaded, we never have to worry about GetError or SetError being called at the same time on different threads, and all is well. For example, given:
String newerr = SetError( "A" );
cout << newerr << " (set by A)" << endl;
The output would be something like:
1: A (12:01:09.65) (set by A)
Using a Plain Old String: Multi-Threaded
Even if you're already familiar with thread safety issues, and know why and how to safely share resources between threads by serializing access using critical sections or mutexes, please skim this section anyway; it forms the basis for the more detailed examples later on.
In a nutshell, things are a bit different if our program is multi-threaded. Functions like GetError and SetError might well be called at the same time on different threads. In that case, calls to these functions could be interlaced, and they will no longer work properly as originally written above. For example, consider what might happen if the following two pieces of code could be executed at the same time:
// thread A
String newerr = SetError( "A" );
cout << newerr << " (set by A)" << endl;

// thread B
String newerr = SetError( "B" );
cout << newerr << " (set by B)" << endl;
There are many ways that this can go wrong. Here's one: Say that thread A is running and, inside the SetError( "A" ) call, gets as far as setting err to "1: " before the operating system decides to preempt it and switch to thread B. Thread B executes completely, then thread A is reactivated and runs to completion. The output would be:
2: B (12:01:09.125) (set by B)
2: B (12:01:09.125)A (12:01:09.130) (set by A)
It's easy to invent situations that produce even stranger output, but in reality you'd be lucky to get anything this sensible. Because a thread's execution can be pre-empted anywhere--including right in the middle of a String operation, such as String::operator+= itself--you're far more likely to see just intermittent crashes as String member functions on different threads attempt to update the same String object at the same time. (If you don't see this problem right away, try writing those String member functions and see what happens if their execution gets interleaved.)
The way to correct this is to make sure that only one thread can be working with the shared resources at a time. We prevent the functions from getting interlaced by "serializing" them with a critical section or mutex or similar device. But who should be responsible for doing the serialization? There are two levels at which we could do it, and the main tradeoff is that the lower the level at which the work is done, the more locking needs to be done, and often needlessly--that's because the lower levels don't know whether acquiring a lock is really necessary for a given operation and so have to do it every time, just in case. The reason why excessive locking is a major concern is that acquiring a mutex lock is typically an expensive operation on most systems, approaching or surpassing the cost of a general-purpose dynamic memory allocation.
1. [WRONG] Do locking within String member functions. This way, the String class itself assumes responsibility for the thread safety of all of its objects. This is a bad choice for two (probably obvious) reasons: First, it doesn't solve the problem because it's at the wrong granularity: Option 1 only ensures that String objects won't get corrupted (that is, they're still valid String objects as far as String is concerned), but it can't do anything about the SetError function's interleaving (the String objects can still end up holding unexpected values exactly as just illustrated above, which means they're not valid error messages as far as the Error module is concerned). Second, it can seriously degrade performance because the locking would be done far more frequently than necessary--at least once for every mutating String operation, and possibly even for nonmutating operations!
2. [RIGHT] Do locking in the code that owns/manipulates a String object. This is always the correct choice. Not only is locking done only when it's really needed, but it's done at the right granularity: We lock "an operation" where an operation is, not a low-level String member function, but a high-level Error module message-formatting function. Further, the extra code to do the serialization of the error string is isolated in the Error module.
3. [PROBLEMATIC] Do both. In the example so far, Option 2 alone is sufficient. Option 1 is so obviously a bad choice that you might be wondering why I'd even mention it. The reason is simple: Later in this article I'll demonstrate why certain "optimizations" force us to make that performance-degrading choice and do all locking inside the class. But because (as shown above) Option 1 by itself doesn't really solve the whole problem, we can end up having to do Option 1, not instead of, but in addition to Option 2. As you might expect, that's just plain bad news all around.
In our example, implementing Option 2 means that the Error subsystem should take responsibility for serializing access to the String object that it owns. Here's a typical way to do it:
// Example 2: A thread-safe error recording subsystem
String SetError( const String& msg )
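A sketch of the serialized version follows. I'm using std::mutex and std::lock_guard as portable stand-ins for whatever critical-section or mutex facility the platform provides; the String typedef and TimeAsString helper are assumptions as before:

```cpp
#include <mutex>
#include <sstream>
#include <string>

typedef std::string String;   // stand-in for the article's String class

String TimeAsString() { return "12:01:09.65"; }  // hypothetical helper, stubbed

String     err;               // the last error message
int        count = 0;         // how many errors have been recorded
std::mutex errLock;           // serializes all access to err and count

String GetError() {
    std::lock_guard<std::mutex> lock( errLock );
    return err;
}

String SetError( const String& msg ) {
    std::lock_guard<std::mutex> lock( errLock );  // the whole body is atomic
    std::ostringstream n;
    n << ++count;
    err = n.str() + ": " + msg + " (" + TimeAsString() + ")";
    return err;
}
```

The lock is owned by the Error module and acquired once per high-level operation, not once per low-level String call.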
Now all is well, because each function body is now atomic as far as err and count are concerned and no interlacing will occur--SetError calls are automatically serialized so that their bodies will not overlap at all.
For more details about threads, critical sections, semaphores, race conditions and related topics, see any good text on operating systems or multithreaded programming. For the rest of this article, I'll assume you know just the basics.
Now Throw In an "Optimization": Copy-On-Write
Probably the most well-known string optimization is called "copy on write" (COW), a form of lazy copy. COW doesn't apply only to strings, but the motivation is straightforward: Copying or assigning a plain old string usually involves a general-purpose dynamic memory allocation, which is typically an expensive operation (compared to, say, a function call), and so copying a string becomes an expensive operation too. At the same time, often client code will make copies of string objects, but then never actually modify the copies. For example, client code can make a copy of a string object, perform some read-only operations like printing it out or calculating its length, and then destroy the copy. When that happens, it seems wasteful to have done the work to create a distinct copy only to find out later that it wasn't really needed after all.
So what's different about a COW string is that, whenever you copy or assign such a string, the target string won't immediately make a copy of the source string's contents; instead, it will just point itself at the source string's internal buffer and continue sharing that internal representation for as long as possible. Read-only operations on either (source or target) string won't care, after all. Only when a possibly mutating operation is attempted on one of the strings will the string finally be forced to make a physical copy--called a "lazy copy" because our lazy string only does work if it really needs to.
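As a concrete illustration, here is a minimal COW sketch. This is the thread-unsafe flavour, all names are illustrative, and Rep() is exposed only so that the sharing is visible; a real class would hide the representation entirely:

```cpp
#include <cstring>

// Minimal COW sketch (thread-unsafe flavour; illustrative names).
struct StringBuf {
    char*       buf;    // the characters
    std::size_t len;
    int         refs;   // how many CowString objects share this rep
};

class CowString {
public:
    CowString( const char* s ) : data_( new StringBuf ) {
        data_->len  = std::strlen( s );
        data_->buf  = new char[data_->len + 1];
        std::strcpy( data_->buf, s );
        data_->refs = 1;
    }
    CowString( const CowString& other ) : data_( other.data_ ) {
        ++data_->refs;                 // share the rep; no characters copied
    }
    ~CowString() {
        if( --data_->refs < 1 ) { delete[] data_->buf; delete data_; }
    }
    char Get( std::size_t i ) const { return data_->buf[i]; }  // read-only: no copy
    void Set( std::size_t i, char c ) {
        EnsureUnique();                // possibly mutating: lazy-copy if shared
        data_->buf[i] = c;
    }
    const StringBuf* Rep() const { return data_; }  // exposed for illustration
private:
    void EnsureUnique() {
        if( data_->refs > 1 ) {                 // shared: do the lazy copy now
            StringBuf* d = new StringBuf;
            d->len  = data_->len;
            d->buf  = new char[d->len + 1];
            std::strcpy( d->buf, data_->buf );
            d->refs = 1;
            --data_->refs;                      // release the shared rep...
            data_ = d;                          // ...and adopt our private copy
        }
    }
    StringBuf* data_;
};
```

Copying just bumps a reference count; the deep copy happens only on the first write, if it happens at all.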
Is COW worth it? In a single-threaded program, the answer is: Sometimes. Whether it's really an optimization depends heavily on how string objects are used. See Rob Murray's discussion and empirical measurements in C++ Strategies and Tactics for more information about COW in the real (but single-threaded) world.
The real fun starts when you move to a (possibly) multithreaded environment. As it turns out, using COW can exact an unexpected performance penalty if thread-safety is important, and a severe one in some popular (but unnecessarily inefficient) commercial implementations.
Using a COW String: Multi-Threaded
Now let's return to our friend, the Error subsystem. Here's the problematic client code again:
// thread A
String newerr = SetError( "A" );
cout << newerr << " (set by A)" << endl;

// thread B
String newerr = SetError( "B" );
cout << newerr << " (set by B)" << endl;
As before, the issue is that there has to be some serialization (using, for example, a mutex lock) in order to avoid corrupting the Error module's internal strings and produce reasonable output. But who should be responsible for doing the serialization? Consider the two levels again:
1. Do locking within String member functions. As before, there are two problems with this: It's at the wrong granularity to solve the SetError function's interleaving, and it can seriously degrade performance because the locking would be done at least once for every mutating (and possibly even nonmutating) String operation. What's different from before, however, is that this time Option 1 is necessary anyway (see below).
2. Do locking in the code that owns/manipulates a String object. Again, this is necessary, otherwise we haven't solved the SetError interleaving problem.
3. [NECESSARY] Do both. This time the story isn't so good: Option 2 is necessary, but we have to do Option 1 too because the code that owns/manipulates a COW string can't. Specifically, client code alone can never lock the right strings in the right way because it has no way of knowing which visible string objects might be sharing a representation at any given time. Look again at the thread-safe Error subsystem code from Example 2--the locking that it does is still necessary but is now insufficient, because even though we serialize all access to the internal err error string object, the client threads are taking and manipulating copies of that object, which copies might share representation.
For example, thread A could be safely serialized inside its SetError call and manipulating the Error module's err string, at the same time as thread B (already past its safely serialized SetError) is appending to its newerr string. Note that B's newerr string began life as a copy of the err string. There's nothing obvious linking the internal err string and B's external newerr string--in fact, the Error subsystem has no way of even knowing that any of the newerr strings exist, and vice versa. But if the two strings happen to be sharing representation, which is not only possible but likely in this example, now two threads could both be attempting to execute String member functions on the same string representation even though everything outside String itself is safely serialized.
So that's why it's not enough for the Error subsystem to perform its usual duty of care. For COW strings, the String objects must additionally always protect themselves. This always forces the string class to incur some overhead, but in practice the overhead can be disastrously expensive if the implementation additionally incurs needless inefficiencies.
How severe is the problem? One developer I've heard from recently reports that he has a single-threaded test program which exercises his (popular) vendor's COW-based library features; the program executes in 18 to 20 seconds when the library is compiled for multi-threaded use, and 0.25 seconds if it's compiled for single-threaded use. (The difference is greater than the numbers imply, because both times include program startup.) It's important to note here that this developer's test program was only single-threaded, not multithreaded, but because the library had been compiled for possible multithreaded use the problem affected him anyway. If 20 programs share a library, and only one program is multithreaded, the library must be built for multithreaded use--and any problem will affect all 20 programs.
Mild vs. Severe Inefficiencies
So a thread-safe COW implementation always has to incur serialization overhead, but there's still a vast difference between "necessary" and "indefensible" overhead.
To cut a (very) long story short, it turns out that a thread-safe COW implementation can be written in such a way that the only data to which access needs to be serialized is the reference count itself. (For an exhaustive explanation of this and other aspects of COW and alternative thread-safe implementations, see Guru of the Week issues #43, #44, and #45 at www.gotw.ca. GotW #45 includes a test harness with sample well-optimized source code for no less than seven different implementations of a limited String class--five COW implementations and two non-COW implementations. This article summarizes and expands on some of the results.)
So here's the thing: Since the only shared data that needs serialization is the reference count--typically a single integer--then you can be a lot more efficient than just using a heavyweight general-purpose thread safety solution. Specifically, instead of using a mutex or a critical section (unavoidable when you have to serialize access to larger structures), you can get away with just using the following atomic integer operations if they are available on your system: atomic increment, atomic decrement-and-test, and atomic comparison. These atomic integer operations still incur overhead because they're necessarily slower than plain integer operations, but at least they're still much faster than a more sophisticated construct like a critical section or a mutex. Often they can be implemented in a single native assembler instruction, or a short series of instructions.
To get a feel for the difference, consider alternative implementations for one member function of a C++-based String class: operator[], which returns a reference to a character at a given offset in the string. For a plain (non-COW) implementation, it would look something like this, where buf_ is a pointer to the String object's internal buffer where it stores its representation:
// Plain (non-COW) operator[].
char& String::operator[]( size_t n )
{
    return *(buf_+n);
}
A COW implementation looks pretty much the same, except that it must first ensure that the representation is unshared (in operator[]'s case, there are also reasons why the function should return a non-const reference and why the representation should also be marked unshareable, but that's not directly relevant here so I'll omit it; see GotW #45 for those details):
// COW operator[].
char& String::operator[]( size_t n )
{
    EnsureUnique();
    return *(data_->buf+n);
}
The key to all this is what has to be done inside EnsureUnique for the different flavours of COW. Here are simplified implementations, which ignore "unshareable" flags and other more sophisticated operations. Here data_ points to the internal (and possibly shared) representation, which includes the reference count (refs):
// Thread-unsafe COW: no locks.
// Thread-safe COW: atomic integer operations.
// Thread-safe COW: critical sections.
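The three flavours above can be sketched as follows. To keep the sketch self-contained I've written each EnsureUnique as a free function over an explicit StringBuf representation, with std::atomic and std::mutex standing in for the platform-specific atomic integer operations and critical sections the article refers to; all names here are illustrative:

```cpp
#include <atomic>
#include <cstring>
#include <mutex>

// One shared representation (simplified: no "unshareable" flag).
struct StringBuf {
    char*            buf;
    std::size_t      len;
    std::atomic<int> refs;  // stand-in for an int touched via atomic operations
    std::mutex       cs;    // used only by the critical-section flavour
};

StringBuf* MakeBuf( const char* s ) {
    StringBuf* d = new StringBuf;
    d->len  = std::strlen( s );
    d->buf  = new char[d->len + 1];
    std::strcpy( d->buf, s );
    d->refs = 1;
    return d;
}

// Thread-unsafe COW: no locks. Fast, but two threads whose strings
// share this rep can race on refs.
void EnsureUniqueUnsafe( StringBuf*& data_ ) {
    if( data_->refs > 1 ) {                 // shared: do the lazy copy now
        StringBuf* d = MakeBuf( data_->buf );
        --data_->refs;                      // release our claim on the old rep
        data_ = d;
    }
}

// Thread-safe COW: atomic integer operations only.
void EnsureUniqueAtomic( StringBuf*& data_ ) {
    if( data_->refs.load() > 1 ) {          // atomic compare against 1
        StringBuf* d = MakeBuf( data_->buf );
        if( --data_->refs < 1 ) {           // atomic decrement-and-test:
            delete[] d->buf; delete d;      // sharing stopped in the meantime,
            data_->refs = 1;                // so keep the original after all
        } else {
            data_ = d;                      // otherwise adopt the new copy
        }
    }
}

// Thread-safe COW: a full critical section around the whole test-and-copy.
// Correct, but far heavier than the atomic flavour needs to be.
void EnsureUniqueCritSec( StringBuf*& data_ ) {
    std::lock_guard<std::mutex> lock( data_->cs );
    if( data_->refs > 1 ) {
        StringBuf* d = MakeBuf( data_->buf );
        --data_->refs;
        data_ = d;
    }
}
```

Note that the atomic flavour never takes a lock at all: the reference count is the only shared data, so atomic increment, decrement-and-test, and comparison are enough.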
COW will always add overhead, especially if it's to be made thread-safe, but this should show why there is no reason to use something as indefensibly inefficient as a critical section when atomic integer operations alone will do.
Some Actual Numbers
The difference between "necessary" and "indefensible" overhead can be profound. Here's an example taken from GotW #45's test results (the entire test harness source code can be downloaded via a link on the #45 page, if you'd like to try it on your system). The particular test I'm showing here makes copies of 50-char strings in a tight loop; one-third of the copies are never modified (the classic case for COW optimization, where the deep copy is entirely avoided), and the rest of the copied strings are modified exactly once each. I'm showing the results for six flavours of the String class:
o "Plain" is a plain non-COW string.
o "Plain_FastAlloc" is the same as Plain but using an optimized memory allocator instead of the default MSVC new allocator (note however that the latter is very good in its own right).
o "COW_Unsafe" is an efficient but thread-unsafe COW string.
o "COW_AtomicInt" is an efficient thread-safe COW string that uses atomic integer operations for serializing access to the reference count.
o "COW_CritSec" is the same as AtomicInt but uses critical section locks instead of atomic integer operations.
o "COW_Mutex" is the same as AtomicInt but uses mutex locks instead of atomic integer operations.
Here are the results (test code compiled using MSVC 5.0 SP3, running on a PentiumII under Windows98):
String Implementation        Elapsed Time
This is just one scenario, of course. It may or may not be typical of the way your program uses strings. My main purpose in sketching these results is to illustrate the following:
o Thread-safe COW is necessarily less efficient than thread-unsafe COW. In this test case, the thread safety overhead made the difference between COW's being an optimization and its being a pessimization.
o Poorly-implemented thread-safe COW (specifically COW_CritSec, which is implemented the same way as a certain popular string library) incurs a severe performance penalty. Note: If your platform lacks atomic integer operations, then you have no choice but to use a more heavyweight solution and COW becomes a severe pessimization for nearly all cases.
o It's not wrong to want to optimize, but here a far better (and simpler) optimization is to optimize the memory allocation, not the copying. Note also that optimizing the allocator has no thread safety implications. Moral: Don't guess. Measure. Only optimize after measuring where your bottlenecks really are. (Of course, you could always choose to perform both optimizations, but development time is never infinite; which one gives you more bang for the buck here?)
o Again, this test harness was single-threaded. You pay for the overhead whenever you use a library built for possible multithreaded use, whether you happen to need it or not.
For more detailed measurements and discussion, see my Guru of the Week links. One of the other points discussed there is that for compilers with reasonably efficient default allocators, especially for large strings, COW's advantage is not that it avoids the cost of the memory allocations, but rather that it avoids the cost of copying the characters in the string--a result you might find surprising.
A Real-World Example: C++'s basic_string
You may be wondering whether these false optimizations really affect you. The chances are pretty good that they do. As a common example, if you use an OO language with a string class, you might well be affected.
Historically, COW has long been a popular optimization in many libraries for all sorts of data structures, including strings. Because of this history, the Standard C++ basic_string template was deliberately specified in such a way that vendors would have the option of producing a conforming implementation that used COW. Perhaps unsurprisingly, all popular implementations that I'm familiar with have done so, even though COW incurs some of its own overhead, is harder to code, and is a frequent source of subtle bugs and interactions.
Because at PeerDirect we write engines that maintain large and widely distributed systems, everything we write has to be multithreaded--it's the most effective way to get the scalability and parallelization you need when dealing with scores of concurrent communications sessions feeding up to terabyte-sized databases. Imagine our surprise a few years ago when we brought in a certain vendor's implementation of the Standard C++ library and tested it in a single-threaded environment, then rebuilt the library with the multithreaded switch set only to discover that many of our programs ran several times more slowly than before. "What could be the problem?" we wondered, scratching our heads. "It looked okay in the test harness. We just made some changes in our own code too; could it be that? Are we blocking on some high-level critical sections more than we thought? Did we introduce an inefficiency somewhere?"
A little investigative work with a profiler isolated the problem, but only deepened the mystery: Our program was spending most of its execution time just acquiring and releasing critical section locks, far more so than could be reasonably expected from the locking performed by our own code. So where was all the locking coming from?
Curious, we investigated further and finally unearthed the problem: A few minutes of digging through the library's source showed that this vendor, like many others, had used the COW optimization in its implementation of basic_string. Naturally enough, we were using standard C++ strings widely instead of C-style char*'s, and nearly every string operation was causing a lock to be acquired and released.
We took a few minutes to whip up a ten-line sample program that did some reasonable string manipulation in a timed loop, and it ran 34 times slower under the multithreaded build of that library than it did under the single-threaded build. We reported the problem, but today that library still uses COW--and an unnecessarily inefficient implementation of COW at that. The vendor's name doesn't matter, because to my knowledge all of the most popular implementations of basic_string on Windows and other platforms use COW. Not all are this inefficient, but some are.
Are You Affected?
If you're using libraries built for possible multithreaded use and performance matters to you, you can and should do the same thing we did: Run profiles of your code to determine the data structures you use most (do this by member function hit count, not necessarily by time spent in the function). Create sample programs that manipulate those data structures in a loop, and time those programs running with single- and multithreaded versions of the corresponding libraries. If there's a substantial difference, tell your vendor, especially if you can tell from the size of the difference whether it's even less efficient than it needs to be.
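A timing loop for the string case above can be as simple as the following sketch; the iteration count and the copy/modify mix are arbitrary choices, and the function name is mine:

```cpp
#include <ctime>
#include <string>

// Time a burst of string copy-then-maybe-modify traffic.
// Returns elapsed CPU time in milliseconds.
double TimeCopies( int n ) {
    std::string src( 50, 'x' );           // a 50-char source string

    std::clock_t start = std::clock();
    for( int i = 0; i < n; ++i ) {
        std::string copy = src;           // copy...
        if( i % 3 != 0 ) copy += '!';     // ...and usually modify it once
    }
    return double( std::clock() - start ) * 1000.0 / CLOCKS_PER_SEC;
}
```

Build this once against the single-threaded version of the library and once against the multithreaded version, run both, and compare the reported times.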
Library implementers are generally nice and reasonable people. Certainly all the ones that I've met personally are. It's their business to provide the highest-quality code to their customers, and if there's a competitive advantage they can gain over other library vendors, they're usually happy to hear about it and provide it, and getting rid of COW implementations falls into the "good optimizations" category in most cases (and general-purpose libraries care most about "most cases").
So why are COW and similar "optimizations" so prevalent? Probably the biggest reason is that, despite the volume of trade press about multithreaded code, it's far from clear how much we as an industry are really writing multithreaded business applications. If it were to become clear to library vendors that a significant portion of their customers were using multithreaded builds of their libraries, the vendors could be expected to abandon possible false optimizations like copy-on-write.
The performance penalty that COW exacts in a multithreaded program can be startling. Unfortunately, COW is used routinely in library implementations--not just for strings, but for many kinds of data structures that are expensive to copy. For small strings, the main cost that COW tries to avoid is usually just the memory allocation; for larger strings, the copying of the characters can dominate the deep-copy time. For other structures, such as a generic container like the standard vector, the cost of copying the contained objects can easily dominate the total cost even for small collections, because you can't just memcpy the contents of a vector<T>; you have to copy each contained T object, which costs at minimum one function call (the T copy constructor, if it's not inlined) and can cost much more than that for T objects that have complex internal representations. Vendors may choose to implement COW for containers and other structures, not just strings, although I've most often seen COW used for strings rather than other things.
And remember that the problem isn't just COW. Any shared-information optimization that joins two objects in some way "under the covers" will have the same problems.
There's nothing wrong with optimizing complex data structures. The bad news is that some very common optimization methods actually aren't always optimizations at all, and can exact performance penalties that will get continually more noticeable as more of us write multithreaded programs or share libraries with others who do--especially when the libraries are written needlessly inefficiently because of inexperience with thread safety issues.
The good news is that there are usually other optimizations that can give the same effect but that don't have performance impacts in multithreaded mode. For example, in C++, providing optimized custom allocators via an overloaded operator new is nearly always the right way to deal with memory allocation performance issues.
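For example, a class can take over its own allocation with a simple free list via a class-specific overloaded operator new. The sketch below is illustrative only (PooledBuf and its members are my invention, not any vendor's code), and a production pool would itself need thread safety; the key point is that the synchronization cost is paid only on allocation, not on every string operation:

```cpp
#include <cstddef>
#include <new>

// Sketch: a class-specific free-list allocator via overloaded operator new.
class PooledBuf {
public:
    void* operator new( std::size_t size ) {
        if( size == sizeof(PooledBuf) && freeList_ ) {
            Node* n = freeList_;          // reuse a previously freed block
            freeList_ = n->next;
            return n;
        }
        return ::operator new( size );    // otherwise use the default allocator
    }
    void operator delete( void* p, std::size_t size ) {
        if( p && size == sizeof(PooledBuf) ) {
            Node* n = static_cast<Node*>( p );
            n->next = freeList_;          // push the block onto the free list
            freeList_ = n;
        } else {
            ::operator delete( p );
        }
    }
private:
    struct Node { Node* next; };
    static Node* freeList_;
    char payload_[32];                    // placeholder for the real members
};
PooledBuf::Node* PooledBuf::freeList_ = 0;
```

After the first allocation/deallocation cycle, subsequent allocations of the same size become a couple of pointer operations instead of a general-purpose heap call.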
The trick is to use a true optimization, not a false optimization that (often unintentionally) optimizes library code only for single-threaded use.
1. Even just interrupting a thread between its SetError call and the following cout statement could affect the ordering of the output (though not the contents). Exercise for the reader: What are the thread-safety issues relating to cout itself, both within the standard I/O subsystem and within the client code? Consider first the ordering of partial output.
5. Certain popular commercial libraries shipping today have variants of precisely this COW bug, where modifying one visible string also incorrectly modifies a different visible string because the libraries fail to perform the deep copy soon enough.
6. COW advocates may complain that the difference between COW and non-COW is most noticeable with possibly-mutating functions like operator[]. That's true, but this is the easiest function for showing a quick and clear example of the implementation differences between non-COW and various flavours of COW.
8. Getting the COW code right is more complex than you might think. The most experienced library writers I know have told me repeatedly that they've found their COW code to be a frequent source of bugs. These are people with decades of experience crafting quality libraries, not first-timers.