GotW #78

Home Blog Talks Books & Articles Training & Consulting

On the
blog
RSS feed November 4: Other Concurrency Sessions at PDC
November 3
: PDC'09: Tutorial & Panel
October 26: Hoare on Testing
October 23
: Deprecating export Considered for ISO C++0x

This is the original GotW problem and solution substantially as posted to Usenet. See the book Exceptional C++ Style (Addison-Wesley, 2004) for the most current solution to this GotW issue. The solutions in the book have been revised and expanded since their initial appearance in GotW. The book versions also incorporate corrections, new material, and conformance to the final ANSI/ISO C++ standard (1998) and its Technical Corrigendum (2003).

Operators, Operators Everywhere
Difficulty: 4 / 10

How many operators can you put together, when you really put your mind to it? This issue takes a break from production coding to get some fun C++ exercise.

Problem

JG Question

1. What is the greatest number of plus signs (+) that can appear consecutively, without whitespace, in a valid C++ program?

Note: Of course, plus signs in comments, preprocessor directives and macros, and literals don't count. That would be too easy.

Guru Question

2. Similarly, what is the greatest number of each the following characters that can appear consecutively, without whitespace, outside comments in a valid C++ program?

a) &

b) <

c) |

For example, for (a), the code "if( a && b )" trivially demonstrates two consecutive & characters in a valid C++ program.

Solution

Who Is Max Munch, and What's He Doing In My C++ Compiler?

Max Munch is the only C++ standards committee member to have a perfect attendance record at all meetings from the very beginning to date. (If you don't believe me, check the attendance list in each meeting's minutes.)

More seriously, the "max munch" rule says that, when interpreting the characters of source code into tokens, the compiler does it greedily -- it makes the longest tokens possible. Therefore >> is always interpreted as a single token, the stream extraction (right-shift) operator, and never as two individual > tokens even when the characters appear in a situation like this:

  template<class T = X<Y>> ...

That's why such code has to be written with an extra space, as:

  template<class T = X<Y> > ...

Similarly, >>> is always interpreted as >> >, never as > >>, and so on.

Some Fun With Operators

1. What is the greatest number of plus signs (+) that can appear consecutively, without whitespace, in a valid C++ program?

It is possible to create a source file containing arbitrarily long sequences of consecutive + characters, up to a compiler's limits (such as the maximum source file line length the compiler can handle).

If the sequence has an even number of + characters, it will be interpreted as ++ ++ ++ ++ ... ++, a sequence of binary ++ tokens. To make this work, and have well-defined semantics because of sequence points, all we need is a class with a user-defined prefix ++ operator that allows chaining. For example:

  // Example 1(a)
  //
  class A
  {
  public:
    A& operator++() { return *this; }
  };

Now we can write code like:

  A a;
  ++++++a; // meaning: ++ ++ ++ a;

which works out to:

  a.operator++().operator++().operator++()

What if the sequence has an odd number of + characters? Then it will be interpreted as ++ ++ ++ ++ ... ++ +, a series of binary ++ tokens ending with a final unary +. To make this work, we just need to additionally provide a unary + operator. For example:

  // Example 1(b)
  //
  class A
  {
  public:
    A& operator+ () { return *this; }
    A& operator++() { return *this; }
  };

Now we can write code like:

  A a;
  +++++++a; // meaning: ++ ++ ++ + a;

which works out to:

  a.operator+().operator++().operator++().operator++()

This trick is fairly simple. Creating longer-than-usual sequences of other characters turns out to be a little more challenging, but still possible.

Abuses of Operators

The code in Examples 1(a) and 1(b) don't especially abuse the ++ and + operators' usual semantics. What we're going to do next, however, really goes beyond anything you'd ever want to see in production code; this is for fun only.

2. Similarly, what is the greatest number of each the following characters that can appear consecutively, without whitespace, outside comments in a valid C++ program?

For this question, we'll create and use the following helper class:

  // Example 2
  //
  class A
  {
  public:
    void operator&&( int ) { }
    void operator<<( int ) { }
    void operator||( int ) { }
  };

  typedef void (A::*F)(int);

a) &

Answer: Five.

Well, && is easy and &&& not too much harder, so let's go right to the next level: Can we create a series of four &'s, namely &&&&? Well, if we did, they would be interpreted as "&& &&", but expressions like "a && && b" are syntactically illegal; we can't have two binary && operators immediately after each other. The trick is to see that we can use the second && as an operator, and make the first && come out as the end of something that's not an operator. With that in mind, it doesn't take too long to see that the first && could be the end of a name, specifically the name of a function, and so all we need is an operator&&() that can accept a pointer to some other operator&&() as its first parameter:

  void operator&&( F, A ) { }

This lets us write:

  &A::operator&&&&a; // && &&

which means:

  operator&&(&A::operator&&, a);

That's the longest even-numbered run of &'s we can make, because &&&&&& has to be illegal. Why? Because it would mean && && &&, and even after making the first && part of a name again, we can't make the final && the beginning of another name, so we're left with two binary && operators in a row which is illegal.

But can we squeeze in one more & by going to an odd number of &'s? Yes, indeed. Specifically, &&&&& means && && &, we already have a solution for the first part, and with not much thought it's easy to tack on a unary &:

  &A::operator&&&&&a; // && && &

which uses the builtin operator&&() that can take pointers:

  operator&&(&A::operator&&, &a);


b) <

c) |

Answer for both: Four.

Having seen the solution to 2(a), this one should be easy. To make a series of four, we just use the same trick as before, defining a:

  void operator<<( F, A ) { }
  void operator||( F, A ) { }

which lets us write:

  &A::operator<<<<a; // << <<
  &A::operator||||a; // || ||

which means:

  operator<<(&A::operator<<, a);
  operator||(&A::operator||, a);

That's the longest even-numbered runs we can make, because <<<<<< and |||||| have to be illegal, just as we've already noted that &&&&&& has to be illegal. But this time we can't manage an extra < or | to make a series of five, because there is no unary < or | operator.

Bonus Challenge Question

Here's a bonus question: How many question mark (?) characters can appear in a row in a valid C++ program?

Think about the question before reading on. It's a lot harder than it looks.

* * * * *

Do you have an answer?

You might think that the answer must be one, because the only legitimate token in the C++ language that includes a ? is the ternary ?: operator. It's true that that's the only legitimate language feature that includes a ?, but "there are more things in the translator and preprocessor, Horatio, than are dreamt of in your language syntax rules..." In particular, there's more to C++ than just the language, or even just the preprocessor.

For ?, the correct answer is three. For example:

  1???-0:0;

This question is harder than the others in part because it's the only one that doesn't follow the maximal munch rule. The three question marks, ???, are not interpreted as the tokens ?? and ?. Why not? Because ??- happens to be a trigraph, and trigraphs are replaced very early during source code processing -- before tokenization begins, even before preprocessor instructions are handled. If you haven't heard about trigraphs before, don't worry; that just means that you don't use an exotic foreign-language keyboard. A trigraph is an alternate way to spell certain unusual source code characters (specifically #, \, ^, [, ], |, {, }, and ~), provided for the benefit of programmers whose keyboards don't happen to have a key for that character.

In this case, long before any tokenization can occur the trigraph ??- is replaced with ~, the one's-complement operator. Therefore the statement above becomes:

  1?~0:0;

which is tokenized as:

  1 ? ~ 0 : 0 ;

and means:

  1 ? (~0) : 0 ;

Trigraphs are a feature inherited from C, and they really are rare in practice. Just to give you an idea of how rare, note that as of this writing several compilers I know of do not have trigraph support turned on by default, and one of those documents it as "enables the undesirable and rarely used ANSI trigraph feature." At least one compiler provides a separate filter to process trigraphs before running the code through the compiler.

Copyright © 2009 Herb Sutter