4 Jan 2000 (a long time ago)

Note: I wrote this a long time ago, using compilers that are now a decade old. At the time, the general opinion was that C++ was “much slower” than C, and I wrote this to point out an example of how C++ can be faster in some cases. It seems that the general opinion has changed in the last decade, and people believe that C++ is roughly on par with, or faster than C.

There is a tradeoff between writing special-case code yourself, calling a special-case library routine, and calling a general-case library routine. I would like to have:

As you can see, none of the solutions gives me all three. Given any one goal, there is a corresponding best solution. Given any one solution, I can only get two out of three goals. In this document I present a comparison of sorting in C and C++, and show that with C++ STL, you can get all three.

Tradeoffs in C

I would like to sort an array of numbers in C. I have ten minutes to do this. There are more important things I have to do today than to sort numbers. My choices are to use the qsort built-in library routine, to find a sorting routine that sorts whichever type of numbers I am dealing with (and such a routine may not always be available), or to implement a sorting algorithm myself. A program that uses qsort requires not only the call to the sorting routine, but the definition of a comparison function for numbers. To use a special-case library routine, it’s likely that I have to go find the routine somewhere on the net, but I don’t have to specify the data type or comparison function. A program that has its own sort routine does not need a comparison function; it can use the built-in < operator.

Code size / coding time:

The number of lines required to sort using qsort was 1 for the call to qsort and 10 for the comparison function. In comparison, writing my own quicksort routine required 23 lines of code. Calling a special-case library routine only required 1 line of code.

The program that used qsort required little coding time and no debugging time. My own quicksort routine on the other hand took some time to code and much time to debug (mostly because it was written from memory, without an algorithms book as a reference). The special-case library, like the general-case library routine qsort, had already been debugged.

The special-case library routine is the winner here.

Running time:

The hand-written sort ran between 2.9 times as fast (for floats) to 7.9 times as fast (for bytes). The special-case library routine ran only slightly faster than the hand-written sort.

The special-case library routine and hand-written code are both winners here.

Flexibility:

The qsort routine can be reused for other data types, or for other sorting orders. The hand-written sort only works on one data type and sorting order, but we can just copy the code and make another sort routine that works for another data type or sorting order. (It’s still better not to have to copy and paste.) The special-case library routine does not work on other data types or sorting orders.

The general-case library routine is the winner here.

As expected, there is a tradeoff here. You can have two of: ease of coding, speed, and flexibility. The general-case library routine was flexible and easy to use, but ran slower; the hand written routine was hard to code and somewhat flexible, but ran faster; the special-case library routine was fast and easy to use, but inflexible.

Another Possibility: C++ Templates

In C++, the standard template library (STL) provides a sort routine. I tested STL sort as an alternative to the three options available in C.

Code size / coding time:

With STL, there is no need to define a comparison function, since STL can take advantage of C++ operator overloading. There is only 1 line of code to sort an array.

STL’s solution matches the best solution (special-case library functions) in C, in terms of coding time.

Running time:

As I expected, STL’s sort ran faster than C’s qsort, because C++’s templates generate optimized code for a particular data type and a particular comparison function. STL’s sort also ran faster than the hand-coded quicksort routine, and it ran faster than the special-case library routine. (However, this may simply be unique to sorting, and may not extend to other algorithms.)

STL’s solution exceeds the best solutions (special-case library functions or my hand-written code) in C, in terms of execution speed.

Flexibility:

STL’s sort works for other data types and other sorting orders. (It works for different data containers as well -- C arrays, C++ vectors, C++ deques, and other containers that can be written by the user. This kind of flexibility is rather difficult to achieve in C.)

STL’s solution exceeds the best solution (general-case library functions) in C, in terms of flexibility.

STL’s sort retains the advantages of a specialized library routine: it is optimized for a particular data type and comparison function, so it runs fast. At the same time, it retains the advantages of using a general-purpose library routine: it works for any data type, and works with different comparison functions.

STL’s solutions can do better than the C library solutions because templates allow C++ code to learn more from their environment than functions do in C code. STL’s sort runs 20% to 50% faster than the hand-coded quicksort or the C special-case library function (and 250% to 1000% faster than the C general-case library function). STL has optimized algorithms that I could write, if I had the time and desire to read research papers in journals about the state of the art in sorting algorithms.* However, I don’t have a lot of time, so it is likely that if I were forced to write a sorting algorithm, I would end up writing insertion sort or (if running time was important) quicksort, and my own quicksort is unlikely to be as fast as the one included with STL.

* SGI’s STL is using introsort, a combination of quicksort (used when the subarrays are large and the recursion is shallow), heapsort (used when the recursion is deep), and insertion sort (used when the subarrays are small).

In another test, between sorting C arrays and instances of the C++ vector class, I found that there was no real difference. Running time (at least for sort) isn’t a factor in the choice between C arrays and C++ vectors.

Template functions save both development time and run time. They retain the flexibility of general-case library routines. No longer do we have to make this tradeoff. We can get better algorithms, a good implementation, less coding time, and fewer bugs.

Appendix: Running Times

Data
Type
C
(library)
C
(hand-coded)
Numerical
Recipes
Ratio
1
C++
(C array)
C++
(vector class)
Ratio
2
int5.90-5.921.54-1.651.46-1.503.71.12-1.161.11-1.145.3
short9.03-9.031.73-1.801.58-1.59*5.11.17-1.201.17-1.197.7
byte7.87-7.890.98-1.020.98-1.007.90.70-0.730.70-0.7311.0
float7.08-7.102.38-2.502.48-2.552.91.96-2.021.97-2.023.6
double16.42-16.452.70-2.932.72-2.835.82.28-2.352.28-2.377.1

These tests were performed on a 200Mhz Intel Pentium-MMX running RedHat Linux. The programs were compiled using egcs 1.0.2, with optimization level 6 plus loop unrolling. (-O6 -funroll-all-loops) (I also got similar results on an 1.5Ghz Intel Core on a Mac Mini, using gcc 4.0.1.) Each program was tested on ten different data sets (the same ten data sets were used for each program) and the lowest and highest times were reported in the table. The times shown are running times, in seconds, to sort one million numbers.

Another result that may be of interest is that the performance of the general-case library routine did not vary much with different data sets, while that of the hand-coded sorting routine varied as much as 10% from one data set to another.

Ryan Haksi suggested that I test the code from Numerical Recipes, which is a source for decent, efficient implementations of many (mostly numerical) algorithms. I used this as my “special-case library routine” code. (It is written to work on arrays of floating point numbers, using only the default comparison function, so I think it qualifies as a special-case library.) Unlike my own quicksort code, the code from this book uses several optimizations: it uses insertion sort for small subarrays; it completely eliminates recursion; and it uses a “median of three” pivot (which speeds up the partitioning process).

Krzysztof Bosak points out that STL sorting functions are not the fastest possible sorting functions. If you work hard, you can write faster ones. He has written his own template sorting functions that are faster than SGI STL’s sorting functions on the tests I ran, at least on some machines with some compilers. He started with the Numerical Recipes code, made the stack array static (note that this means you cannot use this code in a multithreaded program), inlined the functions, and made declarations more local. His sorting code is 40% slower than SGI STL sort on my Pentium 1 MMX with egcs 2.91.66, but it is 3% faster on a Pentium III with egcs 2.91.66, and it is a tiny bit slower at high optimization levels on a Pentium III with g++ 2.95.2. Whether they’re faster or slower for you will depend on your compiler and CPU. You can find these functions on his C++ library web page.

Doug Moore sent me a string sorting function that’s faster than STL sort. Again, I think this shows that if you work hard, you can get something faster than SGI STL’s sorting functions.

Andrew Robb, using a different system, was unable to reproduce the results; on his system, qsort was faster than std::sort. He also has a merge sort routine that only works with arrays (not STL vector and deque), but sorts faster than std::sort. This also shows that if you work hard, you can get something faster than STL.

Much of the difference here is because C++ template code is typically inlined and expanded out per type. Is the same possible in C? Yes, although it’s not the most common way to write C code.

Conclusions

I can’t claim that C++ is always faster than C. However, I do claim that C++ templates offer a way to provide library routines that offer the traditional advantages of library routines (ease of use, flexibility) yet still are able to provide speed close to hand-written code. This combination is generally unavailable to programmers using C (or Pascal, or Basic, or Java, or ...). When I programmed in C, I always had to choose between library routines and hand-written code. In C++, I choose (STL) library routines whenever possible. I usually get better (faster and more flexible) code than I would have written myself (without putting in a lot of effort). It’s refreshing to finally be able to use library routines without expecting any performance penalties.

Mini-Rant: Recursion

The code from Numerical Recipes, like many C programs, uses fixed size temporary buffers. What’s worse, the code is not dependent on only the array size (which is reasonable in many cases) but depends on the data values to be sorted. I think that for a library routine to work on some data but not others is very dangerous. (In this case it’s hard to predict when it will fail. That’s even worse!)

While I was testing the code to see how fast it ran on short integers, it failed.

The only reason the code uses a temporary array is that it’s convoluted in order to avoid recursion. Recursion isn’t really very slow when used with inherently recursive algorithms like quicksort. (See the June 1998 issue of Dr. Dobb’s Journal: the column Algorithms Alley has some benchmarks showing that recursion and iteration are not really much different for many programs.)

To avoid recursion (which didn’t seem to buy much speed), the authors of Numerical Recipes went to great lengths to manage their own stack in software.

  1. They believed that their software stack was faster than a hardware stack. If a stack implemented in hardware is slower than a stack implemented in software, the hardware designers did something wrong (in my opinion).
  2. They made their code more difficult to read, and ended up making machine code larger, because they’re reimplementing something that’s already implemented in hardware.
  3. They used a fixed-size stack which breaks on some input. This is extremely poor behavior for a library routine.

Despite all their attempts to make the code fast, they still were significantly slower than the STL code! I originally attributed this mostly to the algorithms chosen -- STL uses introsort instead of quicksort. However, Musser’s introsort benchmarks suggest that introsort is not faster than quicksort except in special cases. (The main advantage of introsort that it has a worst case time bound of O(n log n), unlike quicksort, which has a worst case bound of O(n^2). In the average case, it is no faster.)

I tried changing the code a bit, to see what would happen if I put the recursion back in. Here are the results for sorting integers:

As you can see, the results are pretty close. The (small) win came from eliminating the tail call, not from eliminating recursion altogether. (A tail call is a call after which no work remains to be done. Tail calls can be turned into gotos by standard techniques. To eliminate non-tail calls requires maintaining a stack explicitly.)

The lesson here is is: If you really need the speed, eliminate tail recursion. However, do not eliminate other forms of recursion (which require you to maintain a stack). Your software stack probably isn’t any faster than the processor’s hardware stack.

Mini-Rant: Flamewars

Most of the criticisms I see of C++ are that it’s not like the person’s favorite language. That’s true. It’s different. It’s not the ideal language for every task. If it’s not the right language for your task, you’re probably blaming the language when you should be blaming your boss for forcing you to use the wrong language for the job. Anyway, read this before you flame!

Appendix: Programs Tested

Appendix: References and more reading