Sep 212013

During the last weeks I spend some time in optimizing a C++ string class that we use in our application. You may ask, why one should spend time today in writing ones own string class. There are already so many available for C++ like std::string and QString.

Many years ago we decided to implement because of different reasons our own string class in C++. One reason was that we have to handle very often UIDs in our domain – medical applications. They have a maximum length of 64 chars.  The Visual Studio’s STL implementation is highly optimized for strings with a max length of 16 bytes. So we decided that our class should have a chunk of 128 bytes to avoid unnecessary additional heap allocations.

I don’t put my focus in this article on all details of these methods. So don’t blame me please if the code is not complete. My only focus goes on the part which I optimized. So these methods are simplifications on certain parts.

Assignment Operator

This is the original implementation

The problem with memcpy on such a small chunk is that it is a generic function. With the knowledge that the chunk has a size of a multiple of 16 bytes and that our application has the general requirement of SSE3 supported CPUs I developed the following optimization:

So the magic starts at line 13. My idea is to use XMM registers to copy 16 bytes in a chunk. So for simplification not the exact number of chars are copied, but up to 15 bytes too much.
Inside the loop 16 bytes are copied into an XMM register and then copied back to the new memory location.

Equal Operator

In line 17/18 I define a variable of the size of an XMM and I load the content from both chars into an XMM register. Then I compare each of the loaded bytes and in line 23 I get a bitmask of all equal bytes. In case that any byte is different than one of the other string, then the bitmask would be un-equal to 0xFFFF and we can skip any further inspection. Both strings are not equal. So we increment the loop and load the next 16 bytes into the XMM register. Finally I have to compare the remaining number of bytes. If the length of the string was a multiple of 16, then we are done and both strings were equal. In case that there there up to 15 bytes more, I make the same comparison again, line 48, and create a bit mask of difference of the rest of the string. The XMM comparison of equalness does not know that after the trailing ‘\0’ the string has already ended and so it might detect different char. So finally I compare the calculated bitmask from line 50 with the help of a pre defined mask, line 51, and check for equalness.

Smaller Operator

The original implementation of the smaller operator used the ::strcmp function.

I wanted to see if it possible to improve the performance by using SSE 4.2 instructions. So I have written a  UnitTest with the GoogleTest framework to ensure the correctness of the new implementation and measure possible improvements.

I wanted to use here is the pcmpistri instruction. It can compare 16 chars in parallel and can detect the terminating zero. So there are three possibilities to do it: 1) An external assembly, 2) inline assembler or 3) intrinsics. An external assembly has the advantage that one can use different versions for an x86 and an x64 build. But it would introduce an additional function call. Inline assembler would avoid the additional function call but would not work for an x64 build (at least for the Visual Studio 20xx compiler family). So I decided to go for intrinsics. A way of optimizations I have never used before. In the past I always used inline assembler.

So here is my result:

At the beginning in line 11 I check if the current CPU supports the enhanced instruction set. At the end of this article is an example implementation of such a check.

At line 13 and 14 I create local pointer to both strings with an initial offset of -16. At the begin of the loop this offset is added back so that the comparison starts at the correct address. Increasing the offset at the beginning eases the end of the loop.

At 15, result is used for the returning value of the pcmpistri/_mm_cmpistri instruction. This instruction sets as well the Carry Flag and Zero Flag. These are then later stored into eFlags (16).

Two mask flags are defined at line 17 and 18 for later checks.

At line 21 and 22 the string pointers are increased by 16 chars.

16 chars are copied to an xmm register at line 23 from the content of this->_value. At line 24 the same is done for other._value.

At line 25 the actual comparisons takes place. With the 3rd parameter of the instruction, here 0x18, it is defined how the instruction should work. See the referenced documentation for further details. Originally pcmpistri stores its result in the ecx register. But the intrinsic instruction returns it into the given result variable. (Initially I used the _mm_cmpistrc instruction and wondered why I did not get my expected results.)

Then it is necessary to store the status flags at line 27, because I have to jump depending of the content of the Carry and Zero Flag. (I searched for an intrinsic that performs a jump based like the assembly instruction ja to a defined label. But I did not find it. If there is such a way with intrinsics I would appreciate a comment.)

So I continue with the loop until neither the Carry flag nor the Zero flag is set (line 28).

After the loop either both strings are really equal or in the last checked 16 chars is a difference. So if the Zero flag is not set then both strings are equal and the operator must return false.

Nearly at the end we use the result – it contains the index of the 1st different char. So we add it to the string pointer (lines 36, 36) and finally compare those chars (line 38).

Here are the results for the performance improvements. Each measurements was repeated 10 times. Each instruction was executed 10,000,000 times. (The vertical scale is in ms.) Because the IntelCore 2 Duo CPU does not support the SSE4.2 instruction set, there is no measurement for the smaller operator.



So the results speak for themselves 😉


Here is the announced function that checks if the local CPU supports the SSE42 instruction set.

  2 Responses to “Optimizing several C++ string operators with intrinsics”

  1. Avatar

    Hi Felix,

    thats a nice article about an optimized string class and a very good example why it is a good idea to implement a specialized string class. The general purpose ones are like SUVs, there get you (almost) everywhere but they will certainly lose against a Formula 1 car on a race track.

    There are two things I noticed immediatly though:

    a.) have you tried to align the class?

    Eventhough the Intel Core i3/5/7 can handle unaligned memory access very well (unlike its predecessors) there is still the possiblity of cache line breaks.

    __declspec(align(16)) class MyString should do the trick. This will allow you to use the aligned SSE load and store instructions.

    Aligning to a cache line (32) may also be worth trying but bear in mind that this may cause cache bloat depending on the number and size of the string objects.

    b.) I dont like the supportsSSE42() function all that much.

    Yes, I am aware that todays CPUs can do such a check almost for free (given speculative and out-of-order execution) the problem is that CPUID is one of the serializations instructions that specifictly prohibit out-of-order execution.

    It is also unlikly that the support for SSE 4.2 instructions varies during runtime ;-).

    If you are using the Intel compiler (or clang or gcc) you could use the CPU dispatch functionality to provide differend implementations for different CPU types which will then be chosen by the CRT at runtime. Using the MS compiler you could fallback to a strategy style approach and determine the CPU type at the program start and insert the appropriate strategy into the string class.

    As for your actual implementation, due to a lack of time I have not tried it yet but I will play with it and get back to you.


    • Felix Petriconi

      Hi Lutz,

      the supportsSSE42() is not the best way, I know. In a later version I will check for a CPU dispatch functionality. So far Visual Studio does not support it, but I think there should be ways to do it manually in the static initialization phase before main.
      I have not tried so far a member alignment.
      Within the next days I have to bring my assembly code of the ShearWarp to intrinsics, because we go finally to 64bit 😉


Sorry, the comment form is closed at this time.