Productive Waste of Time: Gradients and Altivec

I was writing some gradient code when it occurred to me that the gradient function could be vectorized. I was applying the same operation in a loop to two sets of 4 floats (the red, green, blue and alpha values that need to be calculated for the gradient), and the 4 floats fit perfectly into a 128-bit vector. My reason for doing this was not any bottleneck observed in Shark; it was just a flimsy excuse to play with Altivec for the first time. Nonetheless, if there was a noticeable performance increase, all the better.
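
To make that concrete, the scalar version is shaped roughly like this. It's a sketch rather than my actual code: the function name and the info layout are made up for illustration, but the per-component loop is the idea.

    // Rough sketch of a scalar gradient callback (hypothetical names).
    // in[0] is the position along the gradient (0.0 to 1.0); out receives
    // the interpolated R, G, B, A components.
    static void evaluateGradient(void *info, const float *in, float *out)
    {
        const float *start = ((const float **)info)[0];  // start color: R, G, B, A
        const float *end   = ((const float **)info)[1];  // end color:   R, G, B, A
        float t = in[0];
        int i;

        for (i = 0; i < 4; i++)
            out[i] = start[i] + t * (end[i] - start[i]);
    }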

In between a wedding, entertaining friends who were staying with me, and getting a beta release of Hazel out the door, I read up a little on Altivec and cooked up a small test program. You can download the source below. It's PowerPC with Altivec only; I was sloppy with the #ifdefs, but compiling the test without Altivec support is a bit pointless. I don't have an Intel machine, so I didn't write an SSE version. For you Intel readers at home, feel free to add the appropriate SSE code (and let me know what you come up with). You'll find my guess at what the SSE version is supposed to look like commented out in the code (but don't trust it; I don't exactly know what I'm doing here).
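
For reference, my guess at the SSE equivalent of the core computation goes something like this. It's untested guesswork typed from documentation; interpolateSSE and the argument names are made up, and I have no machine to try it on.

    #include <xmmintrin.h>

    // Untested guess: out = start + t * (end - start), four components at once.
    static void interpolateSSE(const float *start, const float *end,
                               float t, float *out)
    {
        __m128 vstart = _mm_loadu_ps(start);   // unaligned load of start RGBA
        __m128 vend   = _mm_loadu_ps(end);     // unaligned load of end RGBA
        __m128 vt     = _mm_set1_ps(t);        // splat t into all four lanes
        __m128 vout   = _mm_add_ps(vstart,
                            _mm_mul_ps(vt, _mm_sub_ps(vend, vstart)));
        _mm_storeu_ps(out, vout);              // unaligned store of the result
    }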

What does the test show? Well, at least on the G4 and G5 machines I have available, the scalar and Altivec versions perform roughly the same.

Of course, there are two ways to interpret this:

1. My test sucks
(obligatory remarks: “I didn’t know they taught programming at clown college.”, “My [insert feeble relative] can code better than you”)

A very likely possibility. This is my first foray into Altivec, and I only started delving into Core Graphics gradient functions last week. As far as my Altivec code goes, I am still quite fuzzy on whether I should have used something like vec_ld or vec_splat to turn the multiplier into a vector, though I’m pretty sure that using the float[]/vFloat union fulfills the byte alignment requirements (though it’s possible that it’s slower). I’m guessing the difference is between keeping the value in a register and round-tripping it through memory. If any SIMD experts could educate me on this, I’d appreciate it.
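
For the record, the two approaches I’m weighing look roughly like this. The names (FloatVector, splatViaUnion, splatViaLoad) are made up for illustration; FloatVector stands in for the float[]/vFloat union in the actual test.

    #include <altivec.h>

    typedef union {
        float        scalars[4];
        vector float vec;        // the vector member forces 16-byte alignment
    } FloatVector;

    // The union route I took: write the scalar into memory four times,
    // then read the whole thing back as a vector.
    static vector float splatViaUnion(float multiplier)
    {
        FloatVector m;
        m.scalars[0] = m.scalars[1] = m.scalars[2] = m.scalars[3] = multiplier;
        return m.vec;            // round-trips through memory
    }

    // The vec_ld/vec_splat route: one aligned load, then replicate element 0
    // across all four lanes, instead of four scalar stores and a reload.
    static vector float splatViaLoad(const FloatVector *m)
    {
        vector float v = vec_ld(0, m->scalars);
        return vec_splat(v, 0);
    }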

2. Vectorization is not effective in this scenario

I’m sure those much more experienced with Altivec can explain this without resorting to a tactic that we like to call “making stuff up”. Me, I am going to make stuff up. So here goes: I’m only computing one vector at a time, and I’m doing all these scalar-to-vector (and back) conversions. My guess is that the overhead outweighs the reduced instruction count in the computation. If the Core Graphics functions were structured such that they gave you all the values at once (or at least in chunks) instead of at each step of the gradient, then I could see an argument for vectorization here.
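
To put some shape on that, each invocation in the vectorized version ends up doing something like the following (again a sketch with made-up names): the actual arithmetic is a single vec_madd, bracketed by getting the scalar into a vector and the result back out again.

    #include <altivec.h>

    typedef union {
        float        scalars[4];
        vector float vec;
    } FloatVector;

    // One invocation's worth of work: out = start + t * delta, where
    // delta is precomputed as (end - start).
    static void evaluateGradientVector(const FloatVector *start,
                                       const FloatVector *delta,
                                       float t, float *out)
    {
        FloatVector multiplier, result;
        int i;

        // scalar -> vector: splat t by writing it out four times
        multiplier.scalars[0] = multiplier.scalars[1] =
        multiplier.scalars[2] = multiplier.scalars[3] = t;

        // the actual computation: delta * t + start, all four lanes at once
        result.vec = vec_madd(delta->vec, multiplier.vec, start->vec);

        // vector -> scalars: copy back out for the caller to consume
        for (i = 0; i < 4; i++)
            out[i] = result.scalars[i];
    }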

So for now the result is inconclusive until I can get someone who knows what they’re doing to either verify or refute my test. Try it for yourself and let me know (especially if your results differ from mine). And please point out any deficiencies in my implementation (though comments on coding style can go to /dev/hell).

GradientTest (PowerPC with Altivec only)

Category: Cocoa, Downloads, OS X, Programming, Quartz

2 Responses to “Productive Waste of Time: Gradients and Altivec”

  1. alexr

    The faster way to do this function would be to either know that in is aligned and do one load, or do two loads and a vec_perm(vec1, vec2, vec_lvsl(0, in)) to get the misaligned data. Loading four floats and storing them to memory, then reloading as a vector, is slower.
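
    In other words, something along these lines (sketched from memory, so treat it as an illustration rather than drop-in code; loadUnaligned is just a name for it):

        #include <altivec.h>

        // Load 4 floats that may not be 16-byte aligned, without the
        // store-then-reload round trip.
        static vector float loadUnaligned(const float *in)
        {
            vector float vec1 = vec_ld(0, in);             // aligned block holding in[0]
            vector float vec2 = vec_ld(15, in);            // aligned block holding in[3]
            vector unsigned char mask = vec_lvsl(0, in);   // permute mask from in's alignment
            return vec_perm(vec1, vec2, mask);             // stitch the four floats together
        }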

    You need to “Show Assembly Code” from Xcode to see that the compiler is generating the instructions you expect. Pointer aliasing and other extra memory operations become evident this way.

    As you’ve noted, the real problem here is that the overhead of the callback mechanism swamps the work done in the function. If you can be 100% sure that CG calls this function back in the same thread as you, you could compile the function to not update VRSAVE in the prolog/epilog and then set VRSAVE to 0xFFFFFFFF before calling CG to run the function. That would save a few more instructions, but the overhead of the callbacks is probably still too high.

  2. mr_noodle

    Thanks for the very informative comment. So it seems I was right on both counts: my test was sub-optimal, but in the end it probably doesn’t matter. I probably won’t be spending much more time on this (the VRSAVE suggestion is beyond the scope of my knowledge on the subject at this point), but it’s always good when I can learn from my failed experiments.

