# GPU versus CPU for pixel graphics

Having gained a bit of experience with GPU shader programming during my Fragmentarium development, I found a natural question to ask: how fast are these GPUs?

This is not an easy question to answer, and it depends on the specific application. But I will try to give an answer for the kind of systems that I’m interested in: pixel graphics systems, where each pixel can be calculated independently of the others, such as raytraced 3D fractals.

Let’s take my desktop computer, a fairly standard desktop machine, as an example. It is equipped with an Nvidia Geforce 9800GT GPU @ 1.5 GHz, and an Intel Core 2 Quad Q8200 @ 2.33GHz.

How many processing units are there?

Number of processing units (CPU): 4 CPU cores
Number of processing units (GPU): 112 Shader units

Based on these numbers, we might expect the GPU to be 28 times faster than the CPU (112 / 4 = 28). Of course, this totally ignores the efficiency and operating speed of the processing units. Let’s try looking at the processing power in terms of the maximum number of floating-point operations per second instead:

## Theoretical Peak Performance

Both Intel and Nvidia list the GFLOPS (billion floating point operations per second) rating for their products. Intel’s list can be found here, and Nvidia’s here. For my system, I found the following numbers:

Performance (CPU): 37.3 GFLOPS
Performance (GPU): 504 GFLOPS

Based on these numbers, we might expect the GPU to be roughly 14 times faster than the CPU (504 / 37.3 ≈ 13.5). But what do these numbers really mean, and can they be compared? It turns out that these numbers are obtained by multiplying the processor frequency by the maximum number of floating point operations per clock cycle.

For the CPU, we have four cores. Now, when Intel calculates its numbers, it does so based on the special 128-bit SSE registers found on every modern Pentium-derived CPU. These extensions make it possible to handle two double precision or four single precision floating point numbers per clock cycle. And in fact there exists a special instruction – the MAD, or Multiply-Add, instruction – which allows for two arithmetic operations per clock cycle on each element in the SSE registers. This means Intel assumes 4 (cores) x 2 (double precision floats) x 2 (MAD instructions) = 16 operations per clock cycle. This gives the theoretical peak performance stated above:

Performance (CPU): 2.33 GHz * 4 * 2 * 2 = 37.3 GFLOPS (double precision floats)

What about the GPU? Here we have 112 independent processing units. On the GPU architecture an even more benchmarking-friendly instruction exists: the MAD+MUL, which combines two multiplies and one addition in a single clock cycle. This means Nvidia assumes 112 (cores) * 3 (MAD+MUL instructions) = 336 operations per clock cycle. Combining this with a stated processing frequency of 1.5 GHz, we arrive at the number stated earlier:

Performance (GPU): 1.5 GHz * 112 * 3 = 504 GFLOPS (single precision floats)

But wait… Nvidia’s numbers are for single precision floats – my Geforce card does not even support double precision floats. So for a fair comparison we should double Intel’s number, since the SSE extensions allow four single precision numbers to be processed simultaneously instead of two double precision floats. This way we get:

Performance (CPU): 2.33 GHz * 4 * 4 * 2 = 74.6 GFLOPS (single precision floats)

Now, using this as a guideline, we would expect my GPU to be a factor of 6.8x faster than my CPU (504 / 74.6 ≈ 6.8). But we have made some pretty big assumptions here: for instance, not many CPU programmers write SSE-optimized code – and is a modern C++ compiler powerful enough to automatically take advantage of the SSE instructions anyway? And how often is the GPU able to use the three-operation MAD+MUL instruction?
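The arithmetic behind these peak numbers is easy to reproduce. The following Python sketch simply restates the three formulas above; the function name `peak_gflops` and its parameter names are my own, chosen for illustration.

```python
def peak_gflops(clock_ghz, units, lanes, ops_per_lane):
    """Theoretical peak: clock (GHz) x units x SIMD lanes x operations per lane per cycle."""
    return clock_ghz * units * lanes * ops_per_lane


# Figures quoted in the text above.
cpu_double = peak_gflops(2.33, units=4, lanes=2, ops_per_lane=2)   # SSE, double precision
cpu_single = peak_gflops(2.33, units=4, lanes=4, ops_per_lane=2)   # SSE, single precision
gpu_single = peak_gflops(1.5, units=112, lanes=1, ops_per_lane=3)  # MAD+MUL per shader unit

print(f"CPU (double): {cpu_double:.1f} GFLOPS")
print(f"CPU (single): {cpu_single:.1f} GFLOPS")
print(f"GPU (single): {gpu_single:.1f} GFLOPS")
print(f"GPU / CPU (single): {gpu_single / cpu_single:.1f}x")
```

Writing it this way makes it obvious that the ratio depends entirely on which "lanes" and "operations per lane" you are willing to count.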

## A real-world experiment

To find out, I wrote a simple 2D Mandelbrot system and benchmarked it on the CPU and GPU. This is really the kind of computational task that I’m interested in: it is trivial to parallelize and is not memory-intensive, and the majority of the executed code will be floating point arithmetic. I did not try to optimize the C++ code, because I wanted to see if the compiler was able to perform some SSE optimization for me. Here are the execution times:

13941 ms – CPU single precision (x87)
13941 ms – CPU double precision (x87)
10535 ms – CPU single precision (SSE)
11367 ms – CPU double precision (SSE)
424 ms – GPU single precision

(These numbers have some caveats – I did perform the tests multiple times and discarded the first few runs, but the CPU code was only single-threaded – so I assumed the numbers would scale perfectly and divided the execution times by four. Also, I verified by checking the generated assembly code that SSE instructions were indeed used for the core Mandelbrot loop when they were enabled.)

There are a couple of things to notice here: first, there is no difference between single and double precision for the x87 code, and only a small one for the SSE code. The former is as could be expected (since the x87 defaults to 80-bit precision anyway), but for the SSE version we would expect single precision to be twice as fast as double precision. Second, the SSE code is really not very much more efficient than the x87 code – which strongly suggests that the compiler (here Visual Studio C++ 2008) is not very good at optimizing for SSE.

So for this example we got a factor of 25x speedup by using the GPU instead of the CPU (10535 ms / 424 ms ≈ 25, using the fastest CPU result).
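For reference, here is how the speedup factors fall out of the timings listed above. This is only a small Python sketch that redoes the division; the dictionary keys are labels I made up for the five rows of the table.

```python
# Timings from the table above, in milliseconds.
# The CPU times are the single-threaded times already divided by four.
timings_ms = {
    "CPU single (x87)": 13941,
    "CPU double (x87)": 13941,
    "CPU single (SSE)": 10535,
    "CPU double (SSE)": 11367,
    "GPU single": 424,
}

gpu_ms = timings_ms["GPU single"]
speedups = {
    name: ms / gpu_ms for name, ms in timings_ms.items() if name != "GPU single"
}

for name, factor in speedups.items():
    print(f"{name}: GPU is {factor:.1f}x faster")

# How much did enabling SSE help on the CPU (single precision)?
sse_gain = timings_ms["CPU single (x87)"] / timings_ms["CPU single (SSE)"]
print(f"SSE vs x87 (single precision): {sse_gain:.2f}x")
```

The SSE gain comes out at only about a third, far from the factor of four that four-wide single precision registers could give in the ideal case.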

## “Measured” GFLOPS

Another question is how this example compares to the theoretical peak performance. By using Nvidia’s Cg SDK I was able to get the GPU assembly code. Since I could now count the number of instructions in the main loop, and I knew how many iterations were performed, I was able to calculate the actual number of floating point operations per second:

GPU: 211 (Mandel)GFLOPS
CPU: 8.4 (Mandel)GFLOPS*

(*The CPU number was obtained by assuming the number of instructions in the core loop was the same as for the GPU: in reality, the CPU disassembly showed that the simple loop was partially unrolled to more than 200 lines of very complex assembly code.)

Compared to the theoretical maximum numbers of 504 GFLOPS and 74.6 GFLOPS respectively, this shows the GPU is much closer to its theoretical limit than the CPU.
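The calculation is nothing more than operations per iteration, times iterations, divided by run time. Below is a minimal Python sketch of that bookkeeping. Note that the inputs in the `example` call are purely hypothetical placeholders (I am not reproducing the actual instruction or iteration counts here); only the 211, 8.4, 504 and 74.6 GFLOPS figures are taken from the text.

```python
def measured_gflops(flops_per_iteration, total_iterations, seconds):
    """Achieved rate in GFLOPS for a loop with a known operation count."""
    return flops_per_iteration * total_iterations / seconds / 1e9


# Hypothetical inputs, only to show the units involved.
example = measured_gflops(flops_per_iteration=10,
                          total_iterations=8_000_000_000,
                          seconds=0.5)
print(f"Example: {example:.0f} GFLOPS")

# Fraction of theoretical peak actually reached, using the numbers from the text.
gpu_fraction = 211 / 504
cpu_fraction = 8.4 / 74.6
print(f"GPU reaches {gpu_fraction:.0%} of peak, CPU reaches {cpu_fraction:.0%}")
```

So the GPU runs at roughly 40% of its rated peak, while the CPU only manages a bit more than 10%.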

## GPU Caps Viewer – OpenCL

A second test was performed using the GPU Caps Viewer. This particular application includes a 4D Quaternion Julia fractal demo in OpenCL. This is interesting since OpenCL is a heterogeneous platform – the same code can be compiled for both CPU and GPU. And since Intel just released an alpha version of their OpenCL SDK, I could compare it to Nvidia’s SDK.

The results were interesting:

Intel OpenCL: ~5 fps
Nvidia OpenCL: ~35 fps

(The FPS did vary through the animation, so these numbers are not very accurate. There was no dedicated benchmark mode.)

This suggests that Intel’s OpenCL compiler is actually able to take advantage of the SSE instructions and provides a highly optimized output. Either that, or Nvidia’s OpenCL implementation is not very efficient (which is not likely).

The OpenCL based benchmark showed my GPU to be approximately 7 times faster than my CPU, which is almost exactly what was predicted by comparing the theoretical GFLOPS values for single precision (a factor of 6.8).

## Conclusion

For normal code written in a high-level language like C or GLSL (multithreaded, single precision, and without explicit SSE instructions) the computational power is roughly equivalent to the number of cores or shader units. For my system this makes the GPU a factor of 25x faster.

Even though the CPU cores have a higher operating frequency and in principle could execute more instructions via their SSE registers, this does not seem to be fully utilized (and in fact, compiling with and without SSE optimization did not make a significant difference, even for this very simple example).

The OpenCL example tells another story: here the measured performance was proportional to the theoretical GFLOPS ratings. This is interesting since it indicates that OpenCL could also be useful for CPU applications.

One thing to bear in mind is that the examples tested here (the Mandelbrot and 4D Quaternion Julia) are very well-suited for GPU execution. For more complex code, with conditional branching, double precision floating point operations, and non-coalesced memory access, the CPU is much more efficient than the GPU. So for a desktop computer such as mine, a factor of 25x is probably the best you can hope for (and it is indeed a very impressive speedup for any kind of code).

It is also important to remember that GPUs are not magical devices. They perform operations with a theoretical peak performance typically 5-15 times larger than a CPU. So whenever you see these 1000x speedup claims (e.g. some of the CUDA showcases), it is probably just an indication of a poor CPU implementation.

But even though the performance of GPUs may be somewhat exaggerated, you can still get a tremendous speedup. And GPU interfaces such as GLSL shaders are really simple to use: you do not need to deal explicitly with threads, you have built-in vectors and matrices, and you can compile GLSL code dynamically, during run-time. All of these are features which make GPU programming nearly ideal for exploring pixel graphics systems.

# Fragmentarium – an IDE for exploring fractals and generative systems on the GPU.

As I mentioned in my previous post, I started experimenting with GLSL raytracing a couple of months ago, and I’m now ready to release the first version of Fragmentarium, an open source, cross-platform IDE for exploring pixel based graphics on the GPU.

It was mainly created for exploring Distance Estimated systems, such as Mandelbulbs or Kaleidoscopic IFS, but it can also be used for 2D systems.

Fragmentarium is inspired by Adobe’s Pixel Bender, but uses pure GLSL, and is specifically created with fractals and generative systems in mind. Besides Pixel Bender, there are also other, more specialized, GPU fractal applications out there, such as Boxplorer and Tacitus, but I wanted something code-centric, where I can quickly modify code and use code in a more modular manner.

Features:

• Multi-tabbed IDE, with GLSL syntax highlighting
• Modular GLSL programming – include other fragments
• User widgets to manipulate parameter settings.
• Different ‘mouse to GLSL’ mapping schemes (2D and 3D)
• Includes raytracer for distance estimated systems
• Many examples including Mandelbulb, Mandelbox, Kaleidoscopic IFS, and Julia Quaternion

http://syntopia.github.com/Fragmentarium/

There are binaries for Windows, but for now you’ll have to build it yourself for Mac and Linux. You will need a graphics card capable of running GLSL (any reasonably modern discrete card will do).

Here is a screenshot:

Fragmentarium screenshot.

There’s also a gallery at Flickr: Fragmentarium Group

Fragmentarium is not a mature application yet. The camera handling in particular needs some work in the next versions – camera settings are not saved as part of the parameters, there is no field-of-view control, and you often have to compensate for clipping. For future versions I also plan arbitrary resolution renders (tile based rendering) and animations.

There are probably also many small quirks and bugs – I’ve had several problems with ATI drivers, which seem to be much stricter than Nvidia’s.

# Folding Space II: Kaleidoscopic Fractals

Another type of interesting 3D fractal has appeared over at fractalforums.com: the Kaleidoscopic 3D fractals, introduced in this thread, by Knighty.

Once again these fractals are defined by investigating the convergence properties of a simple function. And similar to the Mandelbox, the function is built around the concept of folds. Geometrically, a fold is simply a conditional reflection: you reflect a point in a plane, if it is located on the wrong side of the plane.

It turns out that just by using plane-folds and scaling, it is possible to create classic 3D fractals, such as the Menger cube and the Sierpinski tetrahedron, and even recursive versions of the rest of the Platonic solids: the octahedron, the dodecahedron, and the icosahedron.
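To make the idea of a fold concrete, here is a minimal Python sketch of a conditional reflection. It assumes the plane passes through the origin and is given by a unit normal `n`, and it treats the side the normal points away from as the "wrong" side; the function name `plane_fold` is just something I picked for illustration.

```python
def plane_fold(p, n):
    """Reflect the point p in the plane through the origin with unit normal n,
    but only if p lies on the negative side of that plane."""
    # Signed distance from p to the plane.
    d = sum(pi * ni for pi, ni in zip(p, n))
    if d < 0.0:
        # Wrong side: mirror the point across the plane.
        return tuple(pi - 2.0 * d * ni for pi, ni in zip(p, n))
    # Right side: leave the point untouched.
    return tuple(p)


print(plane_fold((-1.0, 2.0, 3.0), (1.0, 0.0, 0.0)))  # point on the wrong side of the plane x = 0
print(plane_fold((1.0, 2.0, 3.0), (1.0, 0.0, 0.0)))   # point already on the right side
```

Applying a fold twice does nothing new, since the first application already moves every point to the right side.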

Example of a recursive dodecahedron

The kaleidoscopic fractals introduce an additional 3D rotation before and after the folds. It turns out that these perturbations introduce a rich variety of interesting and complex structures.
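As an illustration of how folds and scaling combine, below is a Python sketch of a fold-and-scale distance estimator for the Sierpinski tetrahedron, in the spirit of the systems discussed in the thread. The scale of 2, the offset (1, 1, 1), the iteration count, and the function name are my own choices for this sketch. The rotations are left out, and comments mark where a kaleidoscopic variant would apply them.

```python
import math


def sierpinski_de(p, scale=2.0, offset=(1.0, 1.0, 1.0), iterations=10):
    """Estimate the distance from the point p to a Sierpinski tetrahedron."""
    x, y, z = p
    for _ in range(iterations):
        # (A kaleidoscopic variant would apply a 3D rotation here.)
        # Three plane folds: reflect only if the point is on the wrong side.
        if x + y < 0.0:
            x, y = -y, -x
        if x + z < 0.0:
            x, z = -z, -x
        if y + z < 0.0:
            y, z = -z, -y
        # (A kaleidoscopic variant would apply a second rotation here.)
        # Scale about the offset point, which is one corner of the tetrahedron.
        x = scale * x - offset[0] * (scale - 1.0)
        y = scale * y - offset[1] * (scale - 1.0)
        z = scale * z - offset[2] * (scale - 1.0)
    # Undo the accumulated scaling to get a distance in the original space.
    return math.sqrt(x * x + y * y + z * z) * scale ** (-iterations)


print(sierpinski_de((1.0, 1.0, 1.0)))     # a corner of the tetrahedron: close to zero
print(sierpinski_de((10.0, 10.0, 10.0)))  # far outside: close to the distance to that corner
```

The returned value never reaches exactly zero for a finite number of iterations, so a raytracer would compare it against a small threshold.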

I’ve followed the thread and implemented most of the proposed systems by modifying Subblue’s Pixel Bender scripts.

Below are some of my images:

## The Menger Sponge

My first attempts. Pixel Bender kept crashing on me, until I realized that there is a GPU timeout in Windows Vista (read this for a solution).

## The Sierpinski

Then I moved on to the Sierpinski. The sequence below shows something characteristic of these fractals: the first slightly perturbed variations look artificial and synthetic, but when the system is distorted, it becomes organic and alive.

## The Icosahedron

I also tried the octahedron and dodecahedron, but my favorite is the icosahedron. Especially knighty’s hollow variant.

## Arbitrary Planes

One nice thing about these systems is that you do not necessarily need to derive a complex distance estimator – you can also just modify the distance estimator code, and see what happens. These last two images were constructed by modifying existing distance estimators.

It will be interesting to see where this is going.

Many fascinating 3D fractals have appeared at fractalforums.com over the last few weeks. And GPU processing now makes it possible to explore these systems in real-time.

For some time I’ve been wanting to play around with pixel (fragment) shaders, but I couldn’t find a proper playground.

Then I stumbled upon Shader Toy, by Inigo Quilez (whom I’ve mentioned several times on this blog). A couple of things make Shader Toy stand out:

It runs inside your browser. It uses the emerging WebGL standard, which provides JavaScript bindings for OpenGL (ES) 2.0. OpenGL can be used directly inside a Canvas HTML element, including support for custom shaders. As Shader Toy demonstrates, this makes it possible to do some very impressive stuff, such as real-time GPU-accelerated raytracing inside an element on a web page.

The examples are great. While Shader Toy itself is mostly a thin wrapper around the WebGL functionality, the great thing about it is the example shaders: 2D fractals and Demo Scene effects, but also complex examples like the Slisesix 4K demo, and examples of raytracing, and complex fractals, like the Quaternion Julia set, and the Mandelbulb.

The only problem with WebGL is that it is not supported by the current generation of browsers.

The good news is that the nightly builds of Firefox, Safari (WebKit), and Chromium (Google Chrome) all support it, and are quite easy to install: this is a good place for more information. If you use the Chromium builds, you don’t have to worry about messing up your existing browser configuration – the nightly builds are standalone versions and can be run without installation.

There are lots of complex shader tools out there: for instance, NVIDIA’s FX Composer, AMD’s Rendermonkey, TyphoonLabs’ OpenGL Shader Designer, and Lumina, but Shader Toy makes it very easy to get started with shaders. And it provides a rare insight into how those amazing 4K demos were made.

# Mandelbulb Implementations

Several implementations have appeared since the Mandelbulb surfaced a couple of months ago.

The first public GPU implementation I know of was created by ‘cbuchner1’. It is based on a sample from NVIDIA’s OptiX SDK, and features anaglyphic 3D, ambient occlusion, Phong shading, reflection, and environment maps. It can be downloaded here (Windows only and requires a forum signup).

Very interestingly, this binary runs on my laptop’s modest GeForce 8400M. I am a bit puzzled about this – NVIDIA states that the OptiX SDK requires a Quadro or a Tesla card, and I am not able to run the Julia OptiX demo that cbuchner1’s app is derived from.

Subblue has also created a Mandelbulb implementation, released as a Pixel Bender script and a Quartz Composer plugin. A number of interesting customizations make this my favorite choice: it is possible to explore negative and fractional powers, switch to Julia sets, and the lighting options can be fine-tuned. The only drawback is that Pixel Bender does not make it possible to directly rotate, zoom, and translate the camera – you have to rely on sliders for that.

Example created by Subblue.

Iñigo Quílez has also created a GPU implementation, but unfortunately he has not released any code yet. A couple of videos are available on Youtube, though: Part 1, Part 2, Part 3.

Quilez also discovered this intimate connection between the Shroud of Turin and the Mandelbulb.

The MathFuncRenderer also has a Mandelbulb implementation. I ran into a few quirks with this one – I had to install OpenAL, and the UI was quite unresponsive, but this may be due to my graphics card.

Another very interesting implementation is the GigaVoxels Mandelbulb: whereas most implementations cast rays and use a distance estimator to speed up the ray marching, GigaVoxels uses voxels stored in an octree, which is populated on-the-fly.

For other implementations, keep an eye on the Fractal Forums Mandelbulb Implementation category.

# Mandelbulb

A lot of sites have reported that a new, interesting 3D version of the Mandelbrot set has been discovered. The Mandelbulb has aesthetic qualities similar to Quaternion-Julia sets, but seems more diverse and suited for exploration.

“Cave of Lost Secrets” from Skytopia.

Skytopia has a great overview complete with many stunning images.

A good way to view the basic structure is this 56 Megapixel render from Skytopia (using the Seadragon viewer – requires Silverlight):

As of now, I do not know of any released software capable of generating Mandelbulbs, but it probably won’t be long:

Recent posts by Iñigo Quílez (who produced the Kindernoiser Quaternion-Julia set GPU renderer) indicate that he is very close to completing a fast GPU implementation. These posts also include the basic source-code, which I believe should make it possible to port to other targets, for instance Pixel Bender. Apparently Quílez has cooked up a distance estimator, and a fake ambient occlusion scheme (based on orbit traps) for these Mandelbulbs, which sounds very promising.

# Quaternion Julia sets and GPU computation.

Subblue has released another impressive Pixel Bender plugin, this time a Quaternion Julia set renderer.

Quaternions are extensions of the complex numbers with four independent components. Quaternion Julia sets still explore the convergence of the system z ← z² + c, but this time z and c are allowed to be quaternion-valued numbers. Since quaternions are essentially four-dimensional objects, only a slice (the intersection of the set with a plane) of the quaternion Julia sets is shown.
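The quaternion version of the iteration is short enough to write out. The Python sketch below represents a quaternion as a plain `(w, x, y, z)` tuple and takes the slice by fixing the last component to zero. The helper names, the escape threshold, and the iteration limit are my own assumptions, not taken from Subblue’s plugin.

```python
def quat_square(q):
    """Square of a quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return (w * w - x * x - y * y - z * z, 2 * w * x, 2 * w * y, 2 * w * z)


def quat_escape_time(q, c, max_iter=50, bailout_sq=4.0):
    """Iterate q <- q^2 + c. Return the step at which |q|^2 exceeds bailout_sq,
    or None if the orbit stays bounded for max_iter steps."""
    for i in range(max_iter):
        if sum(v * v for v in q) > bailout_sq:
            return i
        sq = quat_square(q)
        q = tuple(s + ci for s, ci in zip(sq, c))
    return None


def slice_point(x, y, z):
    """Embed a 3D point in quaternion space by fixing the fourth component."""
    return (x, y, z, 0.0)


c = (-0.2, 0.6, 0.2, 0.2)  # an arbitrary constant chosen for illustration
print(quat_escape_time(slice_point(0.0, 0.0, 0.0), c))
print(quat_escape_time(slice_point(3.0, 0.0, 0.0), c))
```

Points for which the function returns `None` are treated as belonging to the (slice of the) Julia set.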

Quaternion Julia sets would be very time consuming to render if it wasn’t for a very elegant (and surprising) formula, the distance estimator, which for any given point gives you the distance to the closest point on the Julia Set. The distance estimator method was first described in: Ray tracing deterministic 3-D fractals (1989).
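The reason a distance estimator helps so much is that a ray can safely advance by the estimated distance at every step without passing through the surface. Here is a minimal Python sketch of that marching loop. To keep it short, it uses the exact distance to a unit sphere as a stand-in for the Julia set estimator (the sphere, the step limit, and the thresholds are my assumptions, and the formula from the 1989 paper is not reproduced here).

```python
import math


def sphere_de(p, radius=1.0):
    """Stand-in distance estimator: exact distance to a sphere at the origin."""
    return math.sqrt(sum(v * v for v in p)) - radius


def march(origin, direction, de, max_steps=100, epsilon=1e-4, max_dist=100.0):
    """Advance along a ray (direction must be unit length) using the distance estimate.
    Return the distance travelled to the surface, or None if nothing was hit."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = de(p)
        if dist < epsilon:
            return t          # close enough to call it a hit
        t += dist             # safe step: no surface is closer than dist
        if t > max_dist:
            break             # the ray has left the scene
    return None


print(march((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), sphere_de))  # ray aimed at the sphere
print(march((0.0, 0.0, -3.0), (0.0, 1.0, 0.0), sphere_de))  # ray that misses
```

Far from the surface the steps are large, and they only become small close to it, which is where the speedup over fixed-step marching comes from.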

My first encounter with Quaternion Julia sets was Inigo Quilez’ amazing Kindernoiser demo which packed a complete renderer with ambient occlusion into a 4K executable. It also used the distance estimator method and GPU based acceleration. If you haven’t visited Quilez’ site be sure to do so. It is filled with impressive demos, and well-written tech articles.

Transfigurations (another Quaternion Julia set demo) from Inigo Quilez on Vimeo.

In the 1989 Quaternion Julia set paper, the authors produced their images on an AT&T Pixel Machine, with 64 CPUs each running at 10 megaFLOPS. I suspect that this was an insanely expensive machine at the time. For comparison, the relatively modest NVIDIA GeForce 8400M GS in my laptop has a theoretical maximum processing rate of 38 gigaFLOPS, or approximately 60 times that of the Pixel Machine. A one megapixel image took the authors of the 1989 paper 1 hour to generate, whereas Subblue’s GPU implementation uses ca. 1 second on my laptop (making it much more efficient than what would have been expected from the FLOPS ratio).
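The comparison above is just two divisions; here it is as a small Python sketch, using only the figures quoted in the paragraph (the variable names are mine).

```python
# AT&T Pixel Machine: 64 CPUs at 10 megaFLOPS each.
pixel_machine_flops = 64 * 10e6
# GeForce 8400M GS: 38 gigaFLOPS theoretical peak.
laptop_gpu_flops = 38e9

flops_ratio = laptop_gpu_flops / pixel_machine_flops
print(f"Peak FLOPS ratio: {flops_ratio:.1f}x")

# Rendering time: about 1 hour then, about 1 second now.
time_ratio = 3600 / 1
print(f"Render time ratio: {time_ratio:.0f}x")

# How much better than the raw FLOPS ratio would predict?
print(f"Unexplained factor: {time_ratio / flops_ratio:.0f}x")
```

So the measured speedup is roughly sixty times larger than the raw peak FLOPS numbers alone would suggest.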

## GPU Acceleration and the future.

These days there is a lot of talk about using GPUs for general purpose programming. The first attempts to use GPUs to speed up general calculations relied on tricks such as using pixel shaders to perform calculations on data stored in texture memory, but since then several APIs have been introduced to make it easier to program the GPUs.

NVIDIA’s CUDA is currently by far the most popular and best documented API, but it is for NVIDIA only. Their gallery of applications demonstrates the diversity of how GPU calculations can be used. AMD/ATI has its competing Stream API (formerly called Close To Metal) but don’t bet on this one – I’m pretty sure it is almost abandoned already. Update: as pointed out in the comments, the new ATI Stream 2.0 SDK will include ATI’s OpenCL implementation, which for all I can tell is here to stay. What I meant to say was that I don’t think ATI’s earlier attempts at creating a GPU programming interface (including the Brook+ language) are likely to catch on.

Far more important is the emerging OpenCL standard (which is being promoted in Apple’s Snow Leopard, and is likely to become a de facto standard). Just like OpenGL, it is managed by the Khronos group. OpenCL was originally developed by Apple, and they still own the trademark, which is probably why Microsoft has chosen to promote their own API, DirectCompute. My guess is that CUDA and Brook+ will slowly fade away, as both OpenCL and DirectCompute will come to co-exist just the same way as OpenGL and Direct3D do.

For cross-platform development OpenCL is therefore the most interesting choice, and I’m hoping to see NVIDIA and AMD/ATI release public drivers for Windows as soon as possible (as of now they are in closed beta versions).

GPU acceleration could be very interesting from a generative art perspective, since it suddenly becomes possible to perform advanced visualization, such as ray-tracing, in real-time.

A final comment: a few days ago I found this quaternion Julia set GPU implementation for the iPhone 3GS using OpenGL ES 2.0 programmable shaders. I think this demonstrates the sophistication of the iPhone hardware and software platform: not only does a hand-held device have a programmable GPU, but the SDK is also flexible enough to make it possible to access it.

# Fractal Explorer Plugin

In July Subblue released another Pixel Bender plugin, called the Fractal Explorer Plugin – for exploring Julia sets and fractal orbit maps. I didn’t get around to trying it out until recently, but it is really a great tool for exploring fractals.

Most people have probably seen examples of Julia and Mandelbrot sets – where the convergence properties of the series generated by repeated application of a complex-valued function are investigated.

The most well-known example is the iteration of the function z ← z² + c. The Mandelbrot set is created by plotting the convergence rate for this function while c varies over the complex plane. Likewise, the Julia set is created for a fixed c while varying the initial z-value over the complex plane.

Glynn1 by Subblue (A Julia-set where an exponent of 1.5 is used).
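The difference between the Mandelbrot and Julia sets described above comes down to which of the two values is varied. The following Python sketch uses the standard escape-time approach with the built-in complex type; the function names, the bailout radius of 2, and the iteration limit are my own choices for illustration.

```python
def escape_time(z, c, max_iter=100, bailout=2.0):
    """Iterate z <- z^2 + c and return the step at which |z| exceeds the bailout,
    or max_iter if the orbit stays bounded."""
    for i in range(max_iter):
        if abs(z) > bailout:
            return i
        z = z * z + c
    return max_iter


def mandelbrot(c):
    """Mandelbrot: z starts at zero, and c is the pixel position."""
    return escape_time(0j, c)


def julia(z0, c):
    """Julia: c is fixed for the whole image, and the start value z0 is the pixel position."""
    return escape_time(z0, c)


print(mandelbrot(0j), mandelbrot(2 + 2j))
print(julia(0j, -0.2 + 0j), julia(3 + 0j, -0.2 + 0j))
```

The returned step count is what is usually mapped to a color when plotting the "convergence rate".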

Where ordinary Julia and Mandelbrot sets only take into account whether the series created by the iterated function tends towards infinity (diverges) or not, fractal orbits instead use another image as input, and check whether the complex number series generated by the function hits a (non-transparent) pixel in the source image. This allows for some very fascinating ‘fractalization’ of existing images.

A fractal orbit showing a highly non-linear transformation of a Mondrian picture.
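To give an idea of how such an orbit mapping might work, here is a rough Python sketch. It is my own guess at the general scheme, not Subblue’s actual kernel: the image is a plain list of rows of RGBA tuples, `view` is an assumed rectangle of the complex plane that the image covers, and the escape radius is an assumption as well.

```python
import math


def orbit_hit(z, c, image, view, max_iter=150, escape_radius=10.0):
    """Follow the orbit of z <- z^2 + c and return the first non-transparent
    pixel of the image that it lands on, or None if it never hits one."""
    x_min, y_min, x_max, y_max = view
    height = len(image)
    width = len(image[0])
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > escape_radius:
            return None  # the orbit has escaped far beyond the image
        # Map the complex number to pixel coordinates.
        px = math.floor((z.real - x_min) / (x_max - x_min) * width)
        py = math.floor((z.imag - y_min) / (y_max - y_min) * height)
        if 0 <= px < width and 0 <= py < height:
            pixel = image[py][px]
            if pixel[3] > 0:  # alpha above zero: non-transparent
                return pixel
    return None


OPAQUE = (255, 0, 0, 255)
CLEAR = (0, 0, 0, 0)
image = [[OPAQUE, CLEAR],
         [CLEAR, CLEAR]]          # a tiny 2x2 source image
view = (-1.0, -1.0, 1.0, 1.0)     # region of the complex plane covered by the image

print(orbit_hit(0j, -0.5 - 0.5j, image, view))
print(orbit_hit(0j, 0j, image, view))
```

Coloring every output pixel with the pixel its orbit hits is what produces the distorted copies of the source image.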

Subblue suggests starting out using Ernst Haeckel’s beautiful illustrations from the book Artforms of Nature, and he has put up a small gallery with some great examples:

An example of an orbit mapped Ernst Haeckel image.

To try out Subblue’s filter, download the Pixel Bender SDK and load his kernel filter and an input image of choice. It is necessary to uncheck the “Build | Turn On Flash Player Warnings and Errors” menu item in order to start the plugin. On my computer the Pixel Bender SDK is also often unable to detect and initialize my GPU – it sometimes helps to close other programs and restart the application. The filter executes extremely fast on the GPU – often at more than 100 frames per second, making it easy to interactively explore and discover the fractals.

As a final note, I implemented a fractal drawing routine myself in Structure Synth (version 1.0) just for fun. It is implemented as a hidden easter egg, and not documented at all, but the code below shows an example of how to invoke it:

#easter Size: 800x800 MaxIter: 150 Term: (-0.2,0) Term: 1*Z^1.5 BreakOut: 2 View: (-0.0,0.2) -> (0.7,0.9) 

Arguably, this code is not very optimized (it is possible to add an unlimited number of terms, making the function evaluation somewhat slow), but it still takes seconds to calculate an image, making it more than a hundred times slower than the Pixel Bender GPU solution.