Category Archives: Fragmentarium

Hybrid 3D Fractals

A lot of great images have been made of the Mandelbulb, the Mandelbox, and the various kaleidoscopic IFS’s (the non-platonic non-solids). And it turns out that by combining these formulas (and stirring a few assorted functions into the mix), a variety of new, amazing, and surprising forms emerge.

I’m currently working on making it easier to combine different formulas in Fragmentarium – but until I get something released, here is a collection of images and movies created by Mandelbulb 3D (Windows, free) and Mandelbulber (Windows, free, open-source), that illustrates the beauty and diversity of these hybrid systems. Be sure to view the large versions by following the links. The images were all found at Fractal Forums.

Videos


Buddhi – Mandelbox and Flying Lights


Jérémie Brunet (Bib) – Weird Planet II


Jérémie Brunet (Bib) – Like in a dream II

Images


Lenord – Pray your Gods


Tomot – It’s a jungle out there


Lenord – J.A.R.


MarkJayBee – Security Mechanisms


Fractal00 – Alien Stones


Kr0mat1k – Restructuration


BrutalToad – Jülchen

Fragmentarium v0.8

I’ve released a new build of Fragmentarium with some much needed updates, including better camera control, high resolution renders, and animation.

New features in version 0.8:

  • The 3D camera has been rewritten: it is now a “first-person”, pinhole camera (like Boxplorer and Fractal Lab), and is controllable using mouse and keyboard. Camera view can now be saved together with other settings.
  • Arbitrary resolution renderings (using tile based rendering – the GPU won’t time out).
  • Preview modes (renders to FBO with lower resolution and rescales).
  • ‘Tile preview’ for previewing part of high-resolution renders.
  • Animation controller (experimental: no keyframes yet, you must animate using the system supplied ‘time’ variable. Animation is output as a sequence of still images).
  • Presets (group parameters settings and load them into a dropbox)
  • New fractals: QuaternionMandelbrot4D, Ducks, NewMenger.
  • Improved raytracer: dithering, fog, new coloring schemes.

Download it here: http://syntopia.github.com/Fragmentarium/get.html


High-resolution render of a 4D Quaternion Mandelbrot (click for large)


Samuel Monnier’s ‘Ducks’ Fractal has been added.


Mandelbrot/Julia type system now with embedded Mandelbrot map.

Fragmentarium test animation – click here for higher resolution.

GPU versus CPU for pixel graphics

After having gained a bit of experience with GPU shader programming during my Fragmentarium development, a natural question to ask is: how fast are these GPU’s?

This is not an easy question to answer, and it depends on the specific application. But I will try to give an answer for the kind of systems that I’m interested in: pixel graphics systems, where each pixel can be calculated independently of the others, such as raytraced 3D fractals.

Lets take my desktop computer, a fairly standard desktop machine, as an example. It is equipped with Nvidia Geforce 9800GT GPU @ 1.5 GHz, and a Intel Core 2 Quad Q8200 @ 2.33GHz.

How many processing unit are there?

Number of processing units (CPU): 4 CPU cores
Number of processing units (GPU): 112 Shader units

Based on these numbers, we might expect the GPU to be a factor of 28x times faster than the CPU. Of course, this totally ignores the efficiency and operating speed of the processing units. Lets try looking at the processing power in terms of maximum number of floating-point operations per second instead:

Theoretical Peak Performance

Both Intel and Nvidia list the GFLOPS (billion floating point operations per second) rating for their products. Intel’s list can be found here, and Nvidia’s here. For my system, I found the following numbers:

Performance (CPU): 37.3 GFLOPS
Performance (GPU): 504 GFLOPS

Based on these numbers, we might expect the GPU to be a factor of 14x times faster than the CPU. But what do these numbers really mean, and can they be compared? It turns out that these number are obtained by multiplying the processor frequency by the maximum number of instructions per clock cycle.

For the CPU, we have four cores. Now, when Intel calculate their numbers, they do it based on the special 128-bit SSE registers on every modern Pentium derived CPU. These extensions make it possible to handle two double precision floating point, or four single precision floating point numbers per clock cycle. And in fact there exists a special instruction – the MAD, or Multiply-Add, instruction – which allows for two arithmetic operations per clock cycle on each element in the SSE registers. These means Intel assume 4 (cores) x 2 (double precision floats) x 2 (MAD instructions) = 16 instructions per clock cycle. This gives the theoretical peak performance stated above:

Performance (CPU): 2.33 GHz * 4 * 2 * 2 = 37.3 GFLOPS (double precision floats)

What about the GPU? Here we have 112 independent processing units. On the GPU architecture an even more benchmarking-friendly instruction exists: the MAD+MUL which combines two multiplies and one addition in a single clock cycle. This means Nvidia assumes 112 (cores) * 3 (MAD+MUL instructions) = 336 instructions per clock cycle. Combining this with a stated processing frequency of 1.5 GHz, we arrive at the number stated earlier:

Performance (GPU): 1.5 GHz * 112 * 3 = 504 GFLOPS (single precision floats)

But wait… Nvidia’s number are for single precision floats – the Geforce 8800GT does not even support double precision floats. So for a fair comparison we should double Intel’s number, since the SSE extensions allows four simultaneous single precision numbers to be processed instead of two double precision floats. This way we get:

Performance (CPU): 2.33 GHz * 4 * 4 * 2 = 74.6 GFLOPS (single precision floats)

Now, using this as a guideline, we would expect my GPU to be a factor of 6.8x faster than my CPU. But we have some pretty big assumptions here: for instance, not many CPU programmers would write SSE-optimized code – and is a modern C++ compiler powerful enough to automatically take advantage of them anyway? And how often is the GPU able to use the three operation MUL+MAD instruction?

A real-world experiment

To find out I wrote a simple 2D Mandelbrot system and benchmarked it on the CPU and GPU. This is really the kind of computational tasks that I’m interested in: it is trivial to parallelize and is not memory-intensive, and the majority of executed code will be floating point arithmetics. I did not try to optimize the C++ code, because I wanted to see if the compiler was able to perform some SSE optimization for me. Here are the execution times:

13941 ms – CPU single precision (x87)
13941 ms – CPU double precision (x87)
10535 ms – CPU single precision (SSE)
11367 ms – CPU double precision (SSE)
424 ms – GPU single precision

(These numbers have some caveats – I did perform the tests multiple times and discarded the first few runs, but the CPU code was only single-threaded – so I assumed the numbers would scale perfectly and divided the execution times by four. Also, I verified by checking the generated assembly code, that SSE instructions indeed were used for the core Mandelbrot loop, when they were enabled.).

There are a couple of things to notice here: first, there is no difference between single and double precision on the CPU. This is as could be expected for the x87 compiled code (since the x87 defaults to 80-bit precision anyway), but for the SSE version, we would expect a double up in speed. As can be seen, the SSE code is really not very much more efficient the the x87 code – which strongly suggests that the compiler (here Visual Studio C++ 2008) is not very good at optimizing for SSE.

So for this example we got a factor of 25x speedup by using the GPU instead of the CPU.

“Measured” GFLOPS

Another questions is how this example compares to the theoretical peak performance. By using Nvidia’s Cg SDK I was able to get the GPU assembly code. Since I now could count the number of instruction in the main loop, and I knew how many iterations were performed, I was able to calculate the actual number of floating point operations per second:

GPU: 211 (Mandel)GFLOPS
CPU: 8.4 (Mandel)GFLOPS*

(*The CPU number was obtained by assuming the number of instructions in the core loop was the same as for the GPU: in reality, the CPU disassembly showed that the simple loop was partially unrolled to more than 200 lines of very complex assembly code.)

Compared to the theoretical maximum numbers of 504 GFLOPS and 74.6 GFLOPS respectively, this shows the GPU is much closer to its theoretical limit than the CPU.

GPU Caps Viewer – OpenCL

A second test was performed using the GPU Caps Viewer. This particular application includes a 4D Quaternion Julia fractal demo in OpenCL. This is interesting since OpenCL is a heterogeneous platform – it can be compiled to both CPU and GPU. And since Intel just released an alpha version of their OpenCL SDK, I could compare it to Nvidia’s SDK.

The results were interesting:

Intel OpenCL: ~5 fps
Nvidia OpenCL: ~35 fps

(The FPS did vary through the animation, so these numbers are not very accurate. There were no dedicated benchmark mode.)

This suggest that Intel’s OpenCL compiler is actually able to take advantage of the SSE instructions and provides a highly optimized output. Either that, or Nvidia’s OpenCL implementation is not very efficient (which is not likely).

The OpenCL based benchmark showed my GPU to be approximately 7x times faster than my CPU. Which is exactly the same as predicted by comparing the theoretical GFLOPS values (for single precision).

Conclusion

For normal code written in a high-level language like C or GLSL (multithreaded, single precision, and without explicit SSE instructions) the computational power is roughly equivalent to the number of cores or shader units. For my system this makes the GPU a factor of 25x faster.

Even though the CPU cores have higher operating frequency and in principle could execute more instructions via their SSE registers, this does not seem be fully utilized (and in fact, compiling with and without SSE optimization did not make a significant difference, even for this very simple example).

The OpenCL example tells another story: here the measured performance was proportional to the theoretical GFLOPS ratings. This is interesting since this indicate, that OpenCL could also be interesting for CPU-applications.

One thing to bear in mind is, that the examples tested here (the Mandelbrot and 4D Quaternion Julia) are very well-suited for GPU execution. For more complex code, with conditional branching, double precision floating point operations, and non-coalesced memory access, the CPU is much more efficient than the GPU. So for a desktop computer such as mine, a factor of 25x is probably the best you can hope for (and it is indeed a very impressive speedup for any kind of code).

It is also important to remember that GPU’s are not magical devices. They perform operations with a theoretical peak performance typically 5-15 times larger than a CPU. So whenever you see these 1000x speed up claims (e.g. some of the CUDA showcases), it is probably just an indication of a poor CPU implementation.

But even though the performance of GPU’s may be somewhat exaggerated you can still get a tremendous speedup. And GPU interfaces such as GLSL shaders are really simple to use: you do not need to deal explicitly with threads, you have built-in vectors and matrices, and you can compile GLSL code dynamically, during run-time. All features which makes GPU programming nearly ideal for exploring pixel graphic systems.

Fragmentarium – an IDE for exploring fractals and generative systems on the GPU.

As I mentioned in my previous post, I started experimenting with GLSL raytracing a couple of months ago, and I’m now ready to release the first version of Fragmentarium, an open source, cross-platform IDE for exploring pixel based graphics on the GPU.

It was mainly created for exploring Distance Estimated systems, such as Mandelbulbs or Kaleidoscopic IFS, but it can also be used for 2D systems.

Fragmentarium is inspired by Adobe’s Pixel Bender, but uses pure GLSL, and is specifically created with fractals and generative systems in mind. Besides Pixel Bender, there are also other, more specialized, GPU fractal applications out there, such as Boxplorer and Tacitus, but I wanted something code-centric, where I quickly can modify code and use code in a more modular manner.

Features:

  • Multi-tabbed IDE, with GLSL syntax highlighting

  • Modular GLSL programming – include other fragments
  • User widgets to manipulate parameter settings.
  • Different ‘mouse to GLSL’ mapping schemes (2D and 3D)
  • Includes raytracer for distance estimated systems
  • Many examples including Mandelbulb, Mandelbox, Kaleidoscopic IFS, and Julia Quaternion

Fragmentarium can be downloaded from:
http://syntopia.github.com/Fragmentarium/

There are binaries for Windows, but for now you’ll have to build it yourself for Mac and Linux. You will need a graphics card capable of running GLSL (any reasonably moderne discrete card will do).

Here is a screenshot:

Fragmentarium Screenshot

Fragmentarium screenshot (click to enlarge).

There’s also a gallery at Flickr: Fragmentarium Group

Fragmentarium is not a mature application yet. Especially the camera handling needs some work in the next versions – camera settings are not saved as part of the parameters, no field-of-view control and you often have to compensate for clipping. For future versions I also plan arbitrary resolution renders (tile based rendering) and animations.

There are probably also many small quirks and bugs – I’ve had several problems with ATI drivers, which seems to be much more strict than Nvidias.

Creating a Raytracer for Structure Synth (Part III)

This post discusses some of the things I realised while coding the raytracer in Structure Synth and some thoughts about GPU raytracing.

Luxion

Back in October, I saw a presentation by Henrik Wann Jensen (known for his work on photon mapping and subsurface scattering) where he demonstrated the Keyshot renderer made by his company Luxion. It turned out to be a very impressive demonstration: the system was able to render very complex scenes (a car with a million polygons), with very complex materials (such as subsurface scattering) and very complex lighting, in real-time on a laptop.

After having downloaded a Keyshot renderer demo, I was able to figure out some of the tricks used to achieve this kind of performance.

  • Using panoramic HDRI maps as backgrounds: the use of a panoramic background image gives the illusion of a very complex scene at virtually no computational cost. The 3D objects in the scene even seem to cast shadows on this background image: something which can be done by adding an invisible ground floor. In most cases this simple hack works out quite nice.

  • Image Based Lighting: by using the panoramic HRDI maps to calculate the lighting as well, you achieve very complex and soft lighting, which of course matches the panoramic background. And you don’t have to setup complicated area lights or stuff like that.
  • Progressive rendering: images in Keyshot may take up to minutes before the rendering has converged, but you do not notice. Because Luxshot renders progressively you can immediately adjust your settings without having to wait for the system. This means the perceived rendering speed is much higher than in classic raytracers (such as POV-Ray).


The above image is a Structure Synth model exported in OBJ format, and rendered in Keyshot.

Some of these ideas should be relatively easy to implement. Especially the idea of using panoramic maps as light sources is clever – setting up individual light sources in Structure Synth programmatically would be tedious, and the complex lighting from a panoramic scene is very convincing (at least in Keyshot).

Henrik Wann Jensen also made some interesting comments about GPU computing: essentially he was not very impressed, and seemed convinced that similar performance could be obtained on CPUs, if utilizing the resources properly.

Of course Wann Jensen may be somewhat biased here: after all Luxions biggest competitor (and former partner), Bunkspeed, use a hybrid GPU/CPU approach in their Bunkspeed Shot raytracer. The raytracing technology used here is a licensed version of Mental Images iray engine.

Having tried both raytracers, I must say Luxion’s pure CPU technique was by far the most impressive, but as I will discuss later, GPU raytracing may have other advantages. (And it is really difficult to understand why it should not be possible to benefit from using both CPU and GPU together).

If you haven’t tried Keyshot, it is definitely worth it: I never really liked integrated raytracing environments (such as Blender), because it almost always is so difficult to setup stuff like light sources and materials, but Keyshot is certainly a step forward in usability.

Physically Based Rendering

As I described in my previous post, I made many mistakes during the development of the Raytracer in Structure Synth. And all these mistakes could have been avoided, if I had started out by reading ‘Physically Based Rendering‘ by Matt Pharr and Greg Humphreys. But yet again – part of the fun is trying to figure out new ways to do things, instead of just copying other people.

I have only read fragments of it so far, but still I can wholeheartedly recommend it: even though I’m a physicist myself, I was surprised to see just how far you can get using physical principles to build a complete raytracing framework. It is a very impressive and consistent approach to introducing raytracing, and the book covers nearly all aspects of raytracing I could think of – except one minor thing: I would have loved to see a chapter on GPU raytracing, especially since both authors seem to have extensive experience with this topic.


Microcity.pbrt – modeled by Mark Schafer

But I was flattered to see an image created with Structure Synth in the book – and their pbrt raytracer even has integration to Structure Synth!

GPU Based Rendering

Now, for the past months I’ve made a few experiments with GPU based rendering and generative systems using GLSL. And even though Henrik Wann Jensen may be right, and the performance gain of GPU’s may be somewhat exaggerated, they do have some advantages which are often ignored:

  • GPU programming is easy. If you want to achieve high performance on a CPU, you need to write complex multi-threaded code and do tricky low-level optimizations, such as inlined SSE instructions and custom memory management. On the other hand GLSL is very easy to write and understand and even comes with built-in support for matrices and vectors. Well, actually GPU programming can also be terrible hard: if you need to share memory between threads, or move data between the GPU and CPU, things can get very ugly (and you will need a more powerful API than GLSL). But for ‘embarrassingly parallel’ problems – for instance if each pixel can be calculated independently – with simply, unshared memory access – GPU programming can be just as easy as for instance Processing.

  • Code can be compiled and uploaded to the GPU dynamically. In C++ or Java making a small change to one part of a program requires a recompilation and restart of the program. But with GLSL (and all other GPU API’s) you get a fast compiler you can use during run-time to create highly efficient multi-threaded code.

The combination of being easy, fast, and dynamic, makes GPU rendering wonderful for exploring many types of generative art systems. In fact, probably the biggest hurdle is to setup the boilerplate code required to start a GLSL program. And even though there are many nice tools out there (e.g. Pixel Bender) none of them are really suited for generative systems.

So, for the past month I’ve been working a bit on an integrated environment for exploring GPU accelerated pixel based systems, called Fragmentarium. It is still quite immature, but I plan to release a version (with binary builds) later this January.


Fragmentarium example

This also means that I will use most of my spare time on this project the coming months, but some of the ideas will probably find use in Structure Synth as well.