Optimizing GLSL Code

By making selected variables constant at compile time, some 3D fractals render more than four times faster. Support for easily locking variables has been added to Fragmentarium.

Some time ago, I became aware that the raytracer in Fragmentarium was somewhat slower than both Fractal Lab and Boxplorer for similar systems – which was puzzling, since the DE raycasting technique is pretty much the same. After a bit of investigation, I realized that my standard raytracer had grown slower and slower as new features were added (e.g. reflections, hard shadows, and floor planes) – even when the features were turned off!

One way to speed up GLSL code is to mark some variables as constant at compile-time. This allows the compiler to optimize the code (e.g. unroll loops) and remove unused code (e.g. if hard shadows are disabled). The drawback is that changing these constant variables requires the GLSL code to be recompiled.
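As a minimal illustration (using the ‘Scale’ parameter that also appears in the examples below), locking a variable amounts to the difference between these two declarations:

uniform float Scale;      // tweakable at run-time: the value is uploaded by the host application, so the compiler must treat it as unknown
const float Scale = 2.0;  // locked at compile-time: the compiler can fold the value into the code, unroll dependent loops, and strip dead branches

Only one of the two declarations is present at a time, of course; locking a variable simply swaps the first form for the second and triggers a rebuild.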

It turned out that this has a great impact on some systems. For instance, take a look at the following render times for ‘Dodecahedron.frag’:

No constants: 1.4 fps (1.0x)
Constant rotation matrices : 3.4 fps (2.4x)
Constant rotation matrices + Anti-alias + DetailAO: 5.6 fps (4.0x)
All 38 parameters (except camera): 6.1 fps (4.4x)

The fractal rotation matrices are the matrices used inside the DE-loop. Without the constant declarations, they must be calculated from scratch for each pixel, even though they are identical for all pixels. Doing the calculation at compile-time gives a notable speedup of 2.4x (notice that another approach would be to calculate such frame constants in the vertex shader and pass them to the pixel shader as ‘varying’ variables. But according to this post this is – surprisingly – not very effective).

The next speedup – from the ‘Anti-alias’ and ‘DetailAO’ variables – is more subtle. It is difficult to see from the code why these two variables should have such an impact, and in fact it turns out that other combinations of variables result in the same speedup. But these speedups are not additive! Even if you make all variables constant, the framerate only increases slightly above 5.6 fps. It is not clear why this happens, but I have a guess: it seems that when the complexity drops below a certain threshold, the shader code execution speed increases sharply. My guess is that for complex code, the shader runs out of free registers and needs to perform calculations using a slower kind of memory storage.

Interestingly, the ‘iterations’ variable offers no speedup – even though the compiler must be able to unroll the principal DE loop, there is no measurable improvement from doing so.

Finally, the compile time is also greatly reduced when making variables constant. For the ‘Dodecahedron.frag’ code, the compile time is ~2000ms with no constants. By making most variables constant, the compile time is lowered to around ~335ms on my system.

Locking in Fragmentarium.

In Fragmentarium, variables can be locked (made compile-time constant) by clicking the padlock next to them. Locked variables appear with a yellow padlock next to them. When a variable is locked, changes to it only take effect once the system is compiled again (by pressing ‘Build’). Locked variables that have been changed appear with a yellow background until the system is compiled and the changes take effect.

Notice that whole parameter groups may be locked by using the buttons at the bottom.


The locking interface – click to enlarge.


The ‘AntiAlias’ and ‘DetailAO’ variables are locked. The ‘DetailAO’ variable has been changed, but the change has not taken effect yet (hence the yellow background). The ‘BoundingSphere’ variable has a grey background because it has keyboard focus: its value can be fine-tuned using the arrow keys (up/down controls the step size, left/right changes the value).

In a fragment, a user variable can be marked as locked by default by adding a ‘Locked’ keyword to it:

uniform float Scale; slider[-5.00,2.0,4.00] Locked

Some variables cannot be locked – e.g. the camera settings. It is possible to mark such variables with the ‘NotLockable’ keyword:

uniform vec3 Eye; slider[(-50,-50,-50),(0,0,-10),(50,50,50)] NotLockable

The same goes for presets. Here the locking mode can be stated if it differs from the default locking mode:

#preset SomeName
AntiAlias = 1 NotLocked
Detail = -2.81064 Locked
Offset = 1,1,1
...
#endpreset

Locking will be part of Fragmentarium v0.9, which will be released soon.

Syntopia Blog Update

It has not been possible to post comments at my blog for some months. Apparently, my reCAPTCHA plugin was broken (amazingly, spam comments still made their way into the moderation queue).

This should be fixed now.

I’m also on Twitter now: @SyntopiaDK, where I’ll post links and news related to generative systems, 3D fractals, or whatever pops up.

Finally, if you are near Stockholm, some of my images are on display at a small gallery (from July 9th to September 11th): Kungstensgatan 27.

Plotting High-frequency Functions Using a GPU.

A slight digression from the world of fractals and generative art: This post is about drawing high-quality graphs of high-frequency functions.

Yesterday, I needed to draw a few graphs of some simple functions. I started out by using Iñigo Quilez’s nice little GraphToy, but my functions could not be expressed in a single line. So I decided to implement a graph plotter example in Fragmentarium instead.

Plotting a graph using a GLSL shader is not an obvious task – you have to frame the problem in a way such that each pixel can be processed individually. This is in contrast to the standard way of drawing graphs, where you choose a uniform set of values for the x-axis and draw the lines connecting the points in the (x, f(x)) set.

So how do you do it for each pixel individually?

The first thing to realize is that it is easy to determine whether a pixel is above or below the graph – this can be done by checking whether y<f(x) or y>f(x). The tricky part is that we only want to draw the boundary – the curve that separates the regions above and below the graph.

So how do we determine the boundary? After having tried a few different approaches, I came up with the following simple edge detection procedure: for each pixel, choose a number of samples in a region around the pixel center. Then count how many samples are above, and how many are below, the curve.

If all samples are above, or all samples are below, the pixel is not on the boundary. However, if there are samples both above and below, the boundary must be passing through the pixel.

The whole idea can be expressed in a few lines of code:

int count = 0;
for (float i = 0.0; i < samples; i++) {
	for (float j = 0.0; j < samples; j++) {
		// Positive if the sample point is below the graph, negative if above.
		float f = function(pos.x + i*step.x) - (pos.y + j*step.y);
		count += (f > 0.) ? 1 : -1;
	}
}
// Base the pixel color on abs(count)/(samples*samples):
// it is 1.0 when all samples fall on the same side of the graph, and smaller on the boundary.

It should be noted that the sampling can be improved by adding a small amount of jittering (random offsets) to the sample positions – this reduces the aliasing at the cost of adding a small amount of noise.
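A minimal sketch of how such jittering could look (the hash-based rand() below is a common GLSL trick, not necessarily what the GraphPlotter example uses):

float rand(vec2 co) {
	return fract(sin(dot(co, vec2(12.9898, 78.233))) * 43758.5453);
}

// Inside the sampling loops, offset each sample by a random fraction of a step:
float fx = pos.x + (i + rand(pos.xy + vec2(i, j))) * step.x;
float fy = pos.y + (j + rand(pos.yx + vec2(j, i))) * step.y;
float f = function(fx) - fy;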

High-frequency functions and aliasing

So why is this better than the common ‘connecting lines’ approach?

Because this approach deals with the high-frequency information much better.

Consider the function f(x)=sin(x*x*x)*sin(x).
Here is a plot from GraphToy:

Notice how the graph near the red arrows seems to be slowly varying. This is not the true behavior of the function, but an artifact of the way we sample it. Our limited resolution cannot capture the high-frequency components, which results in aliasing.

Whenever you do anything media-related on a computer, you will at some point run into problems with aliasing: whether you are doing sound synthesis, image manipulation, 3D rendering or even drawing a straight line.

However, using the pixel shader approach, aliasing is much easier to avoid. Here is a Fragmentarium plot of the same function:

Even though it may seem backwards to evaluate the function for every pixel on the screen, it makes it possible to tame the aliasing, and even on a modest GPU the procedure is fast enough for real-time interaction.

The example is included on GitHub under Examples/2D Systems/GraphPlotter.frag.

Distance Estimated 3D Fractals (Part I)

During the last two years, the 3D fractal field has undergone a small revolution: the Mandelbulb (2009), the Mandelbox (2010), the Kaleidoscopic IFSs (2010), and a myriad of equally or even more interesting hybrid systems, such as Spudsville (2010) or the Kleinian systems (2011).

All of these systems were made possible using a technique known as Distance Estimation and they all originate from the Fractal Forums community.

Overview of the posts

Part I briefly introduces the history of distance estimated fractals, and discusses how a distance estimator can be used for ray marching.

Part II discusses how to find surface normals, and how to light and color fractals.

Part III discusses how to actually create a distance estimator, starting with distance fields for simple geometric objects, moving on to instancing and combining fields (unions, intersections, and differences), and finally covering folding and conformal transformations, ending up with a simple fractal distance estimator.

Part IV discusses the holy grail: the search for a generalization of the 2D (complex) Mandelbrot set, including quaternions and other hypercomplex numbers. A running derivative for quadratic systems is introduced.

Part V continues the discussion of the Mandelbulb. Different approaches for constructing a running derivative are discussed: scalar derivatives, Jacobian derivatives, analytical solutions, and the use of different potentials to estimate the distance.

Part VI is about the Mandelbox fractal. It contains a more detailed discussion of conformal transformations, and of how a scalar running derivative is sufficient when working with these kinds of systems.

Part VII discusses how dual numbers and automatic differentiation may be used to construct a distance estimator.

Part VIII is about hybrid fractals, geometric orbit traps, various other systems, and links to relevant software and resources.

The background

The first paper to introduce Distance Estimated 3D fractals was written by Hart and others in 1989:
Ray tracing deterministic 3-D fractals

In this paper, Hart describes how Distance Estimation may be used to render a Quaternion Julia 3D fractal. The paper is very well written and definitely worth spending some hours on (be sure to take a look at John Hart’s other papers as well). Given the age of Hart’s paper, it is striking that it is not until the last couple of years that the field of distance estimated 3D fractals has exploded. There have been some important milestones, such as Keenan Crane’s GPU implementation (2004) and Iñigo Quilez’s 4K demoscene implementation (2007), but I’m not aware of other fractal systems being explored using Distance Estimation before the advent of the Mandelbulb.

Raymarching

Classic raytracing shoots one (or more) rays per pixel and calculates where the rays intersect the geometry in the scene. Normally the geometry is described by a set of primitives, like triangles or spheres, and some kind of spatial acceleration structure is used to quickly identify which primitives intersect the rays.

Distance Estimation, on the other hand, is a ray marching technique.

Instead of calculating the exact intersection between the camera ray and the geometry, you proceed in small steps along the ray and check how close you are to the object you are rendering. When you are closer than a certain threshold, you stop. In order to do this, you must have a function that tells you how close you are to the object: a Distance Estimator. The value of the distance estimator tells you how large a step you are allowed to march along the ray, since you are guaranteed not to hit anything within this radius.

Schematics of ray marching using a distance estimator.

The code below shows how to raymarch a system with a distance estimator:

float trace(vec3 from, vec3 direction) {
	float totalDistance = 0.0;
	int steps;
	for (steps = 0; steps < MaximumRaySteps; steps++) {
		vec3 p = from + totalDistance * direction;
		// The distance estimator tells us how far we can safely step without hitting anything.
		float distance = DistanceEstimator(p);
		totalDistance += distance;
		if (distance < MinimumDistance) break;
	}
	// Return a greyscale value based on the number of steps taken.
	return 1.0 - float(steps)/float(MaximumRaySteps);
}

Here we simply march the ray according to the distance estimator and return a greyscale value based on the number of steps before hitting something. This will produce images like this one (where I used a distance estimator for a Mandelbulb):

It is interesting that even though we have not specified any coloring or lighting models, coloring by the number of steps emphasizes the details of the 3D structure - in fact, this is a simple and very cheap form of the Ambient Occlusion soft lighting often used in 3D renders.

Parallelization

Another interesting observation is that these raymarchers are trivial to parallelize, since each pixel can be calculated independently and there is no need to access complex shared memory structures like the acceleration structures used in classic raytracing. This means that these kinds of systems are ideal candidates for implementation on a GPU. In fact, the only issue is that most GPUs still only support single precision floating point numbers, which leads to numerical inaccuracies sooner than for CPU implementations. However, the newest generation of GPUs supports double precision, and some APIs (such as OpenCL and Pixel Bender) are heterogeneous, meaning the same code can be executed on both CPU and GPU - making it possible to create interactive previews on the GPU and render final images in double precision on the CPU.

Estimating the distance

Now, I still haven't talked about how we obtain these Distance Estimators, and it is by no means obvious that such functions should exist at all. But it is possible to understand them intuitively, by noting that systems such as the Mandelbulb and Mandelbox are escape-time fractals: we iterate a function for each point in space, and follow the orbit to see whether the sequence of points diverges within a maximum number of iterations, or whether the sequence stays inside a fixed escape radius. Now, by comparing the escape-time length (r) to its spatial derivative (dr), we can get an estimate of how far we can move along the ray before the escape-time length drops below the escape radius, that is:

\(DE = \frac{r - \mathrm{EscapeRadius}}{dr}\)
 
This is a hand-waving estimate - the derivative might fluctuate wildly and get larger than our initial value, so a more rigorous approach is needed to find a proper distance estimator. I'll have a lot more to say about distance estimators in the later posts, so for now we will just accept that these functions exist and can be obtained for quite a diverse class of systems, and that they are often constructed by comparing the escape-time length with some approximation of its derivative.

It should also be noted that this ray marching approach can be used for any kind of system where you can find a lower bound for the distance to the closest geometry for all points in space. Iñigo Quilez has used this in his impressive procedural SliseSix demo, and has written an excellent introduction, which covers many topics also relevant for Distance Estimation of 3D fractals.
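To make the idea of a lower bound for the distance to the closest geometry concrete, here are two of the simplest possible distance estimators, for ordinary (non-fractal) geometry. The function names are just for illustration; both return exactly the kind of value the trace() function above expects:

// Signed distance from p to a sphere with the given center and radius.
float sphereDE(vec3 p, vec3 center, float radius) {
	return length(p - center) - radius;
}

// Signed distance from p to the horizontal plane y = height.
float planeDE(vec3 p, float height) {
	return p.y - height;
}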

This concludes the first part of this series of blog entries. Part II discusses lighting and coloring of fractals.

Hybrid 3D Fractals

A lot of great images have been made of the Mandelbulb, the Mandelbox, and the various Kaleidoscopic IFSs (the non-platonic non-solids). And it turns out that by combining these formulas (and stirring a few assorted functions into the mix), a variety of new, amazing, and surprising forms emerge.

I’m currently working on making it easier to combine different formulas in Fragmentarium – but until I get something released, here is a collection of images and movies created with Mandelbulb 3D (Windows, free) and Mandelbulber (Windows, free, open-source), which illustrate the beauty and diversity of these hybrid systems. Be sure to view the large versions by following the links. The images were all found at Fractal Forums.

Videos


Buddhi – Mandelbox and Flying Lights


Jérémie Brunet (Bib) – Weird Planet II


Jérémie Brunet (Bib) – Like in a dream II

Images


Lenord – Pray your Gods


Tomot – It’s a jungle out there


Lenord – J.A.R.


MarkJayBee – Security Mechanisms


Fractal00 – Alien Stones


Kr0mat1k – Restructuration


BrutalToad – Jülchen

Fragmentarium v0.8

I’ve released a new build of Fragmentarium with some much needed updates, including better camera control, high resolution renders, and animation.

New features in version 0.8:

  • The 3D camera has been rewritten: it is now a “first-person”, pinhole camera (like Boxplorer and Fractal Lab), and is controllable using mouse and keyboard. Camera view can now be saved together with other settings.
  • Arbitrary resolution renderings (using tile based rendering – the GPU won’t time out).
  • Preview modes (renders to FBO with lower resolution and rescales).
  • ‘Tile preview’ for previewing part of high-resolution renders.
  • Animation controller (experimental: no keyframes yet, you must animate using the system-supplied ‘time’ variable. Animation is output as a sequence of still images).
  • Presets (group parameter settings and load them from a drop-down box)
  • New fractals: QuaternionMandelbrot4D, Ducks, NewMenger.
  • Improved raytracer: dithering, fog, new coloring schemes.

Download it here: http://syntopia.github.com/Fragmentarium/get.html


High-resolution render of a 4D Quaternion Mandelbrot (click for large)


Samuel Monnier’s ‘Ducks’ Fractal has been added.


Mandelbrot/Julia type system now with embedded Mandelbrot map.

Fragmentarium test animation – click here for higher resolution.

GPU versus CPU for pixel graphics

After having gained a bit of experience with GPU shader programming during my Fragmentarium development, a natural question to ask is: how fast are these GPUs?

This is not an easy question to answer, and it depends on the specific application. But I will try to give an answer for the kind of systems that I’m interested in: pixel graphics systems, where each pixel can be calculated independently of the others, such as raytraced 3D fractals.

Let’s take my desktop computer, a fairly standard machine, as an example. It is equipped with an Nvidia GeForce 9800GT GPU @ 1.5 GHz and an Intel Core 2 Quad Q8200 @ 2.33 GHz.

How many processing units are there?

Number of processing units (CPU): 4 CPU cores
Number of processing units (GPU): 112 Shader units

Based on these numbers, we might expect the GPU to be a factor of 28x faster than the CPU. Of course, this totally ignores the efficiency and operating speed of the processing units. Let’s try looking at the processing power in terms of the maximum number of floating-point operations per second instead:

Theoretical Peak Performance

Both Intel and Nvidia list the GFLOPS (billion floating point operations per second) rating for their products. Intel’s list can be found here, and Nvidia’s here. For my system, I found the following numbers:

Performance (CPU): 37.3 GFLOPS
Performance (GPU): 504 GFLOPS

Based on these numbers, we might expect the GPU to be a factor of 14x faster than the CPU. But what do these numbers really mean, and can they be compared? It turns out that these numbers are obtained by multiplying the processor frequency by the maximum number of instructions per clock cycle.

For the CPU, we have four cores. Now, when Intel calculates its numbers, it does so based on the special 128-bit SSE registers found on every modern Pentium-derived CPU. These extensions make it possible to handle two double precision, or four single precision, floating point numbers per clock cycle. And in fact there exists a special instruction – the MAD, or Multiply-Add, instruction – which allows for two arithmetic operations per clock cycle on each element in the SSE registers. This means Intel assumes 4 (cores) x 2 (double precision floats) x 2 (MAD operations) = 16 floating-point operations per clock cycle. This gives the theoretical peak performance stated above:

Performance (CPU): 2.33 GHz * 4 * 2 * 2 = 37.3 GFLOPS (double precision floats)

What about the GPU? Here we have 112 independent processing units. On the GPU architecture an even more benchmarking-friendly instruction pair exists: the MAD+MUL, which combines two multiplications and one addition in a single clock cycle. This means Nvidia assumes 112 (cores) * 3 (MAD+MUL operations) = 336 floating-point operations per clock cycle. Combining this with the stated processing frequency of 1.5 GHz, we arrive at the number stated earlier:

Performance (GPU): 1.5 GHz * 112 * 3 = 504 GFLOPS (single precision floats)

But wait… Nvidia’s numbers are for single precision floats – the GeForce 9800GT does not even support double precision floats. So for a fair comparison we should double Intel’s number, since the SSE extensions allow four single precision numbers to be processed simultaneously instead of two double precision floats. This way we get:

Performance (CPU): 2.33 GHz * 4 * 4 * 2 = 74.6 GFLOPS (single precision floats)

Now, using this as a guideline, we would expect my GPU to be a factor of 6.8x faster than my CPU. But we have made some pretty big assumptions here: for instance, not many CPU programmers would write SSE-optimized code – and is a modern C++ compiler powerful enough to take advantage of these instructions automatically anyway? And how often is the GPU able to use the three-operation MAD+MUL instruction?

A real-world experiment

To find out, I wrote a simple 2D Mandelbrot system and benchmarked it on the CPU and GPU. This is exactly the kind of computational task that I’m interested in: it is trivial to parallelize, it is not memory-intensive, and the majority of the executed code is floating point arithmetic. I did not try to optimize the C++ code, because I wanted to see if the compiler was able to perform some SSE optimization for me. Here are the execution times:

13941 ms – CPU single precision (x87)
13941 ms – CPU double precision (x87)
10535 ms – CPU single precision (SSE)
11367 ms – CPU double precision (SSE)
424 ms – GPU single precision

(These numbers come with some caveats – I did perform the tests multiple times and discarded the first few runs, but the CPU code was only single-threaded, so I assumed the numbers would scale perfectly and divided the execution times by four. Also, I verified by checking the generated assembly code that SSE instructions were indeed used for the core Mandelbrot loop when they were enabled.)

There are a couple of things to notice here: first, there is no difference between single and double precision on the CPU. This is to be expected for the x87-compiled code (since the x87 defaults to 80-bit precision anyway), but for the SSE version we would expect a doubling in speed. As can be seen, the SSE code is really not much more efficient than the x87 code – which strongly suggests that the compiler (here Visual Studio C++ 2008) is not very good at optimizing for SSE.

So for this example we got a factor of 25x speedup by using the GPU instead of the CPU.
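For reference, the GPU side of such a benchmark is essentially just the classic Mandelbrot iteration in a fragment shader. A minimal sketch could look like the following (the variable names and iteration count are hypothetical, not my actual benchmark code):

vec2 c = coord;       // the point in the complex plane corresponding to this pixel
vec2 z = vec2(0.0);
int i;
for (i = 0; i < 1000; i++) {
	// z = z*z + c for complex z: (x+iy)^2 = (x^2 - y^2) + (2xy)i
	z = vec2(z.x*z.x - z.y*z.y, 2.0*z.x*z.y) + c;
	if (dot(z, z) > 4.0) break;   // |z| > 2: the point escapes
}
// color the pixel by the iteration count i

The work per pixel is dominated by the few multiplications and additions in the loop body, which is what makes it feasible to count the floating point operations from the disassembly, as described below.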

“Measured” GFLOPS

Another question is how this example compares to the theoretical peak performance. By using Nvidia’s Cg SDK I was able to get the GPU assembly code. Since I could then count the number of instructions in the main loop, and I knew how many iterations were performed, I was able to calculate the actual number of floating point operations per second:

GPU: 211 (Mandel)GFLOPS
CPU: 8.4 (Mandel)GFLOPS*

(*The CPU number was obtained by assuming the number of instructions in the core loop was the same as for the GPU: in reality, the CPU disassembly showed that the simple loop was partially unrolled to more than 200 lines of very complex assembly code.)

Compared to the theoretical maximum numbers of 504 GFLOPS and 74.6 GFLOPS respectively, this shows the GPU is much closer to its theoretical limit than the CPU.

GPU Caps Viewer – OpenCL

A second test was performed using the GPU Caps Viewer. This particular application includes a 4D Quaternion Julia fractal demo in OpenCL. This is interesting since OpenCL is a heterogeneous platform – it can be compiled to both CPU and GPU. And since Intel just released an alpha version of their OpenCL SDK, I could compare it to Nvidia’s SDK.

The results were interesting:

Intel OpenCL: ~5 fps
Nvidia OpenCL: ~35 fps

(The FPS varied throughout the animation, so these numbers are not very accurate. There was no dedicated benchmark mode.)

This suggests that Intel’s OpenCL compiler is actually able to take advantage of the SSE instructions and produces highly optimized output. Either that, or Nvidia’s OpenCL implementation is not very efficient (which is not likely).

The OpenCL-based benchmark showed my GPU to be approximately 7x faster than my CPU – exactly what was predicted by comparing the theoretical GFLOPS values (for single precision).

Conclusion

For typical code written in a high-level language like C or GLSL (multithreaded, single precision, and without explicit SSE instructions), the relative computational power is roughly proportional to the number of cores or shader units. For my system this makes the GPU a factor of 25x faster.

Even though the CPU cores have a higher operating frequency and in principle could execute more instructions via their SSE registers, this does not seem to be fully utilized (and in fact, compiling with and without SSE optimization did not make a significant difference, even for this very simple example).

The OpenCL example tells another story: here the measured performance was proportional to the theoretical GFLOPS ratings. This is interesting, since it indicates that OpenCL could also be interesting for CPU applications.

One thing to bear in mind is that the examples tested here (the Mandelbrot and the 4D Quaternion Julia) are very well suited for GPU execution. For more complex code, with conditional branching, double precision floating point operations, and non-coalesced memory access, the CPU is much more efficient than the GPU. So for a desktop computer such as mine, a factor of 25x is probably the best you can hope for (and it is indeed a very impressive speedup for any kind of code).

It is also important to remember that GPUs are not magical devices. They perform operations with a theoretical peak performance typically 5-15 times larger than a CPU. So whenever you see claims of 1000x speedups (e.g. in some of the CUDA showcases), it is probably just an indication of a poor CPU implementation.

But even though the performance of GPUs may be somewhat exaggerated, you can still get a tremendous speedup. And GPU interfaces such as GLSL shaders are really simple to use: you do not need to deal explicitly with threads, you have built-in vectors and matrices, and you can compile GLSL code dynamically at run-time. All features that make GPU programming nearly ideal for exploring pixel graphics systems.

Fragmentarium – an IDE for exploring fractals and generative systems on the GPU.

As I mentioned in my previous post, I started experimenting with GLSL raytracing a couple of months ago, and I’m now ready to release the first version of Fragmentarium, an open source, cross-platform IDE for exploring pixel based graphics on the GPU.

It was mainly created for exploring Distance Estimated systems, such as Mandelbulbs or Kaleidoscopic IFS, but it can also be used for 2D systems.

Fragmentarium is inspired by Adobe’s Pixel Bender, but uses pure GLSL, and is specifically created with fractals and generative systems in mind. Besides Pixel Bender, there are also other, more specialized, GPU fractal applications out there, such as Boxplorer and Tacitus, but I wanted something code-centric, where I can quickly modify code and use it in a more modular manner.

Features:

  • Multi-tabbed IDE, with GLSL syntax highlighting
  • Modular GLSL programming – include other fragments
  • User widgets to manipulate parameter settings.
  • Different ‘mouse to GLSL’ mapping schemes (2D and 3D)
  • Includes raytracer for distance estimated systems
  • Many examples including Mandelbulb, Mandelbox, Kaleidoscopic IFS, and Julia Quaternion

Fragmentarium can be downloaded from:
http://syntopia.github.com/Fragmentarium/

There are binaries for Windows, but for now you’ll have to build it yourself for Mac and Linux. You will need a graphics card capable of running GLSL (any reasonably modern discrete card will do).

Here is a screenshot:


Fragmentarium screenshot (click to enlarge).

There’s also a gallery at Flickr: Fragmentarium Group

Fragmentarium is not a mature application yet. Especially the camera handling needs some work in the next versions – camera settings are not saved as part of the parameters, there is no field-of-view control, and you often have to compensate for clipping. For future versions I also plan arbitrary resolution renders (tile based rendering) and animations.

There are probably also many small quirks and bugs – I’ve had several problems with ATI drivers, which seem to be much stricter than Nvidia’s.

Creating a Raytracer for Structure Synth (Part III)

This post discusses some of the things I realised while coding the raytracer in Structure Synth and some thoughts about GPU raytracing.

Luxion

Back in October, I saw a presentation by Henrik Wann Jensen (known for his work on photon mapping and subsurface scattering) where he demonstrated the Keyshot renderer made by his company Luxion. It turned out to be a very impressive demonstration: the system was able to render very complex scenes (a car with a million polygons), with very complex materials (such as subsurface scattering) and very complex lighting, in real-time on a laptop.

After having downloaded a Keyshot renderer demo, I was able to figure out some of the tricks used to achieve this kind of performance.

  • Using panoramic HDRI maps as backgrounds: the use of a panoramic background image gives the illusion of a very complex scene at virtually no computational cost. The 3D objects in the scene even seem to cast shadows on this background image: something which can be done by adding an invisible ground floor. In most cases this simple hack works out quite nicely.
  • Image Based Lighting: by using the panoramic HDRI maps to calculate the lighting as well, you achieve very complex and soft lighting, which of course matches the panoramic background. And you don’t have to set up complicated area lights or anything like that.
  • Progressive rendering: images in Keyshot may take minutes to converge, but you do not notice, because Keyshot renders progressively: you can immediately adjust your settings without having to wait for the system. This means the perceived rendering speed is much higher than in classic raytracers (such as POV-Ray).


The above image is a Structure Synth model exported in OBJ format, and rendered in Keyshot.

Some of these ideas should be relatively easy to implement. Especially the idea of using panoramic maps as light sources is clever – setting up individual light sources in Structure Synth programmatically would be tedious, and the complex lighting from a panoramic scene is very convincing (at least in Keyshot).

Henrik Wann Jensen also made some interesting comments about GPU computing: essentially he was not very impressed, and seemed convinced that similar performance could be obtained on CPUs, if utilizing the resources properly.

Of course, Wann Jensen may be somewhat biased here: after all, Luxion’s biggest competitor (and former partner), Bunkspeed, uses a hybrid GPU/CPU approach in their Bunkspeed Shot raytracer. The raytracing technology used there is a licensed version of Mental Images’ iray engine.

Having tried both raytracers, I must say Luxion’s pure CPU technique was by far the most impressive, but as I will discuss later, GPU raytracing may have other advantages. (And it is really difficult to understand why it should not be possible to benefit from using both CPU and GPU together).

If you haven’t tried Keyshot, it is definitely worth it: I never really liked integrated raytracing environments (such as Blender), because it is almost always so difficult to set up things like light sources and materials, but Keyshot is certainly a step forward in usability.

Physically Based Rendering

As I described in my previous post, I made many mistakes during the development of the raytracer in Structure Synth. And all these mistakes could have been avoided if I had started out by reading ‘Physically Based Rendering’ by Matt Pharr and Greg Humphreys. But then again – part of the fun is trying to figure out new ways to do things, instead of just copying other people.

I have only read fragments of it so far, but I can still wholeheartedly recommend it: even though I’m a physicist myself, I was surprised to see just how far you can get using physical principles to build a complete raytracing framework. It is a very impressive and consistent approach to introducing raytracing, and the book covers nearly all aspects of raytracing I could think of – except one minor thing: I would have loved to see a chapter on GPU raytracing, especially since both authors seem to have extensive experience with this topic.


Microcity.pbrt – modeled by Mark Schafer

But I was flattered to see an image created with Structure Synth in the book – and their pbrt raytracer even has integration with Structure Synth!

GPU Based Rendering

Now, for the past months I’ve made a few experiments with GPU based rendering and generative systems using GLSL. And even though Henrik Wann Jensen may be right, and the performance gain of GPUs may be somewhat exaggerated, they do have some advantages which are often ignored:

  • GPU programming is easy. If you want to achieve high performance on a CPU, you need to write complex multi-threaded code and do tricky low-level optimizations, such as inlined SSE instructions and custom memory management. On the other hand, GLSL is very easy to write and understand and even comes with built-in support for matrices and vectors. Well, actually GPU programming can also be terribly hard: if you need to share memory between threads, or move data between the GPU and CPU, things can get very ugly (and you will need a more powerful API than GLSL). But for ‘embarrassingly parallel’ problems – for instance, where each pixel can be calculated independently, with simple, unshared memory access – GPU programming can be just as easy as, for instance, Processing.
  • Code can be compiled and uploaded to the GPU dynamically. In C++ or Java, making a small change to one part of a program requires recompiling and restarting the program. But with GLSL (and all other GPU APIs) you get a fast compiler you can use at run-time to create highly efficient multi-threaded code.

The combination of being easy, fast, and dynamic makes GPU rendering wonderful for exploring many types of generative art systems. In fact, probably the biggest hurdle is setting up the boilerplate code required to start a GLSL program. And even though there are many nice tools out there (e.g. Pixel Bender), none of them are really suited for generative systems.

So, for the past month I’ve been working a bit on an integrated environment for exploring GPU accelerated pixel based systems, called Fragmentarium. It is still quite immature, but I plan to release a version (with binary builds) later this January.


Fragmentarium example

This also means that I will use most of my spare time on this project the coming months, but some of the ideas will probably find use in Structure Synth as well.

Creating a Raytracer for Structure Synth (Part II)

When I decided to implement the raytracer in Structure Synth, I figured it would be an easy task – after all, it should be quite simple to trace rays from a camera and check if they intersect the geometry in the scene.

And it turned out that it actually is quite simple – but it did not produce very convincing pictures. The Phong-based lighting and hard shadows are really not much better than what you can achieve in OpenGL (although the spheres are rounder). So I realized that what I wanted was some softer qualities in the images. In particular, I have always liked the Ambient Occlusion and Depth-of-field in Sunflow. One way to achieve this is by shooting a lot of rays for each pixel (so-called distributed raytracing). But this is obviously slow.

So I decided to try choosing a smaller subset of samples for estimating the ambient occlusion, and then do some intelligent interpolation between these points in screen space. The way I did this was to create several screen buffers (depth, object hit, normal) and then sample in regions with high variation in these buffers (for instance at every object boundary). Then followed the non-trivial task of interpolating between the sampled pixels (which were not uniformly distributed). I had an idea that I could solve this by relaxation (essentially iterative smoothing of the AO screen buffer, while keeping the chosen samples fixed) – the same way the Laplace equation can be solved numerically.
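A sketch of what a single relaxation pass could look like as a fragment shader (the buffer names are hypothetical, and the actual Structure Synth code runs on the CPU):

uniform sampler2D aoBuffer;    // the sparse AO samples
uniform sampler2D maskBuffer;  // 1.0 where a real AO sample was taken
uniform vec2 pixelSize;        // 1.0 / resolution

void main() {
	vec2 uv = gl_FragCoord.xy * pixelSize;
	if (texture2D(maskBuffer, uv).x > 0.5) {
		// Keep the chosen samples fixed...
		gl_FragColor = texture2D(aoBuffer, uv);
	} else {
		// ...and replace every other pixel by the average of its four neighbours
		// (one Jacobi iteration; repeating the pass smooths the buffer towards a solution).
		gl_FragColor = 0.25 * (
			texture2D(aoBuffer, uv + vec2(pixelSize.x, 0.0)) +
			texture2D(aoBuffer, uv - vec2(pixelSize.x, 0.0)) +
			texture2D(aoBuffer, uv + vec2(0.0, pixelSize.y)) +
			texture2D(aoBuffer, uv - vec2(0.0, pixelSize.y)));
	}
}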

While this worked, it had a number of drawbacks: choosing the condition for where to sample was tricky, the smoothing required many steps to converge, and the approach could not easily be multi-threaded. But the worst problem was that it was difficult to combine with other effects, such as anti-aliasing and depth-of-field calculations, so artifacts would show up in the final image.

I also played around with screen-based depth-of-field. Again, I thought it would be easy to apply a Gaussian blur based on the z-buffer depth (of course you have to prevent background objects from blurring the foreground, which complicates things a bit). But once again it turned out that creating a Gaussian filter for each particular depth actually gets quite slow. Of course you can bin the depths and reuse the Gaussian filters from a cache, but this approach got complicated, and the images still displayed artifacts. And a screen-based method will always have limitations: for instance, the blur from an object hidden behind another object will never be visible, because the hidden object is not part of the screen buffers.

So in the end, I ended up discarding all the hacks, and settled for the much more satisfying solution of simply using a lot of rays for each pixel.

This may sound very slow: after all, you need multiple rays for anti-aliasing, multiple rays for depth-of-field, multiple rays for ambient occlusion, for reflections, and so forth, which means you might end up with a combinatorial explosion of rays per pixel. But in practice there is a nice shortcut: instead of trying all combinations, just choose some random samples from all the possible combinations.

This works remarkably well. You can simulate all these complex phenomena with a reasonable number of rays. And you can use more clever sampling strategies to reduce the noise (I use stratified sampling in Structure Synth). The only drawback is that you need a bit of book-keeping to prepare your stratified samples (between threads) and to ensure you don’t get coherence between the different dimensions you sample.
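The idea can be sketched as follows (all helper names are hypothetical, not the actual Structure Synth implementation): instead of nested loops over anti-aliasing, depth-of-field, and ambient occlusion samples, a single loop of N rays draws one sample from each dimension:

vec3 color = vec3(0.0);
for (int i = 0; i < N; i++) {
	vec2 aaOffset   = randomOffsetInPixel(i);        // anti-aliasing sample (hypothetical helper)
	vec2 lensOffset = randomOffsetOnLens(i);         // depth-of-field sample (hypothetical helper)
	vec3 aoDir      = randomHemisphereDirection(i);  // ambient occlusion sample (hypothetical helper)
	color += traceRay(pixelCenter + aaOffset, lensOffset, aoDir);
}
color /= float(N);

With stratified sampling, each helper draws from its own shuffled strata, so the N samples cover every dimension evenly without the dimensions becoming correlated.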

Another issue was how to accelerate the ray-object intersections. This is a crucial part of all raytracers: if you need to check your rays against every single object in the scene, the renders will be extremely slow – the rendering time will be proportional to the number of objects. On the other hand spatial acceleration structures are often able to render a scene in a time proportional to the logarithm of the number of objects.

For the raytracer in Structure Synth I chose to use a uniform grid (also known as voxel stepping). This turned out to be a very bad choice. The uniform grid works very well when the geometry is evenly distributed in the scene. But in recursive systems, objects often appear at very different scales, making the cells in the grid very unevenly populated.

Another example of this is that I often include a ground plane in my Structure Synth scenes (by using a flat box, such as “{ s 1000 1000 0.1 } box”). But this completely kills the performance of the uniform grid – most objects end up in the same cell, and the acceleration structure becomes useless. So in general, for generative systems with geometry at many different scales, a uniform grid is a bad choice.

Now that is a lot of stuff that didn’t work out well. So what is working?

As of now, the raytracer in Structure Synth provides a nice foundation for things to come. I’ve gotten the multi-threaded part set up correctly, which includes a system for coordinating stratified samples. Each thread has its own (partial) screen space buffer, which means I can do progressive rendering. This also makes it possible to implement more complex filtering (where the filtered samples may contribute to more than one pixel – in which case the raytracer is no longer embarrassingly parallel).

What is missing?

Materials. As of now there is only very limited control over materials. And things like transparency don’t work very well.

Filtering. As I mentioned above, the multi-threaded renderer supports working with filters, but I haven’t included any filters in the latest release. My first experiments (with a Gaussian filter) were not particularly successful.

Lighting. As of now the only option is a single, white, point-like light source casting hard shadows. This rarely produces nice pictures.

In the next post I’ll talk a bit about what I’m planning for future versions of the raytracer.