Icon Earthbound

 

Icon Turing Drawings

 

Icon Lost Child Announcement

 

Icon Shelter

 

Icon Data Science, how hard can it be?

 

Icon Denki Furo

 

Icon In Defence of the Unitype

 

Icon Maya Velocity Node

 

Icon Sandy Denny

 

Icon What type of Machine is the C Preprocessor?

 

Icon Which AI is more human?

 

Icon Gone Home

 

Icon Thoughts on Japan

 

Icon Can Computers Think?

 

Icon Counting Sheep & Infinity

 

Icon How Nature Builds Computers

 

Icon Painkillers

 

Icon Correct Box Sphere Intersection

 

Icon Avoiding Shader Conditionals

 

Icon Writing Portable OpenGL

 

Icon The Only Cable Car in Ireland

 

Icon Is the C Preprocessor Turing Complete?

 

Icon The aesthetics of code

 

Icon Issues with SDL on iOS and Android

 

Icon How I learned to stop worrying and love statistics

 

Icon PyMark

 

Icon AutoC Tools

 

Icon Scripting xNormal with Python

 

Icon Six Myths About Ray Tracing

 

Icon The Web Giants Will Fall

 

Icon PyAutoC

 

Icon The Pirate Song

 

Icon Dear Esther

 

Icon Unsharp Anti Aliasing

 

Icon The First Boy

 

Icon Parallel programming isn't hard, optimisation is.

 

Icon Skyrim

 

Icon Recognizing a language is solving a problem

 

Icon Could an animal learn to program?

 

Icon RAGE

 

Icon Pure Depth SSAO

 

Icon Synchronized in Python

 

Icon 3d Printing

 

Icon Real Time Graphics is Virtual Reality

 

Icon Painting Style Renderer

 

Icon A very hard problem

 

Icon Indie Development vs Modding

 

Icon Corange

 

Icon 3ds Max PLY Exporter

 

Icon A Case for the Technical Artist

 

Icon Enums

 

Icon Scorpions have won evolution

 

Icon Dirt and Ashes

 

Icon Lazy Python

 

Icon Subdivision Modelling

 

Icon The Owl

 

Icon Mouse Traps

 

Icon Updated Art Reel

 

Icon Tech Reel

 

Icon Graphics Aren't the Enemy

 

Icon On Being A Games Artist

 

Icon The Bluebird

 

Icon Everything2

 

Icon Duck Engine

 

Icon Boarding Preview

 

Icon Sailing Preview

 

Icon Exodus Village Flyover

 

Icon Art Reel

 

Icon LOL I DREW THIS DRAGON

 

Icon One Cat Just Leads To Another

Parallel programming isn't hard, optimisation is.

Created on Jan. 17, 2012, 2:31 p.m.


I'm standing in the living room and I ask my flatmates if any of them want to go to the pub. I then count the positive replies. I've just computed something in parallel. An exact instance of map-reduce. And it comes as naturally (if not more naturally) to me as the sequential approach.

We have the human intuition and analogy. Almost all parallel problems can be expressed as "Don't touch that while I'm doing something with it", or "Everyone do something and then let me gather the results". The only remaining hard part is the locality information. For example a particle system where a single particle is only affected by the nearest ten other particles. And then, most locality instances can be expressed simply by how the data is partitioned in map reduce.

Parallel concepts don't need to be hard, but the great contradiction is that whenever we are concerned with making processing parallel, we are also concerned about performance - and reasoning about the performance of parallel computation is hard. Any kind of sequential assertions you make can be thrown out the window. Suddenly the hardware and cache become a major issue, and what was before a relatively small search space, comprising of special data structures, C tricks and inline assembly, has become this huge hulking monster of possibilities.

I've been working on using OpenCL to do non-conventional rendering techniques and my exploration has lead me toward needing to implement an instance of the Marching Cubes algorithm. Something I've done before. What has me tied up this time is the vast number of possible parallel approaches at arms reach. I can at least vaguely reason about how difficult the various approaches are to implement, but when it comes to predicting the expense of them I'm somewhat in the dark. Big O notation can't help me here.

To give you an idea, one approach is a to fit marching cubes into a map reduce pattern - which is temping for it's conceptual simplicity but unfortunately reduction can't be completely automated in OpenCL and would require a variable number of reduction passes depending on the data size. It ends up more complicated than it seems. Another option is to essentially put a lock around two buffers and fill them up as workers finish, but I have no idea how much congestion this would cause. I could even do most of the computation in a very fast, single dispatch, but then I would need a final clean-up pass which would have to run sequentially and may end up very expensive. None of the approaches are really trivial enough to just mock up and test out, and I can't find a whole bunch of testimonies either.

Until programmers can begin to reason about the expense of parallel programming without resorting to benchmarks, it is always going to be somewhat of a struggle for them to adopt it. This is something that hardware developers, language developers and academia all need to work on together.

For a great lecture on the subject and fascinating delve into the hardware take a look at this by Bill Dally.

github twitter rss