Parallel programming isn't hard, optimisation is.
17/01/2012
I'm standing in the living room and I ask my flatmates if any of them want to go to the pub. I then count the positive replies. I've just computed something in parallel. An exact instance of map-reduce. And it comes as naturally (if not more naturally) to me as the sequential approach.
We have the human intuition and analogy. Almost all parallel problems can be expressed as "Don't touch that while I'm doing something with it", or "Everyone do something and then let me gather the results". The only remaining hard part is locality information, for example a particle system where a single particle is only affected by the nearest ten other particles. Even then, most instances of locality can be expressed simply by how the data is partitioned in map-reduce.
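To make the pub example concrete, here is a minimal sketch of "everyone do something and then let me gather the results". It is only an illustration: the flatmate names, their answers and the function names are all made up, and a thread pool stands in for people answering at the same time.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical flatmates and their answers; invented purely for illustration.
FLATMATES = {"Alice": True, "Bob": False, "Carol": True, "Dave": True}

def wants_pub(name):
    """Map step: ask one flatmate. Each question is independent of the others."""
    return 1 if FLATMATES[name] else 0

def count_pub_goers(names):
    """Ask everyone at once, then count the positive replies."""
    # Map: one worker thread per flatmate, so all the answers come in together.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        replies = list(pool.map(wants_pub, names))
    # Reduce: gather the replies into a single count.
    return reduce(lambda a, b: a + b, replies, 0)

if __name__ == "__main__":
    print(count_pub_goers(list(FLATMATES)))
```

Nothing in the map step touches shared state, which is why it needs no locking. Notice too that the code says nothing about whether this is faster than a plain loop, and for four flatmates it almost certainly isn't.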
Parallel concepts don't need to be hard, but the great contradiction is that whenever we are concerned with making processing parallel, we are also concerned about performance - and reasoning about the performance of parallel computation is hard. Any kind of assertion you could make about sequential code can be thrown out of the window. Suddenly the hardware and cache become a major issue, and what was previously a relatively small search space, comprising special data structures, C tricks and inline assembly, has become this huge hulking monster of possibilities.
I've been working on using OpenCL for non-conventional rendering techniques, and my exploration has led me toward needing to implement an instance of the Marching Cubes algorithm. Something I've done before. What has me tied up this time is the vast number of possible parallel approaches at arm's reach. I can at least vaguely reason about how difficult the various approaches are to implement, but when it comes to predicting the expense of them I'm somewhat in the dark. Big O notation can't help me here.
To give you an idea, one approach is to fit marching cubes into a map-reduce pattern - which is tempting for its conceptual simplicity, but unfortunately reduction can't be completely automated in OpenCL and would require a variable number of reduction passes depending on the data size. It ends up more complicated than it seems. Another option is to essentially put a lock around two buffers and fill them up as workers finish, but I have no idea how much congestion this would cause. I could even do most of the computation in a very fast, single dispatch, but then I would need a final clean-up pass which would have to run sequentially and may end up very expensive. None of the approaches is really trivial enough to just mock up and test out, and I can't find a whole bunch of testimonies either.
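To show why the number of reduction passes varies with the data size, here is a rough host-side sketch. It is plain Python standing in for OpenCL kernels, not real OpenCL code: each call to the hypothetical `reduce_pass` simulates one dispatch in which every work-group collapses its own chunk into a single partial result, and the group size of 4 is an arbitrary assumption.

```python
def reduce_pass(values, group_size):
    """One simulated dispatch: each work-group sums its own chunk to one partial result."""
    return [sum(values[i:i + group_size]) for i in range(0, len(values), group_size)]

def multi_pass_sum(values, group_size=4):
    """Repeat passes until one value remains; returns (total, number_of_passes)."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 or the data never shrinks")
    passes = 0
    while len(values) > 1:
        # Each pass shrinks the data by roughly a factor of group_size.
        values = reduce_pass(values, group_size)
        passes += 1
    return (values[0] if values else 0), passes

if __name__ == "__main__":
    for n in (16, 100, 1000):
        total, passes = multi_pass_sum(list(range(n)))
        print(n, total, passes)
```

The pass count grows with the logarithm of the input length (base `group_size`), so the host has to work out how many dispatches to issue for each data set. The sketch also leaves out everything that makes the real thing expensive or fiddly, such as synchronisation between passes and memory traffic.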
Until programmers can begin to reason about the expense of parallel programming without resorting to benchmarks, it is always going to be somewhat of a struggle for them to adopt it. This is something that hardware developers, language developers and academia all need to work on together.
For a great lecture on the subject and a fascinating delve into the hardware, take a look at this by Bill Dally.