Whose job is it: the hardware’s, software’s, or programmer’s?

Perhaps the biggest question in parallel computing for Big Data is, “Who’s responsible for the logical work to harness parallel architectures: the hardware, the compiler, or the programmer?”

Today I found an interesting lecture on MIT's OpenCourseWare titled "L3: Introduction to Parallel Architectures," given by Saman Amarasinghe as part of the "Multicore Programming Primer" IAP course. The lecture surveys high-level parallel architectures from the past 50 years of parallel computing and lends insight into this question.

The question comes down to the distinction between Implicit and Explicit Parallelism.

Implicit Parallelism is where compilers and hardware try to extract parallelism from the source program. Does an instruction Read After a Write? (Known as a RAW hazard, a true dependence that limits how much parallelism can be extracted.) Write After Read? Write After Write? There are a number of clever tricks used when analyzing code, such as "register renaming," where the hardware or compiler gives each write of the same variable its own storage location. That removes the false WAR and WAW dependences and the bottleneck they would otherwise create.
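Here is a minimal sketch of what that looks like at the source level (the variable names and arithmetic are made up for illustration): reusing one name serializes two otherwise independent computations, while "renaming" the second use leaves only the true RAW dependences.

```python
# A false (name) dependence: the same variable "t" is reused, so the
# second computation must wait for the first to finish using it.
def reuse_one_name(a, b, c, d):
    t = a + b      # write t
    x = t * 2      # read t  (true RAW dependence on the line above)
    t = c + d      # write t again: WAW with line 1, WAR with line 2
    y = t * 3      # read t
    return x, y

# "Renaming" gives each write its own storage, so the two chains
# (a + b -> x) and (c + d -> y) are independent and could run in parallel.
def renamed(a, b, c, d):
    t1 = a + b
    x = t1 * 2
    t2 = c + d
    y = t2 * 3
    return x, y
```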

Searching for a lack of control dependencies is another trick. For example, does a series of "if statements" require a particular ordering, or can you reshuffle them when there is no dependency between them? This allows a block of logic to be implicitly parallelized by a compiler that recognizes the control independence.
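A hypothetical illustration of control independence (function and variable names are invented): the two blocks below touch disjoint data, so they can be reordered or even dispatched side by side without changing the result.

```python
from concurrent.futures import ThreadPoolExecutor

def adjust_prices(prices, apply_discount):
    # Touches only `prices`
    if apply_discount:
        prices = [p * 0.9 for p in prices]
    return prices

def flag_outliers(volumes, threshold):
    # Touches only `volumes`
    if threshold is not None:
        return [v > threshold for v in volumes]
    return [False] * len(volumes)

# Neither block reads or writes the other's data, so their order is
# arbitrary; here they run concurrently.
with ThreadPoolExecutor() as pool:
    f1 = pool.submit(adjust_prices, [10.0, 20.0], True)
    f2 = pool.submit(flag_outliers, [5, 50, 500], 100)
    discounted, outliers = f1.result(), f2.result()
```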

"Speculation" is another approach: essentially the computer makes an educated guess, or even evaluates multiple branches of logic simultaneously, and then keeps the result of the branch that turns out to be taken. So, while the program may not be explicitly parallel, by doing several steps of extra work the computer avoids having to backtrack and explore the logical branches in sequence. This doesn't save on computation, but it does speed up the program.
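A toy sketch of the idea (real hardware does this at the instruction level, not with threads; the functions here are hypothetical): start both branch bodies before the condition is known, then keep only the result the condition selects.

```python
from concurrent.futures import ThreadPoolExecutor

def speculate(condition_fn, if_taken_fn, if_not_taken_fn):
    """Evaluate both branches eagerly; keep the one the condition selects.

    Only safe if the branch bodies have no side effects, since the losing
    branch's work is simply thrown away.
    """
    with ThreadPoolExecutor() as pool:
        taken = pool.submit(if_taken_fn)          # speculative work
        not_taken = pool.submit(if_not_taken_fn)  # speculative work
        cond = condition_fn()                     # may be slow to resolve
    # Extra computation was done, but no time was spent waiting to learn
    # which branch to start.
    return taken.result() if cond else not_taken.result()

result = speculate(
    condition_fn=lambda: sum(range(1_000_000)) % 2 == 0,
    if_taken_fn=lambda: "even path",
    if_not_taken_fn=lambda: "odd path",
)
```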

To be clear, a lot of energy and cycles are lost trying to uncover and execute implicit parallelism. The low-hanging fruit has been picked: even though transistor counts keep doubling, performance now increases only marginally (say, 5%). Compilers (which have more freedom to see the big picture of a program) and processors (which generally rely on speculation) both look to exploit implicit parallelism, but things are moving more and more toward explicit parallelism to avoid this waste.

Explicit Parallelism is where the programmer explicitly dictates the parallel logic. An important concept about the limits of parallelism is Little's Law: Parallelism = Throughput per cycle * Latency in cycles. In other words, to sustain a given throughput, the number of independent operations in flight must equal that throughput times the latency of each operation (how long it takes to complete the segment of code).
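A back-of-the-envelope check of the relation (the numbers below are invented for illustration):

```python
# Little's Law: parallelism (operations in flight) = throughput * latency
throughput = 4   # operations completed per cycle (hypothetical machine)
latency = 10     # cycles each operation takes from start to finish

parallelism_needed = throughput * latency
print(parallelism_needed)  # 40 independent operations must be in flight
# If the program only exposes, say, 8 independent operations, the machine
# idles and the achieved throughput drops to 8 / 10 = 0.8 per cycle.
```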

To better understand Explicit (and Implicit) Parallelism, four high-level categories of logical parallelism were presented: Pipelining, Data-Level Parallelism, Thread-Level Parallelism, and Instruction-Level Parallelism. Pipelining involves staggering the stages of a process, Data-Level Parallelism involves pushing large collections of similar objects through the same steps, Thread-Level Parallelism involves computing different but independent series of commands, and Instruction-Level Parallelism involves splitting a single series of commands into its independent and dependent parts. The screenshot provides a high-level conceptualization of these four categories.
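To make two of those categories concrete, here is a rough high-level sketch (the data and task names are hypothetical): data-level parallelism applies one operation across many elements, while thread-level parallelism runs different, independent streams of work at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

readings = [3.2, 4.8, 5.1, 2.7]

# Data-Level Parallelism: the same operation over many independent elements,
# so each element could be handled by a different lane or worker.
squared = [r * r for r in readings]   # conceptually one operation, many data

# Thread-Level Parallelism: different, independent series of commands.
def summarize(xs):   # task 1: compute a summary statistic
    return sum(xs) / len(xs)

def archive(xs):     # task 2: unrelated bookkeeping
    return sorted(xs)

with ThreadPoolExecutor() as pool:
    mean_future = pool.submit(summarize, readings)
    sorted_future = pool.submit(archive, readings)
    mean, archived = mean_future.result(), sorted_future.result()
```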

Old supercomputers like the Cray-1 weren't truly parallel in the modern sense, since they lacked the internal communication paths required for more complex parallel processing. They simply pipelined operations, much as industrial production uses batch processing to eliminate bottlenecks and realize economies of scale.
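In that spirit, here is a rough sketch of the pipelining idea (the stage functions are invented): each stage works on a different item at the same time, so once the pipeline fills, results emerge steadily even though every item passes through every stage.

```python
import queue
import threading

def stage(work_fn, inbox, outbox):
    # Each stage runs in its own thread, pulling items, transforming them,
    # and passing them downstream, like stations on an assembly line.
    while True:
        item = inbox.get()
        if item is None:          # sentinel: shut the stage down
            outbox.put(None)
            break
        outbox.put(work_fn(item))

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + 1, q_in, q_mid)).start()
threading.Thread(target=stage, args=(lambda x: x * 2, q_mid, q_out)).start()

for i in range(5):
    q_in.put(i)
q_in.put(None)

results = []
while (r := q_out.get()) is not None:
    results.append(r)
print(results)   # [2, 4, 6, 8, 10]: both stages were busy simultaneously
```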

So, what are the big takeaways? The "era of programmers not caring about what is under the hood is over." Sloppy code's poor performance was traditionally masked by the hardware simply getting faster, with no additional explicit input from the programmer. Now, thanks to developments in task-oriented optimization of hardware (like graphics chips), "understanding the hardware will make it easier to make programs get high performance." However, care needs to be taken to ensure that software remains general and portable (delivering similar performance on different machines).

So, what to do? Do we accept today's level of performance or learn the nasty guts of parallel programming? There is certainly a middle ground where we will end up, but it's still uncertain what that will look like.

My final question for the reader is: "Can Big Data algorithms, programs, and estimation be approximated, adjusted, or altered to ensure as much implicit parallelism as possible, or does explicit parallelism need to be enforced?" In the end, I think we'll find ourselves using all of these tools: adapting statistical algorithms to be as natively parallel as possible, explicitly coding the parallelism where we can, and finally relying on compilers and processors to work their additional magic, finding the low-level logical parallelism we overlooked and speculating on the fly with the available processing resources.

The challenges of parallel programming are as much about the hardware and software as they are about what exactly we want the program to do. Perhaps the easiest gains come from simply asking the computer to do less work by doing more of the right work. That burden will ultimately lie on the theory and algorithms implemented in the code.

How important is understanding parallel structures for Big Data business and economic researchers? Whose job is it? Will Moore’s Law win? What do you think?