

# PARALLEL COMPUTING Algorithms and Complexity



Univ.-Prof. Dr. Alois Zoitl LIT | Cyber-Physical Systems Lab Johannes Kepler University Linz













#### To whom honor is due....

These slides are based on a slide deck from

**Prof. Dr. Armin Biere** 

from whom I took over this lecture. He deserves thanks for his kind permission to use them.

#### **My Background – Embedded Real-time Computing**



Eclipse 4diac: https://www.eclipse.dev/4diac



#### **Synchronisation Penalty**

Fig. 10 of Implementing Constrained Cyber-Physical Systems with IEC 61499. Yoong, Roop, and Salcic, http://dx.doi.org/10.1145/2362336.2 362345



### Need for Parallelization: End of Moores Law on Single Core















#### **35 Years of Microprocessor Trend Data**



Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten Dotted line extrapolations by C. Moore

Source: Chuck Moore, Data Processing in Exascale-Class Computer Systems, 2011



#### **Playstation 3: Cell Processor**





Source: US Department of Defense

Source: Wipedia



#### **Overview Current CPU Architetures**

- Intel
  - ☐ i5: 6 P / 8 E
  - □ i7: 8 P / 12 E
  - ☐ i9: 8 P / 16 E
- AMD
  - ☐ Ryzen: 4 96
- Raspberry Pi
  - ☐ Since Mod 2: 4

- Apple
  - $\square$  M1: 4 16 P / 4 E
  - $\square$  M2: 4 16 P / 4 8 E
  - $\square$  M3: 4 12 P / 4 E
  - □ M4: 4 − 12 P / 4 − 10 E / 10 − 40 GPU
- Nvidia
  - ☐ Tesla: 128 18,176 Cuda cores

# Parallelizing Existing Algorithms















#### **Slow-Down in Parallel SAT**

- Parallel Multithreaded Satisfiability Solver:
  Design and Implementation.
  Yulik Feldman, Nachum Dershowitz, Ziyad Hanna http://dx.doi.org/10.1016/j.entcs.2004.10.020
- Paper is inconclusive about the reason for slowdown
- Probably more threads work on useless sub-tasks
- Sharing clauses caching sub-computation increases pressure on memory system
- Maybe search space splitting was not a good idea (guiding path)

 ${\it Table \ 2}$  Performance of SAT solver with different numbers of working threads

| Configuration | One | Two | Three | Four | Four:One |
|---------------|-----|-----|-------|------|----------|
| A             | 13  | 15  | 61    | 89   | 6.8      |
| В             | 20  | 21  | 42    | 47   | 2.4      |
| С             | 14  | 16  | 19    | 22   | 1.6      |
| D             | 13  | 15  | 14    | 15   | 1.2      |
| E             | 7   | 7   | 7     | 10   | 1.4      |
| F             | 8   | 20  | 27    | 53   | 6.6      |
| G             | 6   | 55  | 195   | 168  | 28.0     |
| Н             | 6   | 52  | 86    | 107  | 17.8     |

#### Low Speedup in Parallel SAT

- http://www.birs.ca/events/2014/5-day-workshops/14w5101/videos/watch/201401221154-Sabharwal.html slide 4 of (video 3:30)
- Sequential SAT algorithms produce proofs of large depth (= span)
- So need new algorithms which produce low depth proofs



#### **Limiting Factor Memory Access?**





#### **Memory System is Good Enough**

- Analysis of Portfolio-Style Parallel SAT Solving on Current Multi-Core Architectures. Martin Aigner, Armin Biere, Christoph Kirsch, Aina Niemetz, Mathias Preiner. http://fmv.jku.at/papers/AignerBiereKirsch NiemetzPreiner-POS13.pdf
- Largest speed-up obtained by portfolio approach
  - ☐ Run different search strategies in parallel
  - $\square$  If one terminates stop all
  - ☐ In practice share some important learned clauses caching sub-computations



Figure 8: Absolute runtime required for an increasing number of parallel jobs solving the narain-vpn-clauses-10 benchmark on the amd-opteron-2350-8vcores machine.

- Slow-down due to memory system?
  - ☐ Since memory system (memory / caches / bus) are shared in multi-core systems
  - ☐ Slow-down not too bad (particularly for solvers with small working set)
  - $\square$  Even though considered memory-bound (but random access)
  - ☐ Waiting time for memory to arrive overlaps



#### **Clever Splitting**

- Marijn Heule, Oliver Kullmann, Siert Wieringa, Armin Biere. Cube and Conquer: Guiding CDCL SAT Solvers by Lookaheads. Haifa Verification Conference 2011: 50-65, Springer 2012 http://dx.doi.org/10.1007/978-3-642-34188-5\_8
- Marijn J.H. Heule, Oliver Kullmann, and Victor Marek Solving and Verifying the boolean Pythagorean Triples problem via Cube-and-Conquer. SAT 2016, 196-211, Springer 2016 http://dx.doi.org/10.1007/978-3-319-40970-2\_15
- Everything is Bigger in Texas https://www.cs.utexas.edu/~marijn/ptn/ JKU CS Colloquium 22. June 2016



### **Theory on Parallelizabilty**















#### **Work and Span**







#### **Amdahl's Law with Work and Span**

T = work = sequential time  $T_p = wall-clock time p CPUs$   $T_{\infty} = wall-clock time \infty CPUs$ 

- Speedup:  $S_P = T/T_P$
- Span ... critical path (also called "makespan" in the context of scheduling)
- $\blacksquare$   $f \dots$  fraction of sequential work, thus

f = span / work

Simplified Amdahl's law in terms of work and span:  $S_p \le 1/f = \text{work /span}$ 

- Reduce *span* as much as possible:
  - ☐ keep sequential blocks short!

→ coarse grained locking is evil

□ keep sequential dependencies short!

→ (non-logarithmic) loops are evil

#### **Pebble Games**

- Given a directed acyclic graph with one sink.
- Nodes of the graph have a pebble or not.
- One step can either . . .
  - ... remove a pebble from a node ...
  - ... or add a new pebble to a node without one, ...
  - ... but only if all its predecessor have a pebble.
- Goal is to only have a pebble on the sink node.
- What is the smallest maximum number of pebbles needed?
- Common concept in complexity theory
  - ☐ Assuming intermediate results have to be stored
  - ☐ Relates to smallest p needed to reach maximum speed-up
  - ☐ This version (black pebble game) actually only gives space bounds



#### Sum

- Compute sum  $\sum_{i=1}^{n} x_i$  for *n* numbers xi in parallel
- Sequential

  - $\square$  work = T = O(n) (n 1 additions)
  - $\square$  span = O(n) too
  - $\Box$  Since y<sub>i+1</sub> depends on all previous y<sub>j</sub> with j ≤ i
  - $\square$  Thus no speed-up Sp = O(1)
- Parallel
  - ☐ Associativity allows to regroup computation
  - $\square$  Work = O(n) remains the same
  - $\square$  Span = O(log n) reduces exponentially
  - $\square$  Speed-up not ideal but  $S_n = O(n / \log n)$
  - $\square$  Note p > n does not make sense





#### Prefix / Scan

- Compute all sums  $s_j = \sum_{i=1}^{J} x_i$  for all  $j = 1 \dots n$  and again n numbers  $x_i$  in parallel
- Sequential version as in previous slide
- $\blacksquare$  Parallel version needs a second depth  $O(\log n)$  pass
- $\blacksquare$  Works even "in place" (first pass overwrites original  $x_i$ )
- But actual "wiring" complicated
- $\blacksquare$  Still span =  $O(\log n)$
- Basic algorithmic idea for many "parallel" algorithms
- Propagate and generate adders with prefix trees instead of ripple carry adder



#### Ripple-Carry-Adder

$$s_i = x_i \oplus y_i \oplus c_i$$
 sum  $c_{i+1} = x_i y_i + x_i c_i + y_i c_i$  carry

$$c_{0} = 0$$

$$s_{0} = x_{0} \oplus y_{0}$$

$$c_{1} = x_{0} y_{0}$$

$$s_{1} = x_{1} \oplus y_{1} \oplus c_{1}$$

$$c_{2} = x_{1} y_{1} + x_{1} c_{1} + y_{1} c_{1}$$

$$s_{2} = x_{2} \oplus y_{2} \oplus c_{2}$$

$$c_{3} = x_{2} y_{2} + x_{2} c_{2} + y_{2} c_{2}$$

$$s_{3} = x_{3} \oplus y_{3} \oplus c_{3}$$

$$c_{4} = x_{3} y_{3} + x_{3} c_{3} + y_{3} c_{3}$$

$$s_{4} = x_{4} \oplus y_{4} \oplus c_{4}$$

$$c_{5} = x_{4} y_{4} + x_{4} c_{4} + y_{4} c_{4}$$

$$s_{5} = x_{5} \oplus y_{5} \oplus c_{5}$$

$$c_{6} = x_{5} y_{5} + x_{5} c_{5} + y_{5} c_{5}$$

$$s_{6} = x_{6} \oplus y_{6} \oplus c_{6}$$

$$c_{7} = x_{6} y_{6} + x_{6} c_{6} + y_{6} c_{6}$$

$$s_{7} = x_{7} \oplus y_{7} \oplus c_{7}$$

$$c_{8} = x_{7} y_{7} + x_{7} c_{7} + y_{7} c_{7}$$

$$work = \mathcal{O}(n)$$
  $span = \mathcal{O}(n)$ 





#### Propagate-and-Generate Adder / Lookahead Adder

```
x_i + y_i
                         propagate
                         generate
     = x_i y_i
c_{i+1} = g_i + p_i c_i new carry computation formula
c_0
c_1
                                                                  g_1 +
c_2
                                                                                p_1 g_0
                                                g_2 + p_2 g_1 + p_2 p_1 g_0
c_3
                            g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0
                   g_4 + p_4 g_3 + p_4 p_3 g_2 + p_4 p_3 p_2 g_1 + p_4 p_3 p_2 p_1 g_0
c_5
     = g_5 + p_5 g_4 + p_5 p_4 g_3 + p_5 p_4 p_3 g_2 + p_5 p_4 p_3 p_2 g_1 + p_5 p_4 p_3 p_2 p_1 g_0
     = g_6 + \dots
                                                       \dots p_6 p_5 p_4 p_3 p_2 p_1 g_0
     = g_7 + \dots
                                                    + \qquad \qquad p_7 \, p_6 \, p_5 \, p_4 \, p_3 \, p_2 \, p_1 \, g_0
work = \mathcal{O}(n^2) span = \mathcal{O}(\log n) assuming n-ary gates otherwise work = \mathcal{O}(n^3)
```

#### **Carry-Look-Ahead Adder**





# Wallace-Tree Multiplier with Final Stage Carry-Look-Ahead Adder



$$work = \mathcal{O}(n^2)$$
  $span = \mathcal{O}(\log n)$ 



#### **List Ranking / Pointer Jumping**



determine distance to head of list:

as long there is i with  $next[i] \neq \bot$ : val[i] += val[next[i]] next[i] = next[next[i]]

#### **Sorting Networks**

- Circuits for sorting fixed number n of inputs  $\Box$  Basic "gate" compare-and-swap: Cmpswap(x, y) := (min(x, y), max(x, y))  $\Box$  Interesting challenge to get smallest sorting network

  for n = 11 size only known to be between 33 and 35 compare-and-swap operations
- Zero-one principle
  - ☐ Correctness of sorting network (it sorts!) . . . . . only requires sorting 0 and 1 inputs (bits) . . . . . . as long only compare-and-swap is used.
- Asymptotic complexity of algorithms
  - ☐ Examples: Bitonic Sorting, Batcher Odd-Even Merge Sort
  - $\square$  with span =  $O(\log^2 n)$
  - $\square$  with work =  $O(n \cdot log^2 n) = T1$
  - $\square$  but sequential time  $T = O(n \cdot \log n)$
  - $\square$  maximum absolute speed-up  $S_n = O(n/\log n)$



#### **Bubble Sort Example**



- $\blacksquare$  Top-most *i* sorted after *i* phases
- Lowest value only sorted after n-1 compare-and-swaps
- $\blacksquare$  span = O(n)
- Looks like perfect speedup  $S_n = O(n)$  w.r.t. (bad) sequential algorithm
- However, if we compare against Quicksort  $T = O(n \cdot log n)$  we only get

$$S_n = \mathcal{O}(\frac{n \cdot \log n}{n}) = \mathcal{O}(\log n) < \mathcal{O}(n/\log n)$$

#### **Batcher Odd-Even Merge Sort**



- Basically as merge sort
  - $\square$  Split input into two parts . . .
    - ... sort parts recursively ...
    - ... merge sorted sequence.
- **Example**: recursion for n = 8
  - ☐ Outer block takes two sorted sequences of size 4 each
  - ☐ Each inner block takes two sorted sequences of size 2 each
  - ☐ Outer input sequences need to be sorted too

#### **Batcher Odd-Even Merge Sort**



#### NC - Nick's Class

- $\blacksquare$  f (n) polylogarithmic iff exists constant c such that  $f(n) = O(\log^c n)$
- NC is set of decision problems . . .
  - ... which can be decided in polylogarithmic time ...
  - ... on a parallel computer with polynomial many processors, i.e., ...
  - ... exists constant c such that  $p = O(n^k)$ .
- NC $^{c}$  requires (parallel) computation time (span) in  $O(log^{c} n)$
- $\blacksquare NC = \cup NC^c$

#### L, NL, AC

- L is set of decision problems solvable in logarithmic space deterministically
- **NL** is set of decision problems with logarithmic space non-deterministically
- $\blacksquare$  NC = AC is the set of decision problems with logarithmic circuit complexity, i.e., . . .
  - ... each input of size n can be decided by polynomial circuit with logarithmic depth in n, ...
  - ... made of gates with bounded (NC) or unbounded (AC) number of inputs
- $\blacksquare$  As before define NC<sup>c</sup> and AC<sup>c</sup> requiring O(log<sup>c</sup> n) depth (layers)



#### **P** Completeness

$$NC^1 \subseteq L \subseteq NL \subseteq AC^1 \subseteq NC^2 \subseteq AC^2 \subseteq NC^3 \subseteq \cdots \subseteq NC = AC \subseteq P$$

- Using "logarithmic" reductions
- It is commonly believed that NC ≠ P
- Accordingly P-hard problems are supposed to be NOT "parallelizable"
- Similar to the common belief that  $P \neq NP$

#### **Circuit Evaluation Problem**

- Given a boolean circuit with one output, and an evaluation to its inputs.
- Evaluate the circuit and determine its output value for that input assignment.
- This problem (deciding whether output yields one) is P-complete . . .
  - . . . and thus considered not to be parallelizable.
- Thus evaluating a function can not be done "effectively" in parallel.
- One step of simulation or constraint propagation are not parallelizable! (?)

### **Parallel Design Patterns**















# Guidelines and Methodologies for Implementing Parallel Programs







# **Examples from Real-Time Domain**















#### **Signal Processing Pipeline**



Frieb, M., Jahr, R., Ozaktas, H. et al. A Parallelization Approach for Hard Real-Time Systems and Its Application on Two Industrial Programs. Int J Parallel Prog 44, 1296–1336 (2016).

https://doi.org/10.1007/s10766-016-0432-7

#### **Control Program of BAUER MC 128**







Frieb, M., Jahr, R., Ozaktas, H. et al. A Parallelization Approach for Hard Real-Time Systems and Its Application on Two Industrial Programs. Int J Parallel Prog 44, 1296–1336 (2016). https://doi.org/10.1007/s10766-016-0432-7



#### **Control Program of BAUER MC 128**



**Fig. 17** Evaluation results: static WCETs and WCET speedup, **a** WCET for 1, 4 and 8 cores. At the 4 core version, the WCET falls to around 40% of the sequential version, while it rises to a multiple at the 8 core version, **b** WCET Speedup for 1, 4 und 8 cores. At 4 cores, the speedup reaches around 2.4, while at 8 cores a slowdown to around 0.15 shows up

Frieb, M., Jahr, R., Ozaktas, H. et al. A Parallelization Approach for Hard Real-Time Systems and Its Application on Two Industrial Programs. Int J Parallel Prog 44, 1296–1336 (2016). https://doi.org/10.1007/s10766-016-0432-7



## Thank you!

Univ.-Prof. Dr. Alois Zoitl, alois.zoitl@jku.at LIT | Cyber-Physical Systems Lab Johannes Kepler University Linz

