2021S342295: Assignment 2 (May 25)

JavaScript is disabled in your browser.
Many features of Moodle will be not usable or will appear to be broken.
Please enable JavaScript for the full Moodle experience.
Due: Tuesday, 25 May 2021, 11:59 PM
The purpose of this assignment is to help you understand some of the
most important issues in low-level shared-memory parallel programming.

The example simulates partitioning jobs and counting these jobs
as operations.  The provided version has several flaws which you
are to supposed to fix.  There is a corresponding chapter in the
'perfbook' (by Paul McKenney) and it also relates to the datarace
screencast I published.

First, please try to get hold of a desktop, laptop or small server with
Linux, Unix, or Ubuntu and a modern 'gcc' installed (Ubuntu on Windows
is also possible).

Then unpack the provided zip file and run

  ./configure.sh && make && ./bogus 1 9 # optimized code

which should compile the existing 'bogus' version of this exercise,
run it, test that it works and print timing information. Then try:

  ./configure.sh -g && make && ./bogus 1 9 # non-optimized code

Please, prepare a report answering the following questions.  The report
should be submitted together with the actual source code in a zip file.
The zip file should also contain the output generated by your programs.
You are supposed to work on your own.  Everybody has to hand in her
or his own version and might be ased to present it.

The assignment has 7 questions (percentage achievable points in brackets).


[1] Set-Up [10%]
----------------

Describe your system including the processor, the number of cores,
the operating system version and compiler version.


[2] Analysis of 'bogus' [10%]
-----------------------------

What running times do you get for both version?

  ./configure.sh && make && ./bogus 1 9
  ./configure.sh -g && make && ./bogus 1 9

Now try the same with two threads:

  ./configure.sh && make && ./bogus 2 9
  ./configure.sh -g && make && ./bogus 2 9

Do the times decrease/increase?  Do you get correct or incorrect counting?

What are the three data races in this 'bogus' program.


[3] Fix two of the three data-races by using volatile [10%]
-----------------------------------------------------------

Two of the data races can be fixed by telling the compiler not to assume
anything about accessing memory and in particular not optimizing away
the access (reading and writing a variable or pointer).  Use the 'volatile'
keyword to achieve this or an 'ACCESS' macro (as in the 'perfbook').

  #define ACCESS(P) (* (volatile typeof (P) *) &(P))

The resulting program should be called 'incorrect.c' (since it still has
a data race). Just place it in the same directory.  The makefile should
automatically compile it too. Compare running time with those from '2' with:

  ./configure.sh && make && ./incorrect 2 9
  ./configure.sh && make && ./incorrect 1 9

Do you get speed-up?


[4] Fix the main data race through locking [10%]
------------------------------------------------

Use 'pthread_mutex_t' as in the demo for the data-race and call the resulting
program 'locked.c' and run the tests again.  This should give the correct result
for two threads.  How much slower are these new versions?  Run it also for
3 and 4 threads to a getter better picture about speed-up.


[5] Instead of locking use atomic increment [10%]
-------------------------------------------------

Call this program 'atomic.c'.  You probably want to use the now deprecated
'__sync...' GCC atomic functions instead of C11 atomics (or even C++11 code).
What are the running times (again for 1,2,3,4 worker threads)?


[6] Progress printing [30%]
---------------------------

Start an additional thread which sleeps for 1/100 of a second and then prints
the total accumulated counts in percent (use for instance 'usleep' for
sleeping). Use carriage return '\r' instead of '\n' to print the percentage
always at the start of the same line (or terminal codes).  Then add a
'fflush' call.  Devise a method to stop this thread after all workers have
finished their operations.  You can of course also do a real progress bar if
you fancy that.  This program is called 'progress.c'.  You can base this
part on any of the versions before ('bogus.c', 'incorrect.c', 'locked.c',
'atomic.c').  What is the overhead of printing?


[7] Thread local counting [20%]
-------------------------------

Move to a thread local counting scheme 'local.c' (start with 'incorrect.c'),
where each worker thread has its own result, which after a worker has
finished is added to the global result.  Adapt the progress printing thread
to peek into local thread counters and compute a global count result.  There
are many ways to do this.  Again see the corresponding chapter 5 in the
'perfbook'.  You can try all but one is enough (the simplest statistical
inexact counting version).  A simple solution is to add local counters
to the 'worker' structure.  But then be careful with 'false sharing', where
different threads write to the same cache line.  Accordingly experiment with
padding (to put those counters into different cache lines).
exercise2.zip
8 February 2021, 8:58 AM