# UBCO COSC370, BSPlib project: Week 4: Sorting

This week you will write you will write two sorting algorithms : radix and Tiskin-McColl sort. They solve slightly different problems but both are interesting in their own right.

My specifications are getting a bit fuzzier which each week. If you have any questions, do not hesitate to email me for more details.

## 1 Radix sort

**Note:** I will use the notation \(< b^0, b^2, \ldots, b^{p-1} >\) in
this document to describe distributed arrays where \(b^i\) is stored in
processor \(i\). For instance \(< [1,2,3], [4,5,6], [7,8,9] >\) is a
distributed array where \(p=3\) and processor 0 has the local block containing
\([1,2,3]\).

In this algorithm we sort by the key a set of \(N\) key-value pairs distributed over the processors. For the sake of simplicity, we let both keys and values be integers. The assumption behind radix-sort which greatly simplifies sorting is that (1) all keys are unique and (2) the value of the greatest key is known. This allows us to simply iterate over the input array and send each key-value pair to the corresponding correct placement in the sorted array.

We specify the algorithm thus:

**Input:**

- Keys \(K\): A distributed array of integers of length \(N\) divided in blocks of size \(m\) (each processor has \(m\) values) of keys \(K = < [ [k^1_1, ..., k^1_m], [k^2_1, ..., k^2_m], ..., [k^p_1, ..., k^p_m] ] >\) i.e. the key \(k^i_j\) is the \(j\):th key on processor \(i\). We will assume that the keys are the integers \([0,...,N-1]\) which simplifies the algorithm.
- Values \(V\): an array on the same form as \(K\), i.e.: A distributed array of integers of length \(N\) divided in blocks of size \(m\) of values \(V = < [v^1_1, ..., v^1_m], [v^2_1, ..., v^2_m], ..., [v^p_1, ..., v^p_m] >\) i.e. the value \(v^i_j\) is the \(j\):th value on processor \(i\). No other assumption is made on \(V\) which can be generated randomly.

**Output:**

- \(K'\) and \(V'\) such that (1) \(K'\) is sorted in increasing order and such that each value in \(V'\) is still associated with the same key.

Example (w/ \(p=4, N=8, m=2\)):

Input:

- \(K = < [3, 1], [7, 2], [5, 4], [6, 0] >\)
- \(V = < [9, 2], [4, 1], [2, 4], [0, 2] >\)

Output:

- \(K' = < [0, 1], [2, 3], [4, 5], [6, 7] >\)
- \(V' = < [2, 2], [1, 9], [4, 2], [0, 4] >\)

**Task 1:** Implement an algorithm satisfying this specification. Note
that each processor only have to iterate once over their input
arguments.

## 2 Tiskin-McColl sort

This sorting algorithm is more complicated than the previous but allows us to sort efficiently arrays without using the assumptions of the radix sort. That is, we do not have to assume a known greatest value of the values to sort.

**Prerequisites:**

- Doing sorting in C: https://www.tutorialspoint.com/c_standard_library/c_function_qsort.htm
- Merging lists and retaining sort efficiently: https://en.wikipedia.org/wiki/Merge_algorithm

**Specification:**

**Input :**

- \(V\): A distributed array of \(N\) integers in blocks of size \(m\) \(V = < [v^1_1, ..., v^1_m], [v^2_1, ..., v^2_m], ..., [v^p_1, ..., v^p_m] >\)

The array cannot contain any duplicates, i.e. all values must be unique.

**Output:**

- \(V'\) which is \(V\) sorted, i.e. \(V' = < [v'^1_1, ..., v'^1_m], [v'^2_1, ..., v'^2_m], ..., [v'^p_1, ..., v'^p_m] >\) such that for all \(k\) and \(i\), \(V'^{i}_{k} < V'^{i}_{k+1}\) and furthermore forall \(i, k\), \(V'^{i}_{k} < V'^{i+1}_{0}\).

**Algorithm:**

The algorithm proceeds in several phases which I describe briefly:

- Local sort: each processor sorts their local block.
- Local sample: each processor takes \(p+1\)
*local*samples in local block. This can be any \(p+1\) values from their local block. - Local sample exchange: Each processor sends their samples to each other processor.
- Local sample merge: Each process now has \(p*(p+1)\) local samples. Each process merges the lists of samples while retaining the sort, using the method linked in the prerequisites.
- Global sampling: Take \(p+1\) samples from the \(p*(p+1)\) samples in a
way so that all processors have the same samples. This is done
without requiring any communication by having e.g. processor \(0\)
take local sample \(0\), processor \(1\) takes sample \(p\), and
processor \(i\) takes sample \(i*p\). These samples are called
*global*separators. - Routing: In this stage, each processors sends the local data the processor where it will appear in the final result. The separators allows the processors to know where it must send each element. If the element is \(v\), such that \(s_i <= v < s_{i+1}\) then the data is sent to process \(i\). To do this efficiently, each processor creates \(p\) lists: one for each processor. It puts its local elements in the list corresponding to where it must be sent, while retaining the elements sorted. It then sends each list to the corresponding processor.
- Final sort: In this last step, each processor will have received \(p\) lists: one from each processor. Since all received lists are sorted, they can be efficiently merged while retaining the sort. The result will be sorted the distributed array.

**Example:**

An example executions with (\(p = 3\), \(N = 15\) and so \(m = 5\)):

Input: \[ V = < [27, 18, 0, 4, 11], [21, 19, 30, 16, 28], [8, 5, 22, 17, 9] > \]

- Local sort:

\[ < [ 0, 4, 11, 18, 27], [16, 19, 21, 28, 30], [5, 8, 9, 17, 22] > \]

- Each processors picks local samples (underlined):

\[ < [ \underline{0}, 4, \underline{11}, \underline{18}, \underline{27} ], [ \underline{16}, 19, \underline{21}, \underline{28}, \underline{30} ], [\underline{5}, 8, \underline{9}, \underline{17}, \underline{22} ] > \]

- Exchange local samples.
- Each processor now has \(p*(p+1)\) local samples, that is 12:

\[ LS = [ [ \underline{0}, \underline{11}, \underline{18}, \underline{27} ], [ \underline{16}, \underline{21}, \underline{28}, \underline{30} ], [\underline{5}, \underline{9}, \underline{17}, \underline{22} ] ] \] they merge them and obtain \[ LS = [0, 5, 9, 11, 16, 17, 18, 21, 22, 27, 28, 30] \]

- Now each processor picks the same global separators:

\[ LS = [ \underline{0}, 5, 9, 11, \underline{16}, 17, 18, 21, \underline{22}, 27, 28, \underline{30} ] \] and discards the remaining \[ LS = [ \underline{0}, \underline{16}, \underline{22}, \underline{30} ] \]

- First each processor partitions their local-block according to the global separators.

\[ PS = < [ [0, 4, 11], [ 16 ], [5, 8, 9] ], [ [ 18 ], [19, 21], [ 17]], [[27 ], [28, 30], [ 22 ]] > \] Meaning that, processor \(0\) has to send \([0, 4, 11]\) to processor (to itself), \([ 16 ]\) to processor \(1\) and \([5, 8, 9]\) to processor 2. Processor 1 has to send \([ 18 ]\) to 0 and \([19, 21]\) to processor 1, etc. Now these exchanges are performed.

- Now we will have:

\[ < [[0, 4, 11], [ 16 ], [5, 8, 9]], [[ 18 ], [19, 21], [ 17 ]], [[27], [28, 30], [ 22 ]] > \] Each processor merges the lists they have received: \[ < [0, 4, 5, 8, 9, 11, 16], [17, 18, 19, 21], [22, 27, 28, 30, 22] > \]

and this is the final result.

For more details:

- See the thesis of Tiskin, page 27. Exact but not so easy to read.
- This text-book chapter, but it is in French.

**Task 1:** Implement the above algorithm.

**Task 2:** The above algorithm cannot handle duplicates in the input
array. This is be fixed by adding a pre-processing that transforms the
data so that all values are guaranteed to be unique and then and
post-processing step that removes the effect of the
pre-processing. This pre-/post-processing can be formed using some
arithmetic on the value and the index of the value, but I let you
think about it. There's a hint in the next paragraph which is visible
if you select it.

One way is letting the original element \(x_i\) become be \(x'_i = x_i * B + i\), where \(B\) is a constant that is larger than the number of elements. The post-processing step for each final, sorted value \(y_i\) is then \(y'_i = y_i / B\). For instance, if we have the list [1,2,59,34,59], we can let \(B = 100\) and obtain \([100, 201, 5902, 3403, 5904]\). The sorted array will be \([100, 201, 3403, 5902, 5904]\). After post-processing, we have \([100, 201, 3403, 5902, 5904]\)

**Task 3:** (Ambitious). Change the algorithm so that it sorts
strings. For this, you will have to use `strcmp`

, and several other
changes. Then have the processor 0 read words from a file (I gave some
hints on how to do this in the Week 2: Task 3). Of course, the file
might contain duplicate words, a solution similar to that in **Task 2**
will have to be implemented to work around this (if you skipped that
one, assume no doubles). Then let processor 0 broadcast this array to
the others, and sort them. Collect the result on processor 0, and
write them back to another file.

–

Back to overview.