UBCO COSC370, BSPlib project: Week 4: Sorting

This week you will write you will write two sorting algorithms : radix and Tiskin-McColl sort. They solve slightly different problems but both are interesting in their own right.

My specifications are getting a bit fuzzier which each week. If you have any questions, do not hesitate to email me for more details.

1 Radix sort

Note: I will use the notation \(< b^0, b^2, \ldots, b^{p-1} >\) in this document to describe distributed arrays where \(b^i\) is stored in processor \(i\). For instance \(< [1,2,3], [4,5,6], [7,8,9] >\) is a distributed array where \(p=3\) and processor 0 has the local block containing \([1,2,3]\).

In this algorithm we sort by the key a set of \(N\) key-value pairs distributed over the processors. For the sake of simplicity, we let both keys and values be integers. The assumption behind radix-sort which greatly simplifies sorting is that (1) all keys are unique and (2) the value of the greatest key is known. This allows us to simply iterate over the input array and send each key-value pair to the corresponding correct placement in the sorted array.

We specify the algorithm thus:

Input:

  • Keys \(K\): A distributed array of integers of length \(N\) divided in blocks of size \(m\) (each processor has \(m\) values) of keys \(K = < [ [k^1_1, ..., k^1_m], [k^2_1, ..., k^2_m], ..., [k^p_1, ..., k^p_m] ] >\) i.e. the key \(k^i_j\) is the \(j\):th key on processor \(i\). We will assume that the keys are the integers \([0,...,N-1]\) which simplifies the algorithm.
  • Values \(V\): an array on the same form as \(K\), i.e.: A distributed array of integers of length \(N\) divided in blocks of size \(m\) of values \(V = < [v^1_1, ..., v^1_m], [v^2_1, ..., v^2_m], ..., [v^p_1, ..., v^p_m] >\) i.e. the value \(v^i_j\) is the \(j\):th value on processor \(i\). No other assumption is made on \(V\) which can be generated randomly.

Output:

  • \(K'\) and \(V'\) such that (1) \(K'\) is sorted in increasing order and such that each value in \(V'\) is still associated with the same key.

Example (w/ \(p=4, N=8, m=2\)):

Input:

  • \(K = < [3, 1], [7, 2], [5, 4], [6, 0] >\)
  • \(V = < [9, 2], [4, 1], [2, 4], [0, 2] >\)

Output:

  • \(K' = < [0, 1], [2, 3], [4, 5], [6, 7] >\)
  • \(V' = < [2, 2], [1, 9], [4, 2], [0, 4] >\)

Task 1: Implement an algorithm satisfying this specification. Note that each processor only have to iterate once over their input arguments.

2 Tiskin-McColl sort

This sorting algorithm is more complicated than the previous but allows us to sort efficiently arrays without using the assumptions of the radix sort. That is, we do not have to assume a known greatest value of the values to sort.

Prerequisites:

Specification:

Input :

  • \(V\): A distributed array of \(N\) integers in blocks of size \(m\) \(V = < [v^1_1, ..., v^1_m], [v^2_1, ..., v^2_m], ..., [v^p_1, ..., v^p_m] >\)

The array cannot contain any duplicates, i.e. all values must be unique.

Output:

  • \(V'\) which is \(V\) sorted, i.e. \(V' = < [v'^1_1, ..., v'^1_m], [v'^2_1, ..., v'^2_m], ..., [v'^p_1, ..., v'^p_m] >\) such that for all \(k\) and \(i\), \(V'^{i}_{k} < V'^{i}_{k+1}\) and furthermore forall \(i, k\), \(V'^{i}_{k} < V'^{i+1}_{0}\).

Algorithm:

The algorithm proceeds in several phases which I describe briefly:

  1. Local sort: each processor sorts their local block.
  2. Local sample: each processor takes \(p+1\) local samples in local block. This can be any \(p+1\) values from their local block.
  3. Local sample exchange: Each processor sends their samples to each other processor.
  4. Local sample merge: Each process now has \(p*(p+1)\) local samples. Each process merges the lists of samples while retaining the sort, using the method linked in the prerequisites.
  5. Global sampling: Take \(p+1\) samples from the \(p*(p+1)\) samples in a way so that all processors have the same samples. This is done without requiring any communication by having e.g. processor \(0\) take local sample \(0\), processor \(1\) takes sample \(p\), and processor \(i\) takes sample \(i*p\). These samples are called global separators.
  6. Routing: In this stage, each processors sends the local data the processor where it will appear in the final result. The separators allows the processors to know where it must send each element. If the element is \(v\), such that \(s_i <= v < s_{i+1}\) then the data is sent to process \(i\). To do this efficiently, each processor creates \(p\) lists: one for each processor. It puts its local elements in the list corresponding to where it must be sent, while retaining the elements sorted. It then sends each list to the corresponding processor.
  7. Final sort: In this last step, each processor will have received \(p\) lists: one from each processor. Since all received lists are sorted, they can be efficiently merged while retaining the sort. The result will be sorted the distributed array.

Example:

An example executions with (\(p = 3\), \(N = 15\) and so \(m = 5\)):

Input: \[ V = < [27, 18, 0, 4, 11], [21, 19, 30, 16, 28], [8, 5, 22, 17, 9] > \]

  1. Local sort:

\[ < [ 0, 4, 11, 18, 27], [16, 19, 21, 28, 30], [5, 8, 9, 17, 22] > \]

  1. Each processors picks local samples (underlined):

\[ < [ \underline{0}, 4, \underline{11}, \underline{18}, \underline{27} ], [ \underline{16}, 19, \underline{21}, \underline{28}, \underline{30} ], [\underline{5}, 8, \underline{9}, \underline{17}, \underline{22} ] > \]

  1. Exchange local samples.
  2. Each processor now has \(p*(p+1)\) local samples, that is 12:

\[ LS = [ [ \underline{0}, \underline{11}, \underline{18}, \underline{27} ], [ \underline{16}, \underline{21}, \underline{28}, \underline{30} ], [\underline{5}, \underline{9}, \underline{17}, \underline{22} ] ] \] they merge them and obtain \[ LS = [0, 5, 9, 11, 16, 17, 18, 21, 22, 27, 28, 30] \]

  1. Now each processor picks the same global separators:

\[ LS = [ \underline{0}, 5, 9, 11, \underline{16}, 17, 18, 21, \underline{22}, 27, 28, \underline{30} ] \] and discards the remaining \[ LS = [ \underline{0}, \underline{16}, \underline{22}, \underline{30} ] \]

  1. First each processor partitions their local-block according to the global separators.

\[ PS = < [ [0, 4, 11], [ 16 ], [5, 8, 9] ], [ [ 18 ], [19, 21], [ 17]], [[27 ], [28, 30], [ 22 ]] > \] Meaning that, processor \(0\) has to send \([0, 4, 11]\) to processor (to itself), \([ 16 ]\) to processor \(1\) and \([5, 8, 9]\) to processor 2. Processor 1 has to send \([ 18 ]\) to 0 and \([19, 21]\) to processor 1, etc. Now these exchanges are performed.

  1. Now we will have:

\[ < [[0, 4, 11], [ 16 ], [5, 8, 9]], [[ 18 ], [19, 21], [ 17 ]], [[27], [28, 30], [ 22 ]] > \] Each processor merges the lists they have received: \[ < [0, 4, 5, 8, 9, 11, 16], [17, 18, 19, 21], [22, 27, 28, 30, 22] > \]

and this is the final result.

For more details:

Task 1: Implement the above algorithm.

Task 2: The above algorithm cannot handle duplicates in the input array. This is be fixed by adding a pre-processing that transforms the data so that all values are guaranteed to be unique and then and post-processing step that removes the effect of the pre-processing. This pre-/post-processing can be formed using some arithmetic on the value and the index of the value, but I let you think about it. There's a hint in the next paragraph which is visible if you select it.

One way is letting the original element \(x_i\) become be \(x'_i = x_i * B + i\), where \(B\) is a constant that is larger than the number of elements. The post-processing step for each final, sorted value \(y_i\) is then \(y'_i = y_i / B\). For instance, if we have the list [1,2,59,34,59], we can let \(B = 100\) and obtain \([100, 201, 5902, 3403, 5904]\). The sorted array will be \([100, 201, 3403, 5902, 5904]\). After post-processing, we have \([100, 201, 3403, 5902, 5904]\)

Task 3: (Ambitious). Change the algorithm so that it sorts strings. For this, you will have to use strcmp, and several other changes. Then have the processor 0 read words from a file (I gave some hints on how to do this in the Week 2: Task 3). Of course, the file might contain duplicate words, a solution similar to that in Task 2 will have to be implemented to work around this (if you skipped that one, assume no doubles). Then let processor 0 broadcast this array to the others, and sort them. Collect the result on processor 0, and write them back to another file.

Back to overview.

Author: Arvid Jakobsson (arvid.jakobsson@huawei.com)

Created: 2018-03-09 ven. 16:25

Emacs 25.2.2 (Org mode 8.2.10)

Validate