The *PSRS* (Parallel Sorting by Regular Sampling) algorithm consists of six distinct phases. It also uses
the message-passing communication paradigms of *send*, *receive*, *broadcast*, *scatter*, and
*gather*:

- *send* allows one processor to send a message to another processor,
- *receive* allows one processor to receive a message sent from another processor,
- *broadcast* allows one processor to send a message to all processors,
- *scatter* allows one processor to distribute pieces of data (say, an array) to all processors, and
- *gather* allows one processor to collect pieces of data (say, an array) from all processors.
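The semantics of *scatter*, *gather*, and *broadcast* can be sketched with plain Python lists. The function names below are illustrative stand-ins for the collectives just described, not part of any MPI binding.

```python
# Sequential sketches of the collective operations (illustrative only;
# real MPI collectives move data between separate processes).

def scatter(data, p):
    """Root splits data into p roughly equal pieces, one per processor."""
    n = len(data)
    counts = [n // p + (1 if i < n % p else 0) for i in range(p)]
    pieces, start = [], 0
    for c in counts:
        pieces.append(data[start:start + c])
        start += c
    return pieces

def gather(pieces):
    """Root collects one piece from each processor into a single list."""
    result = []
    for piece in pieces:
        result.extend(piece)
    return result

def broadcast(value, p):
    """Root sends the same value to every one of the p processors."""
    return [value] * p
```

Note that a *scatter* followed immediately by a *gather* reproduces the original data, which is exactly the relationship the algorithm relies on between Phases II and VI.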

- How could *broadcast* be made more efficient?
- What might you suspect is the difference between *scatter* and *scatterv*?

This is a phase-oriented description of the PSRS algorithm. The description also
specifies the communication paradigms used. Notice that *partition*, *quicksort*, and *multimerge*
are used in this description as well.

**Phase I: Initialization**

Start up \(p\) processors and let the root processor, *0*, obtain the data of size \(n\).

**Phase II: Scatter data, local sort and regular samples collected**

*Scatter* the data values to the \(p\) processors. Each processor sorts its local
data set, roughly of size \(\frac{n}{p}\), using *quicksort*. Each processor then chooses \(p\) sample points, in a very regular manner, from
its locally sorted data.
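One common way to choose the regular sample (assumed here; other texts use slightly different offsets) is to take the elements at offsets \(0, \frac{m}{p}, \frac{2m}{p}, \ldots\) of the locally sorted data of size \(m\):

```python
def regular_sample(sorted_local, p):
    """Pick p regularly spaced samples from locally sorted data.

    Assumed convention: take the elements at offsets
    0, m/p, 2m/p, ..., (p-1)m/p, where m = len(sorted_local).
    """
    m = len(sorted_local)
    return [sorted_local[(i * m) // p] for i in range(p)]
```

Because the local data is sorted, each processor's sample is itself sorted, which Phase III depends on.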

**Phase III: Gather and merge samples, choose and broadcast p-1 pivots**

The root processor, *0*, gathers the \(p\) sets of
\(p\) sample points. It is important to realize that each set of these \(p\) sample
points is sorted. These \(p\) sets are sorted using *multimerge*. From these \(p^2\)
sorted points, \(p-1\) pivot values are regularly chosen and are *broadcast* to the other \(p-1\)
processors.
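A sequential sketch of this phase, using Python's `heapq.merge` as the *multimerge* of the \(p\) sorted sample sets; the pivot positions used (every \(p^{th}\) entry of the merged list, skipping the first) are one common convention, assumed here.

```python
import heapq

def choose_pivots(sample_sets, p):
    """Merge p sorted sample sets and pick p-1 regularly spaced pivots.

    The merged list has p*p entries; the assumed convention takes the
    entries at indices p, 2p, ..., (p-1)p as pivots.
    """
    merged = list(heapq.merge(*sample_sets))  # multimerge of sorted runs
    return [merged[i * p] for i in range(1, p)]
```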

**Phase IV: Local data is partitioned**

Each of the \(p\) processors *partitions* its local sorted data,
roughly of size \(\frac{n}{p}\), into \(p\) classes using
the \(p-1\) pivot values.
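Since the local data is already sorted, this *partition* step only needs to locate each pivot's position, for example with binary search. A sketch (the boundary convention, values \(\le\) pivot go left, is an assumption; descriptions vary):

```python
import bisect

def partition(sorted_local, pivots):
    """Split locally sorted data into len(pivots)+1 classes.

    Assumed convention: class i holds the values v with
    pivots[i-1] < v <= pivots[i].
    """
    classes, start = [], 0
    for piv in pivots:
        end = bisect.bisect_right(sorted_local, piv)
        classes.append(sorted_local[start:end])
        start = end
    classes.append(sorted_local[start:])
    return classes
```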

**Phase V: All \(i^{th}\) classes are gathered and merged**

Processor *i* *gathers* the \(i^{th}\) class of data from every other processor. Each of these
classes is sorted using *multimerge*.

**Phase VI: Root processor collects all the data**

The root processor *gathers* all the data and assembles the sorted list of \(n\) values.
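The six phases can be simulated sequentially end to end. The sketch below follows the sampling, pivot, and partition conventions assumed earlier (Python's built-in sort stands in for *quicksort*, and `heapq.merge` for *multimerge*); it is a model of the data flow, not a parallel implementation.

```python
import bisect
import heapq

def psrs_simulated(data, p):
    """Sequentially simulate the six PSRS phases on one machine."""
    n = len(data)
    # Phase I/II: "scatter" into p chunks, local sort, regular samples.
    chunks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    samples = [[c[(j * len(c)) // p] for j in range(p)] for c in chunks]
    # Phase III: root multimerges the p sorted samples, picks p-1 pivots.
    merged_samples = list(heapq.merge(*samples))
    pivots = [merged_samples[i * p] for i in range(1, p)]
    # Phase IV: each processor partitions its chunk by the pivots.
    parts = []
    for c in chunks:
        bounds = [0] + [bisect.bisect_right(c, piv) for piv in pivots] + [len(c)]
        parts.append([c[bounds[k]:bounds[k + 1]] for k in range(p)])
    # Phase V: processor i gathers and multimerges all i-th classes.
    merged_classes = [list(heapq.merge(*(parts[j][i] for j in range(p))))
                      for i in range(p)]
    # Phase VI: root gathers the sorted classes in order.
    return [v for cls in merged_classes for v in cls]
```

Note that the final list is sorted regardless of which pivots are chosen, because the classes cover disjoint, ordered ranges; the quality of the pivots only affects how evenly the classes are balanced across processors.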

- Manually, sort a list of \(20\) integers, with \(4\) simulated processors, following these six phases.
- What can you say about the time complexity of each phase of this algorithm? What does this tell you about the time complexity of the algorithm?
- What can you say about the space requirements of each phase of this algorithm? How does this affect the time complexity?
- What can you say about the communication complexity of this algorithm?

The following figure illustrates this algorithm using \(1000\) data points and \(4\) processors. The communication between processors is annotated with the *MPI* functions *Broadcast*, *Gather*, *Gatherv*, and *Scatter*. As you study this figure, realize that *Gatherv* in particular is rather complicated. The details of all these *MPI* functions will be discussed in the next section. For now, relate the six phases of the algorithm to the data, communication, and computation flows indicated in the figure.

Flow of data and computation using \(4\) processors and \(1000\) values to be sorted using the *Parallel Sorting by Regular Sampling* algorithm.

- Identify each of the \(6\) phases of the algorithm in the figure.
- The two sets of arrays *classStart[]* and *classLength[]*, and *recvStarts[]* and *recvLengths[]*, are used as indices into *mydata[]* and *recvbuffer[]* on the *sending* and *receiving* processors in the *Gatherv*, respectively. The details of *Gatherv* will be discussed in the next section, but make sure you understand the scope of these arrays with respect to the different processors.