非交换算子的单块并行约简

Created: November-22, 2018

与可交换版本相比，对非交换运算符进行并行缩减更为复杂。在示例中，为简单起见，我们仍然使用整数加法。例如，它可以用矩阵乘法代替，它实际上是非交换的。注意，这样做时，0 应该被乘法的中性元素替换 - 即单位矩阵。

static const int arraySize = 1000000;
static const int blockSize = 1024;

__global__ void sumNoncommSingleBlock(const int *gArr, int *out) {
    int thIdx = threadIdx.x;
    __shared__ int shArr[blockSize*2];
    __shared__ int offset;
    shArr[thIdx] = thIdx<arraySize ? gArr[thIdx] : 0;
    if (thIdx == 0)
        offset = blockSize;
    __syncthreads();
    while (offset < arraySize) { //uniform
        shArr[thIdx + blockSize] = thIdx+offset<arraySize ? gArr[thIdx+offset] : 0;
        __syncthreads();
        if (thIdx == 0)
            offset += blockSize;
        int sum = shArr[2*thIdx] + shArr[2*thIdx+1];
        __syncthreads();
        shArr[thIdx] = sum;
    }
    __syncthreads();
    for (int stride = 1; stride<blockSize; stride*=2) { //uniform
        int arrIdx = thIdx*stride*2;
        if (arrIdx+stride<blockSize)
            shArr[arrIdx] += shArr[arrIdx+stride];
        __syncthreads();
    }   
    if (thIdx == 0)
        *out = shArr[0];
}

...

sumNoncommSingleBlock<<<1, blockSize>>>(dev_a, dev_out);

在第一个 while 循环中执行，只要有多个输入元素而不是线程。在每次迭代中，执行单次缩减，并将结果压缩到 shArr 数组的前半部分。然后下半部分将填充新数据。

从 gArr 加载所有数据后，执行第二个循环。现在，我们不再压缩结果（这需要额外费用 3）。在每个步骤中，线程 n 访问 2*n-th 活动元素并将其与 tihuan 第 5 个元素相加：

有许多方法可以进一步优化这个简单的例子，例如通过减少 warp 级别和删除共享内存库冲突。