assignment1 - Katyusha's blog

环境

OS wsl2 6.6.87.2-microsoft-standard-WSL2 Ubuntu 24.04.3 LTS

CPU: Intel Core i7 13620H 8 cores, 10 logic processors, AVX2

GPU GeForce RTX 4060 Laptop

assignment1#

https://github.com/stanford-cs149/asst1

prog1_mandelbrot_threads#

给定了计算并生成以下称为mandelbrot_set图像的代码，其中计算一个像素的时间与这个像素的亮度成正比

可以sudo apt install feh后，使用feh <image-path> 查看生成的图像

给出的mandelbrotSerial.cpp实现了单线程的计算，我们需要在mandelbrotThread.cpp中修改workerThreadStart函数，使用std::thread加速这个过程

我们先采用朴素的分配方式，将空间从上到下分为线程数份，分别分配给每个线程执行

1
void workerThreadStart(WorkerArgs * const args) {
2

3
    // TODO FOR CS149 STUDENTS: Implement the body of the worker
4
    // thread here. Each thread should make a call to mandelbrotSerial()
5
    // to compute a part of the output image.  For example, in a
6
    // program that uses two threads, thread 0 could compute the top
7
    // half of the image and thread 1 could compute the bottom half.
8

9
    printf("Hello world from thread %d\n", args->threadId);
10

11
    unsigned int num = args->height / args->numThreads;
12
    unsigned int st = args->threadId * num, ed = st + num;
13
    if (args->threadId == args->numThreads - 1) ed = args->height;
14

15
    mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, st, ed - st, args->maxIterations, args->output);
16
}

make后使用./mandelbort -t <线程数>运行得到线程数与加速比的关系如下第一次用matplotlib画图显示中文字符调了我半天:(

有意思的是，在view1中，使用3个线程加速比反而降低了，观察view1图像，中间部分计算量明显大于两边的计算量，因而负载不均匀导致加速比降低

一种有效的方式是讲整个图像划分为若干块，每个块内使用我们第一次的方式，因为相邻部分计算量差距较小，这就保证了负载尽量均匀，这被称为分块+线程内连续行分配的策略

1
void workerThreadStart(WorkerArgs * const args) {
2

3
    // TODO FOR CS149 STUDENTS: Implement the body of the worker
4
    // thread here. Each thread should make a call to mandelbrotSerial()
5
    // to compute a part of the output image.  For example, in a
6
    // program that uses two threads, thread 0 could compute the top
7
    // half of the image and thread 1 could compute the bottom half.
8

9
    printf("Hello world from thread %d\n", args->threadId);
10

11
    static constexpr unsigned int BLOCKSIZE = 128;
12
    unsigned int blocks = args->height / BLOCKSIZE;
13
    for (unsigned int i = 0; i < blocks; i++) {
14
        unsigned int block_st = i * BLOCKSIZE, block_ed = block_st + BLOCKSIZE;
15
        if (i + 1 == blocks) block_ed = args->height;
16
        unsigned int block_len = block_ed - block_st, thread_len = block_len / args->numThreads;
17
        unsigned thread_st = block_st + args->threadId * thread_len, thread_ed = thread_st + thread_len;
18
        if (args->threadId + 1 == args->numThreads) thread_ed = block_ed;
19
        mandelbrotSerial(args->x0, args->y0, args->x1, args->y1, args->width, args->height, thread_st, thread_ed - thread_st, args->maxIterations, args->output);
20
    }
21
}

得到的结果如下：

对比发现生成两张图片的加速比都高于简单的空间划分，加速比的波动是由于块的大小这个参数造成的

prog2_vecintrin#

要求我们使用intrinsics向量化重写代码，但是为了简化，使用的是cs149提供的库函数(事实上库函数也是标量模拟向量化)

阅读CS149intrin.cpp源码，其中每个函数功能如下

__cs149_mask _cs149_init_ones(int first=VECTOR_WIDTH)：返回一个前first位为1，其余位为0的mask向量

__cs149_mask _cs149_mask_not(__cs149_mask &maska)：将maska向量取反

__cs149_mask _cs149_mask_or(__cs149_mask &maska, __cs149_mask &maskb)：返回两个向量or运算后的向量

__cs149_mask _cs149_mask_and(__cs149_mask &maska, __cs149_mask &maskb)：返回两个向量and运算后的向量

int _cs149_cntbits(__cs149_mask &maska) ：返回maska中1的个数

对于接下来传入的mask参数，当某一位为1时表示对该位进行写入，为0时表示不写入

void _cs149_vset(__cs149_vec<T> &vecResult, T value, __cs149_mask &mask)：将value尝试写入向量每一位

void _cs149_vmove(__cs149_vec<T> &dest, __cs149_vec<T> &src, __cs149_mask &mask)：将src每一位尝试按位写入dest中

void _cs149_vload(__cs149_vec<T> &dest, T* src, __cs149_mask &mask)：将src数组每个位置按位尝试写入dest中

void _cs149_vstore(T* dest, __cs149_vec<T> &src, __cs149_mask &mask) ：将src每一位尝试写入数组dest中

void _cs149_vadd(__cs149_vec<T> &vecResult, __cs149_vec<T> &veca, __cs149_vec<T> &vecb, __cs149_mask &mask)：将veca和vecb每一位做加法之后尝试写入vecResult中，vsub，vmult，vdiv同理

void _cs149_vabs(__cs149_vec<T> &vecResult, __cs149_vec<T> &veca, __cs149_mask &mask)：将veca每一位取绝对值后，尝试写入vecResult中

void _cs149_vgt(__cs149_mask &maskResult, __cs149_vec<T> &veca, __cs149_vec<T> &vecb, __cs149_mask &mask)：将veca大于vecb按位比较得到的bool值尝试写入maskResult中，vlt，veq同理

void _cs149_hadd(__cs149_vec<T> &vecResult, __cs149_vec<T> &vec)：从0开始，计算相邻奇偶两项的和，直接写入vecResult的原位置中

void _cs149_interleave(__cs149_vec<T> &vecResult, __cs149_vec<T> &vec)：将vec中偶数索引放入vecResult前半部分，奇数索引放入后半部分

补全absVector#

提供的代码已经实现了absVector主体部分，但是没有处理N % VECTOR_WIDTH != 0的情况，需要我们补全对余数的处理

因为每次计算答案不会发生变化，一种比较简单的方法是对于最后VECTOR_WIDTH个元素再向量化处理一次

1
// implementation of absSerial() above, but it is vectorized using CS149 intrinsics
2
void absVector(float* values, float* output, int N) {
3
  __cs149_vec_float x;
4
  __cs149_vec_float result;
5
  __cs149_vec_float zero = _cs149_vset_float(0.f);
6
  __cs149_mask maskAll, maskIsNegative, maskIsNotNegative;
7

8
//  Note: Take a careful look at this loop indexing.  This example
9
//  code is not guaranteed to work when (N % VECTOR_WIDTH) != 0.
10
//  Why is that the case?
11
  int i;
12
  for (i=0; i + VECTOR_WIDTH - 1 <N; i+=VECTOR_WIDTH) {
13

14
    // All ones
15
    maskAll = _cs149_init_ones();
16

17
    // All zeros
18
    maskIsNegative = _cs149_init_ones(0);
19

20
    // Load vector of values from contiguous memory addresses
21
    _cs149_vload_float(x, values+i, maskAll);               // x = values[i];
22

23
    // Set mask according to predicate
24
    _cs149_vlt_float(maskIsNegative, x, zero, maskAll);     // if (x < 0) {
25

26
    // Execute instruction using mask ("if" clause)
27
    _cs149_vsub_float(result, zero, x, maskIsNegative);      //   output[i] = -x;
28

29
    // Inverse maskIsNegative to generate "else" mask
30
    maskIsNotNegative = _cs149_mask_not(maskIsNegative);     // } else {
31

32
    // Execute instruction ("else" clause)
33
    _cs149_vload_float(result, values+i, maskIsNotNegative); //   output[i] = x; }
34

35
    // Write results back to memory
36
    _cs149_vstore_float(output+i, result, maskAll);
37
  }
38
  if (N % VECTOR_WIDTH == 0) return;
39
  i = N - VECTOR_WIDTH;
40
  maskAll = _cs149_init_ones();
41
  maskIsNegative = _cs149_init_ones(0);
42
  _cs149_vload_float(x, values + i, maskAll);
43
  _cs149_vlt_float(maskIsNegative, x, zero, maskAll);
44
  _cs149_vsub_float(result, zero, x, maskIsNegative);
45
  _cs149_mask_not(maskIsNegative);
46
  _cs149_vmove_float(result, x, maskIsNegative);
47
  _cs149_vstore_float(output, result, maskAll);
48
}

实现clampedExpVector#

要求你使用intrinsics实现朴素的 $values[i]$ 的 $exponents[i]$ 次方计算，当结果与9.999999f取 $min$

直接实现即可

1
void clampedExpVector(float* values, int* exponents, float* output, int N) {
2

3
  //
4
//   CS149 STUDENTS TODO: Implement your vectorized version of
5
  // clampedExpSerial() here.
6
  //
7
  // Your solution should work for any value of
8
  // N and VECTOR_WIDTH, not just when VECTOR_WIDTH divides N
9
  //
10
    __cs149_mask maskAll = _cs149_init_ones();
11
    __cs149_vec_int zero = _cs149_vset_int(0);
12
    __cs149_vec_int one = _cs149_vset_int(1);
13

14
    auto calc = [&] (int i) {
15
        __cs149_vec_int count;
16
        _cs149_vload_int(count, exponents + i, maskAll);
17
        __cs149_vec_float base;
18
        _cs149_vload_float(base, values + i, maskAll);
19
        __cs149_vec_float result = _cs149_vset_float(1.f);
20
        __cs149_mask Ispositive;
21
        _cs149_vgt_int(Ispositive, count, zero, maskAll);
22
        while (_cs149_cntbits(Ispositive) > 0) {    //while(count > 0)
23
            _cs149_vmult_float(result, result, base, Ispositive);   //if (count[i] > 0) result[i] *= values[i]
24
            _cs149_vsub_int(count, count, one, maskAll);    //count--
25
            _cs149_vgt_int(Ispositive, count, zero, maskAll);
26
        }
27

28
        __cs149_vec_float bound = _cs149_vset_float(9.999999f);
29
        _cs149_vgt_float(Ispositive, result, bound, maskAll);   //if (result[i] > bound)
30
        _cs149_vmove_float(result, bound, Ispositive);  //result[i] = bound
31
        _cs149_vstore_float(output + i, result, maskAll); //output[i] = result[i]
32
    };
33

34
    for (int i = 0; i + VECTOR_WIDTH - 1 < N; i += VECTOR_WIDTH) {
35
        calc(i);
36
    }
37
    if (N % VECTOR_WIDTH == 0) return;
38
    calc(N - VECTOR_WIDTH);
39
}

在工作集大小为10000时，测试结果如下

1
CLAMPED EXPONENT (required)
2
Results matched with answer!
3
****************** Printing Vector Unit Statistics *******************
4
Vector Width:              4
5
Total Vector Instructions: 97070
6
Vector Utilization:        90.4%
7
Utilized Vector Lanes:     350981
8
Total Vector Lanes:        388280
9
************************ Result Verification *************************
10
Passed!!!

按照要求改动VECTOR_WIDTH参数，发现从2到16，向量利用率依次下降。当向量大小增大的时候，一个向量中出现分歧divergence的概率增加，导致部分向量空转，从而降低了平均向量利用率

arraySumVector(bonus)#

要求实现数组求和的向量化版本，保证了数组长度一定是VECTOR_WIDTH的倍数，并且VECTOR_WIDTH为2的幂

$O(\frac{n}{VECTOR\_WIDTH}+VECTOR\_WIDTH)$ 的做法是trivial的，考虑 $O(\frac{n}{VECTOR\_WIDTH}+ \log_2{VECTOR\_WIDTH})$ 的做法

在使用向量累加数组中的元素后，考虑怎么优化向量内部求和的过程，由于已经提供了hadd和interleave两个内置函数，考虑每次hadd之后再interleave，此后向量前半部分的和就是原向量的总和，这样我们只需要 $O(log_2{VECTOR\_WIDTh})$ 次迭代即可

1
// returns the sum of all elements in values
2
// You can assume N is a multiple of VECTOR_WIDTH
3
// You can assume VECTOR_WIDTH is a power of 2
4
float arraySumVector(float* values, int N) {
5

6
  //
7
  // CS149 STUDENTS TODO: Implement your vectorized version of arraySumSerial here
8
  //
9
    __cs149_vec_float acc = _cs149_vset_float(0.f);
10
    __cs149_mask maskAll = _cs149_init_ones();
11

12
    for (int i=0; i<N; i+=VECTOR_WIDTH) {
13
        __cs149_vec_float vec_values;
14
        _cs149_vload_float(vec_values, values + i, maskAll);
15
        _cs149_vadd_float(acc, acc, vec_values, maskAll);
16
    }
17
    for (int i = 1; i < VECTOR_WIDTH; i <<=1) {
18
        _cs149_hadd_float(acc, acc);
19
        _cs149_interleave_float(acc, acc);
20
    }
21
    float *ptr = new float[VECTOR_WIDTH];
22
    _cs149_vstore_float(ptr, acc, maskAll);
23
    float result = ptr[0];
24
    delete ptr;
25
    return result;
26
}

prog3_mandelbrot_ispc#

首先需要安装ispc, sudo snap install ispc即可

mandelbrot的ispc代码已经给出，我们只需要make后运行即可，结果如下

1
[mandelbrot serial]:            [140.509] ms
2
Wrote image file mandelbrot-serial.ppm
3
[mandelbrot ispc]:              [39.913] ms
4
Wrote image file mandelbrot-ispc.ppm
5
                                (3.52x speedup from ISPC)

发现实际只加速了3.5倍左右，与理论的AVX28倍加速相差较大，究其原因是mandelbrot计算中每个点的迭代次数不同，导致了divergence，拉低了有效利用率，同时对于内存的访问可能同样是性能的瓶颈

注意到此处，我们只使用了数据级并行，还可以使用多线程进行任务级并行，ispc提供了launch机制，是用户级线程池，相较std::thread开销更低，采取工作窃取方式进行调度，更为高效

虽然理论上当使用launch创建的任务数与逻辑处理器数相等的时候加速比最高，但是实际上更高的任务数尽管增加了线程通信的成本，却可能使得任务负载更加均衡，进而提高效率。本地测试在任务数为50左右最为高效

1
[mandelbrot serial]:            [140.531] ms
2
Wrote image file mandelbrot-serial.ppm
3
[mandelbrot ispc]:              [40.018] ms
4
Wrote image file mandelbrot-ispc.ppm
5
[mandelbrot multicore ispc]:    [4.202] ms
6
Wrote image file mandelbrot-task-ispc.ppm
7
                                (3.51x speedup from ISPC)
8
                                (33.45x speedup from task ISPC)

prog4_sqrt#

已经完成了牛顿迭代法求解平方根倒数的ispc代码，其中需要求解的数据在 $(0,3)$ 中，并且迭代次数与求解数据存在以下关系

最好/最坏加速比的构造#

你需要构造两组输入的数据，使得ispc加速比最大/最小

这是随机生成状态下的加速比

1
[sqrt serial]:          [633.462] ms
2
[sqrt ispc]:            [168.335] ms
3
[sqrt task ispc]:       [13.948] ms
4
                                (3.76x speedup from ISPC)
5
                                (45.42x speedup from task ISPC)

为了使得加速比最大化，我们既需要让指令级并行加速，既尽可能让divergence发生少，同时也需要加速任务级并行，即让每个任务负载尽量平均

一个很好的想法是输入值全部相同，这样不会发生divergence，同时任务的负载也是相当均匀的

注意到为了尽量提高加速比，在其他部分耗时不变的情况下，我们需要提高计算在总时间中的占比，取所有值都接近于3.0即可，此时运算最多，加速比也最大

1
[sqrt serial]:          [2752.093] ms
2
[sqrt ispc]:            [573.343] ms
3
[sqrt task ispc]:       [50.224] ms
4
                                (4.80x speedup from ISPC)
5
                                (54.80x speedup from task ISPC)

要构造最坏加速比，我们不难联想到课程中的提问，只需要让8位向量中出现divergence，其他7位很快计算完毕后。空转，1位执行非常多的计算即可，使得CPU达到最坏利用率

所以考虑每隔8个数，构造一个为计算次数最多的接近于3.0，其它都为无需计算的1.0

1
[sqrt serial]:          [420.101] ms
2
[sqrt ispc]:            [547.059] ms
3
[sqrt task ispc]:       [49.483] ms
4
                                (0.77x speedup from ISPC)
5
                                (8.49x speedup from task ISPC

有意思的是，ISPC向量化之后反而比标量执行更慢了，这可能是因为向量化引入的额外指令开销（如掩码操作、数据打包等）而变得更慢

手写intrinsics实现牛顿迭代法求平方根倒数(bonus)#

完成以下ISPC代码的AVX2实现

1
export void sqrt_ispc(uniform int N,
2
                      uniform float initialGuess,
3
                      uniform float values[],
4
                      uniform float output[])
5
{
6
    foreach (i = 0 ... N) {
7

8
        float x = values[i];
9
        float guess = initialGuess;
10

11
        float pred = abs(guess * guess * x - 1.f);
12

13
        while (pred > kThreshold) {
14
            guess = (3.f * guess - x * guess * guess * guess) * 0.5f;
15
            pred = abs(guess * guess * x - 1.f);
16
        }
17

18
        output[i] = x * guess;
19

20
    }
21
}

请输入文本

1
static void AVX2_invsqrt(int n, float initialGuess, float values[], float outputs[]) {
2
    __m256 vec_one = _mm256_set1_ps(1.0f);
3
    __m256 vec_three = _mm256_set1_ps(3.0f);
4
    __m256 vec_half = _mm256_set1_ps(0.5f);
5
    __m256 vec_neg = _mm256_set1_ps(-0.0f);
6
    __m256 vec_threshold = _mm256_set1_ps(kThreshold);
7

8
    auto calc = [&](int i) {
9
        __m256 vec_x = _mm256_loadu_ps(values + i);
10
        __m256 vec_guess = _mm256_set1_ps(initialGuess);
11

12
        __m256 vec_pred = _mm256_mul_ps(vec_guess, vec_guess);
13
        vec_pred = _mm256_mul_ps(vec_pred, vec_x);
14
        vec_pred = _mm256_sub_ps(vec_pred, vec_one);
15
        vec_pred = _mm256_andnot_ps(vec_neg, vec_pred);
16

17
        __m256 mask = _mm256_cmp_ps(vec_pred, vec_threshold, _CMP_GT_OQ);
18
        while (_mm_popcnt_u32(_mm256_movemask_ps(mask))) {
19
            __m256 vec_sub, new_guess;
20
            new_guess = _mm256_mul_ps(vec_guess, vec_three);
21
            vec_sub = _mm256_mul_ps(vec_guess, vec_guess);
22
            vec_sub = _mm256_mul_ps(vec_sub, vec_guess);
23
            vec_sub = _mm256_mul_ps(vec_sub, vec_x);
24
            new_guess = _mm256_sub_ps(new_guess, vec_sub);
25
            new_guess = _mm256_mul_ps(new_guess, vec_half);
26
            vec_guess = _mm256_blendv_ps(vec_guess, new_guess, mask);
27

28
            __m256 new_pred;
29
            new_pred = _mm256_mul_ps(vec_guess, vec_guess);
30
            new_pred = _mm256_mul_ps(new_pred, vec_x);
31
            new_pred = _mm256_sub_ps(new_pred, vec_one);
32
            new_pred = _mm256_andnot_ps(vec_neg, new_pred);
33
            vec_pred = _mm256_blendv_ps(vec_pred, new_pred, mask);
34

35
            mask = _mm256_cmp_ps(vec_pred, vec_threshold, _CMP_GT_OQ);
36
        }
37

38
        _mm256_storeu_ps(outputs + i, _mm256_mul_ps(vec_guess, vec_x));
39
    };
40
    for (int i = 0; i + 7 < n; i += 8) calc(i);
41
    if (n % 8 == 0) return;
42
    calc(n - 8);
43
}

实际性能如下：

1
[sqrt serial]:          [425.807] ms
2
[sqrt ispc]:            [541.254] ms
3
[sqrt AVX2]:            [520.926] ms
4
[sqrt task ispc]:       [53.460] ms
5
                                (0.79x speedup from ISPC)
6
                                (0.82x speedup from AVX2)
7
                                (7.96x speedup from task ISPC)

发现相较ISPC没有显著提升，~~所以没事别来写这一坨~~

prog5_saxpy#

给了我们使用ISPC加速saxpy计算result=scale*X+Y，其中result,X,Y为 $2\times10^7$ 大小的向量，scale为标量

运行后结果如下：

1
[saxpy serial]:         [10.404] ms     [28.644] GB/s   [3.845] GFLOPS
2
[saxpy ispc]:           [8.441] ms      [35.306] GB/s   [4.739] GFLOPS
3
[saxpy task ispc]:      [4.405] ms      [67.656] GB/s   [9.081] GFLOPS
4
                                (1.92x speedup from use of tasks)
5
                                (1.23x speedup from ISPC)
6
                                (2.36x speedup from task ISPC)

我们发现加速效果相当差，远低于理论加速。不难发现，每两次运算(一次乘法和加法)需要访问内存四次(读取X,读取Y,对result读取所有权，写入result)共32字节，这与每访问字节进行10至20次运算的理想访存计算比相差极大，所以性能瓶颈在内存访问上，而不是计算上，无法通过并行计算达到近线性的加速

附加题：尝试有效地加速该程序不会:(

prog6_kmeans#

K-Means算法实现了将一堆散乱的数据点分为k簇，所有数据点到其所属簇中心的距离平方和是一个局部最小值

给出你一段K_Means的正确但未优化的代码，要求自行找到性能热点并加速，要求至少达到2.1倍加速比

在main.cpp中去掉部分注释，本地生成数据，原始代码效率如下(运行时可能稍有波动)

1
Reading data.dat...
2
Running K-means with: M=1000000, N=100, K=3, epsilon=0.100000
3
[Total Time]: 6587.359 ms

在重复迭代直至收敛的过程中，执行了三个函数computeAssignments，computeCentroids，computeCost，使用课程提供的CycleTimer::currentSeconds发现耗时分别如下

1
Reading data.dat...
2
Running K-means with: M=1000000, N=100, K=3, epsilon=0.100000
3

4
computeAssignments: 4187.6ms
5
computeCentroids: 874.883ms
6
computeCost: 1277.58ms
7

8
assign_init: 28.4267ms assign_assign: 4158.83ms
9
centr_zero: 0.011599ms centr_sum: 874.803ms centr_compute: 0.003952ms
10
cost_zero: 0.008803ms cost_sum: 1277.49ms cost_update: 0.014024ms
11
[Total Time]: 6340.125 ms

发现性能瓶颈在computeAssignments中的分配部分，源代码如下

1
  for (int k = args->start; k < args->end; k++) {
2
    for (int m = 0; m < args->M; m++) {
3
      double d = dist(&args->data[m * args->N],
4
                      &args->clusterCentroids[k * args->N], args->N);
5
      if (d < minDist[m]) {
6
        minDist[m] = d;
7
        args->clusterAssignments[m] = k;
8
      }
9
    }
10
  }

考虑对这部分进行优化，阅读发现是对于每一个点，求离k个簇中哪个簇距离最近并更新

我们发现这k次更新具有并行性，可以考虑使用k个线程

1
std::mutex mtx;
2

3
void assign(double *minDist, WorkerArgs *const args, int k) {
4
    for (int m = 0; m < args->M; m++) {
5
      mtx.lock();
6
      double d = dist(&args->data[m * args->N],
7
                      &args->clusterCentroids[k * args->N], args->N);
8
      if (d < minDist[m]) {
9
        minDist[m] = d;
10
        args->clusterAssignments[m] = k;
11
      }
12
      mtx.unlock();
13
    }
14
}
15

16
double assign_init = 0, assign_assign = 0;
17
void computeAssignments(WorkerArgs *const args) {
18
  double *minDist = new double[args->M];
19

20
  // Initialize arrays
21
  double timer; Set_timer(timer);
22
  for (int m =0; m < args->M; m++) {
23
    minDist[m] = 1e30;
24
    args->clusterAssignments[m] = -1;
25
  }
26
  assign_init += Get_timer(timer);
27

28
  // Assign datapoints to closest centroids
29
  Set_timer(timer);
30
  static constexpr int SIZE = 3;
31
  std::thread workers[SIZE];
32
  for (int k = args->start + 1; k < args->end; k++) {
33
    workers[k] = std::thread(assign, minDist, args, k);
34
  }
35
  assign(minDist, args, args->start);
36
  for (int k = args->start + 1; k < args->end; k++) {
37
    workers[k].join();
38
  }
39
  assign_assign += Get_timer(timer);
40

41
  delete[] minDist;
42
}

注意会产生多个线程对于minDist和args->clusterAssignments的更新，需要上锁

效率如下：(我自己都要蚌埠住了)

1
Reading data.dat...
2
Running K-means with: M=1000000, N=100, K=3, epsilon=0.100000
3

4
computeAssignments: 10419.6ms
5
computeCentroids: 843.017ms
6
computeCost: 1184.35ms
7

8
assign_init: 25.4263ms assign_assign: 10393.9ms
9
centr_zero: 0.009089ms centr_sum: 842.949ms centr_compute: 0.004045ms
10
cost_zero: 0.007699ms cost_sum: 1184.28ms cost_update: 0.009066ms
11
[Total Time]: 12447.063 ms

效率严重下降，因为加锁的代价大于线程的收益

于是考虑将m个任务分配给多个线程进行并行计算，这样保证了同一个资源只会被一个线程访问

1
void assign(double *minDist, WorkerArgs *const args, int l, int r) {
2
    for (int k = args->start; k < args->end; k++) {
3
        for (int m = l; m < r; m++) {
4
        double d = dist(&args->data[m * args->N],
5
                        &args->clusterCentroids[k * args->N], args->N);
6
        if (d < minDist[m]) {
7
            minDist[m] = d;
8
            args->clusterAssignments[m] = k;
9
        }
10
        }
11
    }
12
}
13

14
double assign_init = 0, assign_assign = 0;
15
void computeAssignments(WorkerArgs *const args) {
16
  double *minDist = new double[args->M];
17

18
  // Initialize arrays
19
  double timer; Set_timer(timer);
20
  for (int m =0; m < args->M; m++) {
21
    minDist[m] = 1e30;
22
    args->clusterAssignments[m] = -1;
23
  }
24
  assign_init += Get_timer(timer);
25

26
  // Assign datapoints to closest centroids
27
  Set_timer(timer);
28
  static constexpr int THREADS = 32;
29
  std::thread workers[THREADS];
30
  int len = args->M / THREADS;
31
  for (int i = 1; i < THREADS; i++) {
32
      int l = i * len, r = l + len;
33
      if (i == THREADS - 1) r = args->M;
34
      workers[i] = std::thread(assign, minDist, args, l, r);
35
  }
36
  int l = 0, r = len;
37
  assign(minDist, args, l, r);
38
  for (int i = 1; i < THREADS; i++) workers[i].join();
39
  assign_assign += Get_timer(timer);
40

41
  delete[] minDist;
42
}

结果如下，刚好达到了要求的加速比，代码的其他部分也可以类似地并行化来提高效率，但是题目只要求了并行化一处

1
Reading data.dat...
2
Running K-means with: M=1000000, N=100, K=3, epsilon=0.100000
3

4
computeAssignments: 1067.84ms
5
computeCentroids: 872.724ms
6
computeCost: 1160.83ms
7

8
assign_init: 25.0641ms assign_assign: 1042.34ms
9
centr_zero: 0.009931ms centr_sum: 872.655ms centr_compute: 0.004261ms
10
cost_zero: 0.008192ms cost_sum: 1160.75ms cost_update: 0.012866ms
11
[Total Time]: 3101.462 ms

Katyusha's blog