Abstract:
• Smaller data sets are problematic because (see the sketch after this list):
  — Overheads are involved in launching a kernel.
  — GPU underutilization: a lower thread count has a lower tolerance for long memory access latencies, since there are no "other threads" ready to run while one thread waits.
• Not all algorithms benefit from parallelization.
  — There is a trade-off between "parallelization effort" and "performance gain".
  — If it takes a lot of effort to parallelize an algorithm, it may be that the algorithm has no inherent parallelism (e.g. the aircraft climbing metric).
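To make the launch-overhead and underutilization point concrete, below is a minimal CUDA sketch, not taken from the lecture: the vecAdd kernel, the block size of 256, and n = 1024 are illustrative assumptions. With only 1024 elements, just 4 blocks are launched, so there are too few resident warps to hide memory latency and the fixed cost of launching the kernel can dominate the measured time.

```cuda
// Minimal sketch (assumes the CUDA toolkit; error checking omitted for brevity).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 10;               // small data set: only 1024 elements (illustrative)
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // With n = 1024 and 256 threads per block, only 4 blocks are launched:
    // far too few threads to hide long memory access latencies, and the fixed
    // kernel-launch overhead can exceed the time spent on useful computation.
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("n = %d, kernel time = %.3f ms\n", n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Rerunning the same timing with a much larger n makes the contrast visible: the per-launch fixed cost stays roughly constant while the useful work (and the number of warps available to hide latency) grows with the data set.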