This post was last edited by PolyMorph on 2016-5-27 17:56
Current GPUs (for example, as shown in FIG. 3) support only a single, uniform wavefront size (for example, logically supporting 64-thread-wide vectors by piping threads through 16-thread-wide vector units over four cycles). Vector units of varying widths (for example, as shown in FIG. 4) may be provided to service smaller wavefronts, such as a four-thread-wide vector unit piped over four cycles to support wavefronts of 16-element vectors. In addition, a high-performance scalar unit that executes the same opcodes as the vector units may be used to run critical threads within kernels faster than is possible in existing vector pipelines. Such a high-performance scalar unit may, in certain instances, allow a laggard thread (as described above) to be accelerated.
By dynamically issuing wavefronts to the execution unit best suited to their size and performance needs, better performance and/or energy efficiency than existing GPU architectures may be obtained. If a wavefront has 64 threads but only 16 of them are active, the wavefront may be scheduled to a four-thread-wide SIMD unit instead of a 16-thread-wide SIMD unit. Based on demand, the scheduler determines whether to schedule the wavefront to all of the four-thread-wide SIMD units or just to a subset of them. This determination provides finer-grained control over how the threads are executed.
Threads that migrate between these functional units can have their context (register values) migrated with the help of software (using "spill" and "fill" instructions) or with dedicated hardware that assists the migration. Alternatively, only the data needed for the upcoming instruction or instructions can be forwarded along with the work through a register functional unit crossbar or other interconnection.
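To make the dispatch idea concrete, here is a minimal Python sketch of a scheduler picking the narrowest unit that can still hold a wavefront's active threads. Everything named here (ExecUnit, UNITS, pick_unit) is hypothetical and not from the patent text. The 16x4 and 4x4 widths come from the passage above, the scalar unit is assumed to be 1 lane x 1 cycle, and the sketch assumes active threads can be repacked onto the narrower unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecUnit:
    name: str
    lanes: int   # physical lanes in the unit
    cycles: int  # cycles a wavefront is piped over

    @property
    def logical_width(self) -> int:
        # Threads the unit can cover in one pass: lanes x cycles.
        return self.lanes * self.cycles


# Hypothetical unit mix. 16x4 = 64 and 4x4 = 16 follow the text;
# the scalar unit's 1x1 shape is an assumption for illustration.
UNITS = [
    ExecUnit("scalar", lanes=1, cycles=1),
    ExecUnit("simd4", lanes=4, cycles=4),
    ExecUnit("simd16", lanes=16, cycles=4),
]


def pick_unit(active_threads: int, units=UNITS) -> ExecUnit:
    """Return the narrowest unit whose logical width covers the active threads."""
    if active_threads < 1:
        raise ValueError("a wavefront needs at least one active thread")
    for unit in sorted(units, key=lambda u: u.logical_width):
        if unit.logical_width >= active_threads:
            return unit
    raise ValueError("wavefront is wider than every available unit")


# A 64-thread wavefront with only 16 active threads goes to the narrow unit.
print(pick_unit(16).name)
print(pick_unit(64).name)
```

A real scheduler would also have to weigh unit availability and the cost of migrating register context, which this sketch leaves out.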
By dispatching work to a narrower vector unit compared to the baseline wide vector unit, it is possible to execute only as many threads as will actually produce results, thereby saving power.
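A rough way to see the saving is to count lane-slots (lanes x cycles) issued versus lane-slots that hold an active thread. The helper below is a hypothetical illustration, not something from the patent. It only counts occupancy and says nothing about actual power numbers.

```python
def lane_utilization(active_threads: int, lanes: int, cycles: int = 4) -> float:
    """Fraction of issued lane-slots (lanes x cycles) that hold an active thread."""
    slots = lanes * cycles
    if not 0 < active_threads <= slots:
        raise ValueError("active thread count must fit in the unit's lane-slots")
    return active_threads / slots


# 16 active threads on the baseline 16-wide unit piped over four cycles:
wide = lane_utilization(16, lanes=16)    # 16 of 64 slots used
# The same 16 active threads on a four-wide unit piped over four cycles:
narrow = lane_utilization(16, lanes=4)   # 16 of 16 slots used

print(wide, narrow)  # 0.25 1.0
```

Under these assumptions, three quarters of the lane-slots on the wide unit would carry no useful work, while the narrow unit is fully occupied.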
Dispatching wavefronts with different thread counts to SIMD units of different widths can reduce the rate at which lanes run idle.