PCEVA (PC绝对领域): exploring real computer knowledge

y-cruncher v0.7.3 released: some important notes on its AVX512 support

1#
Views: 22618 | Replies: 70
Last edited by chungexcy on 2017-7-12 11:17

http://www.numberworld.org/y-cruncher/

y-cruncher recently added AVX512 support for Skylake-X, and the author also shared quite a few thoughts on this CPU. I have translated the original text:
---------------------------------------------------------------------------------------------
Skylake X and AVX512: (July 6, 2017)

Let's talk about Skylake X and AVX512. Because everyone's been waiting for this. Since there's currently a lack of AVX512 benchmarks and stress tests. And because of that, I've had at least half a dozen people and organizations contact me about y-cruncher's AVX512.

Okay... some AVX512 benchmarks already existed. SiSoftware Sandra had some support. And my little-known FLOPs benchmark did too. But people either weren't aware of them, or wanted more. And by advertising y-cruncher's internal AVX512 support for at least a year now, I basically brought this on myself.

So let's get to the point. Unfortunately, AVX512 will not bring the "instant massive performance gain" that a lot of people were expecting. Realistically speaking, the speedups over AVX2 seem to vary around 10 - 50% - usually on the lower end of that scale. While the investigation is on-going, there are some known factors:
  • Not all Skylake X and Skylake Purley processors will have the full AVX512 capability.
  • "Phantom throttling" of performance when certain thermal limits are exceeded.
  • Memory bandwidth is a significant bottleneck.
  • Amdahl's law and other unknown scalability issues.

Skylake X和AVX512:(2017年7月6日)

我们来谈谈Skylake X和AVX512。因为每个人都在等待这个。由于目前缺乏AVX512基准测试和压力测试,因此至少有六个人和组织联系我关于y-cruncher的AVX512。

好吧,一些AVX512基准测试已经有了。 SiSoftware Sandra对此有一些支持。而我的鲜为人知的FLOPs基准测试也是如此。但是人们要么不知道他们,要么就是想要更多。而y-cruncher的内部AVX512支持,至少已经宣传了一年了,至少我自己是这么想的。

所以我们来看一下吧。不幸的是,AVX512不会带来很多人期待的“立即巨大的性能增益”。实际上,相比于AVX2加速,AVX512似乎在10-50%之间变化 - 通常在小规模上。现在的调查还在进行中,不够一些已知的因素如下:
  • 并不是所有的Skylake X和Skylake Purley处理器都将拥有完整的AVX512功能。
  • 当超过某些过热限制时,性能会被“Phantom限制”。
  • 内存带宽是一个明显的瓶颈。
  • Amdahl的定律和其他未知的可扩展性问题。


Not all Skylake X and Skylake Purley processors will have the full AVX512 capability:

While this reason doesn't apply to my system, it's worth mentioning it anyway.

Architecturally, Skylake X retains Skylake desktop's architecture with 2 x 256-bit FMA units. In Skylake X, those two 256-bit FMA units can merge to form a single 512-bit FMA. On the processors with full-throughput AVX512, there is also a dedicated 512-bit FMA - thereby providing 2 x 512-bit FMA capability.

However, that dedicated 512-bit FMA is only enabled on the Core i9 parts. The 6-core and 8-core Core i7 parts are supposed to have it disabled. Therefore they only have half the AVX512 performance.

It's worth mentioning that there is a benchmark on an engineering-sample 6-core Core i7 that shows full-throughput AVX512 anyway. However, engineering sample processors are not always representative of the retail parts.

So as of this writing, I still don't know if the 6 and 8-core Skylake X Core i7's have the full AVX512. The only Skylake X processor I have at this time is the Core i9 7900X which is supposed to have the full AVX512 anyway. (and indeed it does based on my tests)

并不是所有的Skylake X和Skylake Purley处理器都将拥有完整的AVX512功能:

虽然这个原因不适用于我的系统,但这点还是值得一提的。

SkylakeX在架构上保留了Skylake桌面架构的2 x 256位FMA单元。在Skylake X中,这两个256位FMA单元可以合并形成一个512位的FMA。在具有全吞吐量AVX512的处理器上,还有一个专用的512位FMA,从而提供2 x 512位FMA功能。

但是,专用的512位FMA仅在Core i9上启用。 6核心和8核Corei7部件应该是禁用的。所以他们只有一半的AVX512性能。

值得一提的是,在工程样品6核的Corei7上有一个基准测试,显示全吞吐量的AVX512。然而,工程样品CPU并不总是代表零售样品。

所以在撰写本文时,我还是不知道6和8核心的Skylake X Core i7是否有完整的AVX512。目前我唯一拥有的Skylake X处理器是Core i9 7900X,它应该有完整的AVX512。 (实际上根据我的测试也确实如此)


"Phantom throttling" of performance when certain thermal limits are exceeded:

Within minutes of getting my system setup, I started noticing inconsistencies in performance. And after spending a long Friday night investigating the issue, I determined that there was a sort of "Phantom throttling" of AVX512 code when certain thermal limits are exceeded.

"Phantom throttling" is the term that I used to describe the problem in my emails with the Silicon Lottery vendor. And it looks like I'm not the only one using that term anymore. Phantom throttling is when the processor gets throttled without a change in clock frequency. For many years, processors have throttled down for many reasons to protect it from damage. But when throttling happens, it has always been done by lowering the clock frequency - which is visible in a monitor like CPUz. Skylake X is the first line of processors to break from this and it makes it more difficult to detect the throttling.

Right now, the phantom throttling phenomenon is still not well understood. Overclocker der8auer has mentioned that it could be caused by CPUz not reacting fast enough to actual clock frequency changes. On the other hand, the tests that Silicon Lottery and I have done seem to show that there really is no drop in clock frequency at all.

Initially, I observed this effect only with AVX512 code and thus hypothesized that the mechanism behind the throttling is the shutdown of the dedicated 512-bit FMA. But others have found that phantom throttling also occurs on AVX and scalar code as well. In short, much more investigation is needed. The lack of AVX512 programs out there certainly doesn't help and is partially why I'm rushing this release of y-cruncher v0.7.3.

Currently, there are no known reliable ways of stopping the throttling and results vary heavily by motherboard manufacturer. But maxing out thermal limits and disabling all thermal protections seems to help. (Don't try this at home if you don't know what you're doing or you aren't at least moderately experienced in overclocking. You can destroy your processor and/or motherboard if you aren't careful.)

当超过某些过热限制时,性能会被“Phantom限制”:

在我的系统做好的几分钟之内,我开始注意到性能不一致。在花了一个漫长的星期五晚上调查这个问题后,我确定当超过某些过热限制时,AVX512代码有一种“Phantom”的限制。

“Phantom限制”是我在与Silicon Lottery供应商的邮件中,用来描述这个问题的术语。看来我不是唯一一个使用这个术语的人了。Phantom限制是当处理器遇到限制而没有发生时钟频率的变化。多年来,处理器因为许多原因而被限制,以防止损坏。但是当限制发生时,一般都是通过降低时钟频率来实现,这种现象在CPUz这样的软件监控中是可见的。 SkylakeX是第一个打破这一点的处理器,它使的对限制的检测更加困难。

现在,Phantom限制现象还不太能清楚的理解。超频玩家der8auer提到这可能是由于CPUz对实际的时钟频率变化没有足够的时间反应。另一方面,Silicon Lottery和我自己所做的测试则似乎表明,根本没有时钟频率的下降。

最初,我也只是在AVX512代码上观察到这种效果,因此假设Phantom限制的机制是因为关闭了专用的512位FMA。但是其他人发现Phantom限制也发生在AVX和标量(非SIMD)代码上。总之,这还需要更多的调查。缺少AVX512程序肯定没有帮助,这也是为什么我急忙推出这个版本的y-cruncher v0.7.3原因的其中之一。

目前,没有已知的可靠的方法来阻止Phantom限制,不同的主板的结果也有很大差异。但是,过热限制调到最大并禁用所有过热保护似乎有所帮助。 (如果你不知道自己在做什么,或者你在超频中至少没有适度的经验,请不要在家中尝试。如果不小心,你会烧毁你的处理器和/或主板。)


Memory bandwidth is a significant bottleneck:

y-cruncher was already slightly memory-bound on Haswell-E. Now on Skylake X, it is much worse. While I had anticipated a memory bottleneck on Skylake X with AVX512, it seems that I've underestimated the severity of it:

(The CPU frequencies in this benchmark were chosen to be low enough to avoid any throttling or phantom throttling.)

1 billion digits of Pi - Core i9 7900X @ 3.8 GHz - Times in Seconds

Threads      Memory Frequency    AVX2       AVX512
1 thread     2133 MHz            444.434    325.543
1 thread     3200 MHz            438.432    319.737
20 threads   2133 MHz             51.884     45.658
20 threads   3200 MHz             47.672     39.723

In the single threaded benchmarks, the memory frequency has less than 2% effect for both AVX2 and AVX512. Multi-threaded, that jumps to 9% and 15% respectively. This is much more than is expected for a program that used to be completely compute-bound just a few years ago.
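The percentages quoted above can be re-derived from the table. The following is a minimal sketch, not anything from y-cruncher itself: the dictionary layout and the helper name `memory_gain` are made up for illustration, and only the timings come from the table.

```python
# Run times in seconds from the table above
# (1 billion digits of Pi, Core i9 7900X @ 3.8 GHz).
# Keys are (threads, instruction set, memory frequency in MHz).
times = {
    (1, "AVX2", 2133): 444.434,
    (1, "AVX2", 3200): 438.432,
    (1, "AVX512", 2133): 325.543,
    (1, "AVX512", 3200): 319.737,
    (20, "AVX2", 2133): 51.884,
    (20, "AVX2", 3200): 47.672,
    (20, "AVX512", 2133): 45.658,
    (20, "AVX512", 3200): 39.723,
}


def memory_gain(threads: int, isa: str) -> float:
    """Fractional speedup from raising memory from 2133 MHz to 3200 MHz."""
    slow = times[(threads, isa, 2133)]
    fast = times[(threads, isa, 3200)]
    return slow / fast - 1.0


for threads in (1, 20):
    for isa in ("AVX2", "AVX512"):
        gain = memory_gain(threads, isa)
        print(f"{threads:>2} thread(s), {isa:<6}: {gain:.1%} faster with 3200 MHz")
```

Single-threaded, both instruction sets stay under 2%; with 20 threads the gain rounds to 9% for AVX2 and 15% for AVX512, matching the figures in the text.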

内存带宽是一个重大的瓶颈:

y-cruncher在Haswell-E上已经略有一点遇上内存带宽的瓶颈。现在在Skylake X上,情况更糟。虽然我已经预见AVX512在Skylake X的内存瓶颈,但似乎我低估了它的严重性:

(该基准测试中的CPU频率被选择为足够低以避免任何限制或Phantom限制。)
        
在单线程基准测试中,内存带宽的影响在AVX2和AVX512上均小于2%。这在多线程上突然增大到9%和15%。这远远超过了几年前以前对这种完全计算的程序的内存瓶颈表现预期。


Amdahl's law and other unknown scalability issues:

In a typical y-cruncher computation, only about 80% of the CPU time is spent running vectorized code when AVX2 is used. So by Amdahl's law, even if we get perfect scaling with the AVX512, we can only cut 40% off the run-time. Right now, the single-threaded benchmarks (which are least memory-bound) are only showing 27% speedup with AVX512 over AVX2.

This remaining 13% discrepancy is currently unresolved. Microbenchmarks of y-cruncher's AVX512 code show near perfect 2x speedups over AVX2. (Some show >2x thanks to the increased register count.) But this speedup seems to drop off as the data sizes increase - even while still fitting in cache. This seems to hint at unknown bottlenecks within the L2 and L3 caches. The fact that cache sizes haven't increased along with the wider SIMD also doesn't help.

For now, investigation is difficult because none of my performance profilers support Skylake X yet.
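The Amdahl's law arithmetic in this section can be checked with a short sketch. The function name is hypothetical, and the "implied speedup" at the end is my own back-of-the-envelope inference that assumes the 80% vectorized share is exact; it is not a figure from the author.

```python
def remaining_time_fraction(vector_fraction: float, vector_speedup: float) -> float:
    """Amdahl's law: run time left after speeding up only the vectorized part."""
    return (1.0 - vector_fraction) + vector_fraction / vector_speedup


VECTOR_FRACTION = 0.80  # share of CPU time in vectorized code under AVX2 (from the text)

# Ideal case: AVX512 doubles the speed of the vectorized part.
ideal_cut = 1.0 - remaining_time_fraction(VECTOR_FRACTION, 2.0)

# Measured single-threaded times at 2133 MHz from the benchmark table.
avx2_seconds, avx512_seconds = 444.434, 325.543
measured_cut = 1.0 - avx512_seconds / avx2_seconds

# Speedup of the vectorized part that would explain the measured time,
# assuming the non-vectorized 20% is unchanged (an assumption).
implied_speedup = VECTOR_FRACTION / (avx512_seconds / avx2_seconds - (1.0 - VECTOR_FRACTION))

print(f"ideal run-time cut:    {ideal_cut:.0%}")
print(f"measured run-time cut: {measured_cut:.0%}")
print(f"implied vector-code speedup: {implied_speedup:.2f}x")
```

Under these assumptions the vectorized code behaves as if it were only about 1.5x faster rather than 2x, which is one way to express the unexplained 13% gap.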

阿姆达尔定律和其他未知的可扩展性问题:

在典型的y-cruncher计算中,当使用AVX2时,只有80%的CPU时间用于运行矢量化代码。所以根据阿姆达尔定律,即使我们使用AVX512进行了完美的缩放,我们只能把运行时间内减少40%。而现在,即使是单线程的(内存瓶颈最少的)基准测试,也只能看到AVX512比AVX2有27%的性能提升。

目前剩余的13%还尚未解决。 y-cruncher的AVX512代码的微基准测试,表现出了与AVX2相当完美的两倍加速(由于增加的寄存器数量,有些表现出> 2x)。但是这种加速似乎随着数据大小的增加而下降 ( 即使其数据量大小仍能完全跑在缓存里)。这似乎暗示L2和L3高速缓存中的未知瓶颈。事实上,缓存大小没有随着更宽的SIMD增加,这一点也没有帮助。

现在调查很困难,因为我的性能分析器都没有支持Skylake X。


Implications for Stress-Testing:

y-cruncher's failure to achieve a decent speedup for AVX512 also means that it is unable to put a heavy load on the AVX512 computation units. Therefore it is not a great stress-test for Skylake X with full AVX512.

But there is one y-cruncher feature which seems to be unaffected - the BBP benchmark.

The BBP benchmark feature is contained entirely in cache and is thus free of the memory bottleneck. It is able to put a much higher stress than the stress-tester and the computations. So if you run the BBP benchmark (option 4) and set the offset to 100 billion, you can still put a pretty heavy load on your AVX512-capable processor.

A future version of y-cruncher will revamp the stress-tester to incorporate the BBP benchmark as well as other possible improvements.

压力测试的影响:

对于AVX512,y-cruncher未能实现全面的加速,这也意味着无法对AVX512计算单元造成沉重的负担。因此,对于具有完整AVX512的Skylake X来说,这不是很大的压力测试。

但是有一个y-cruncher功能似乎不受影响 - BBP基准测试。

BBP基准测试功能完全跑在缓存中,因此没有内存瓶颈。它能够比压力测试软件和计算软件造成更高的压力。因此,如果你运行BBP基准测试(选项4)并将偏移量设置为100billion,你仍然可以在支持AVX512的处理器上造成相当大的负担。

y-cruncher的未来版本将重新改进压力测试部分,以纳入BBP基准以及其他可能的改进。

---------------------------------------------------------------------------------------------

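My own reading of y-cruncher's remarks:

1. AVX512 execution units:

I agree with the author: Skylake-X should be (256-bit AVX + 256-bit AVX) + 512-bit AVX, where the first two can be fused and used as a single AVX512 unit.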
The ports are split as in the diagram, with two load/store units, one dedicated store address unit and one dedicated store data unit. The four ALU ports support a subset of all the ALU features, which Intel states has been balanced over Haswell. The full breakdown is as follows:

    Port 0: ALU/Vec ALU, Vec Shft/Vec Add, Vec Mul/FMA, DIV, Branch2
    Port 1: ALU/Vec ALU/Fast LEA, Vec Shift/Vec Add, Vec Mul/FMA, Slow Int, Slow LEA
    Port 2: Load/Store Address
    Port 3: Load/Store Address
    Port 4: Store Data
    Port 5: ALU/Vec ALU/Fast LEA, Vec Shuffle, (FMA on 10-core SKL-X)
    Port 6: ALU/Shift, Branch1
    Port 7: Store Address

The above is AnandTech's description: Port 0 and Port 1 each have a 256-bit FMA, and there is one 512-bit FMA on Port 5.
Nominally the FMAs on ports 0 and 1 are 256-bit, so in order to drive towards the AVX-512-F these two ports are fused together, similar to how AVX-512-F is implemented in Knights Landing. The six-core and eight-core Skylake-X parts support one fused FMA for AVX-512-F, although the 10-core will support dual 512-bit AVX-512-F ports, which seems to be located on port 5. This means that the 10-core i9-7900X can support 64 SP or 32 DP calculations per cycle, whereas the 8-core/6-core parts can support 32 SP or 16 DP per cycle.
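The per-cycle figures in the AnandTech quote follow from simple lane counting. A small sketch, with a made-up helper name, counting each fused multiply-add as two floating-point operations:

```python
def flops_per_cycle(fma_units: int, vector_bits: int, element_bits: int) -> int:
    """Peak floating-point operations per cycle for one core."""
    lanes = vector_bits // element_bits  # elements processed per instruction
    return fma_units * lanes * 2         # one FMA = one multiply + one add


SP_BITS, DP_BITS = 32, 64  # single and double precision element widths

configs = (
    ("10-core part, two 512-bit FMAs", 2),
    ("6/8-core parts, one fused 512-bit FMA", 1),
)
for label, units in configs:
    sp = flops_per_cycle(units, 512, SP_BITS)
    dp = flops_per_cycle(units, 512, DP_BITS)
    print(f"{label}: {sp} SP or {dp} DP per cycle")
```

This reproduces the quoted 64 SP / 32 DP for the i9-7900X and 32 SP / 16 DP for the 6-core and 8-core parts.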

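How this AVX512 unit is implemented does not really matter, in my view. Even if it is two units fused together, you can only use it as a single AVX512 unit; it cannot act as two extra 256-bit units.
In other words, a program optimized for 256-bit AVX, however perfectly tuned, will not see any special gain, because all it can use is still the original two 256-bit units. That only changes if it is re-optimized for AVX512.

Update: the Skylake-SP slides have now confirmed this.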



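2. AVX512 registers: width and count both doubled:

In the 256-bit AVX era there were only 16 256-bit registers, YMM0-YMM15.
AVX512 first widens the registers from 256 to 512 bits and also raises their number to 32, ZMM0-ZMM31.
Compared with 256-bit AVX, the registers can hold 4 times as much data. Twice as many registers lets more values stay in registers, which in turn reduces read/write pressure on the L1 cache.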
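The "4 times" figure for register capacity is just count times width. A quick sketch (the helper name is hypothetical):

```python
def register_file_bytes(count: int, width_bits: int) -> int:
    """Total architectural vector register capacity in bytes."""
    return count * width_bits // 8


ymm_bytes = register_file_bytes(16, 256)  # YMM0-YMM15
zmm_bytes = register_file_bytes(32, 512)  # ZMM0-ZMM31

print(f"AVX/AVX2: {ymm_bytes} bytes ({ymm_bytes // 8} doubles)")
print(f"AVX512:   {zmm_bytes} bytes ({zmm_bytes // 8} doubles)")
print(f"ratio:    {zmm_bytes // ymm_bytes}x")
```

So the architectural vector registers grow from 512 bytes to 2 KB, which is still small next to the 32 KB L1 data cache but can matter for kernels that keep many intermediate values live.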


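3. L1 cache bandwidth doubled:

According to AnandTech, Intel actually improved the L1 cache: read bandwidth went from 2x256 bit/cycle to 2x512 bit/cycle, and write bandwidth from 256 bit/cycle to 512 bit/cycle.
A simple way to picture it: for a + b = c, reading a and b takes one cycle and writing back c takes one cycle, with every operand 512 bits wide.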
The four load/store related units serve the writeback 32KB L1 Data cache with 8-way associativity and 4-cycle latency. In Skylake-S, this supported two 32-byte reads and one 32-byte store per cycle: in the new Skylake-SP core, this is doubled although Intel only stated that it was ‘128 bytes per cycle read and 64 bytes per cycle write’, which we would assume to mean 2x64B read and 1x64B write. This is backed by an L1D TLB, supporting 64x4KB entries per thread with 4-way associativity.

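See the Intel slide above. AIDA64 could not measure this, I suspect because its current L1 test does not yet support 512-bit accesses. A 7900X at 4 GHz should have 5120 GB/s of L1 read bandwidth summed over its 10 cores. Current tests do not get close to the L1 limit and still need optimization. The Tech Report test does give some hints:


The L1 size is still 32 KB data and 32 KB instruction, which has not changed since Nehalem, the first generation of the Core i series.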
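The 5120 GB/s L1 read figure above comes from multiplying bytes per cycle by clock and core count. A minimal sketch, assuming AnandTech's reading of 2x64 B reads and 1x64 B write per cycle and using decimal gigabytes; the function name is made up:

```python
def l1_bandwidth_gb_s(bytes_per_cycle: int, clock_ghz: float, cores: int) -> float:
    """Aggregate peak L1 bandwidth in GB/s (bytes/cycle x Gcycles/s x cores)."""
    return bytes_per_cycle * clock_ghz * cores


read_gb_s = l1_bandwidth_gb_s(128, 4.0, 10)   # 2 x 64-byte loads per cycle
write_gb_s = l1_bandwidth_gb_s(64, 4.0, 10)   # 1 x 64-byte store per cycle

print(f"peak L1 read:  {read_gb_s:.0f} GB/s")
print(f"peak L1 write: {write_gb_s:.0f} GB/s")
```

These are theoretical peaks for a 10-core part at 4 GHz; a real benchmark would only approach them with 512-bit loads on every core.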


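4. Phantom throttling:

@royalk already mentioned this problem to me in his preview; for now I can only offer some guesses.
Hitting the power limit at base frequency is probably what shows up as phantom throttling; hitting the temperature limit should simply downclock the overheating cores.
If the power limit is hit under turbo, the CPU should first keep lowering the clock until power just equals TDP. If it has already dropped to base frequency and power is still too high, that is when phantom throttling should kick in.
All of the above is my speculation.

On phantom throttling I lean toward the author's guess: it may shut down some execution units to relieve power pressure, for example one of the two FMAs or one of the two ALUs. If that is really the case, I think its energy efficiency is far worse than lowering the clock. As an analogy, under an all-core load, turning off cores is worse than lowering the frequency a little.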


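5. Memory bandwidth:
As I recall, y-cruncher was not memory-sensitive before this generation. This time the gap is about 1.5% single-threaded and 9-15% with 10 cores / 20 threads. My guess is that this is because the L3 has become exclusive-style (non-inclusive), so memory performance may directly affect the L2. Without an AIDA64 update this is still hard to confirm or test; AIDA64 v5.92 still does not mention AVX512.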


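6. Diminishing returns of AVX512:
Amdahl's law:

The speedup limit of parallel computing is set by the fraction that cannot be parallelized. For example, if only 90% of the work can be parallelized, then no matter how many cores you have, even if the parallel part finishes instantly, you can reach at most 10 times the original speed!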

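So here is the problem:

1. Take a program whose non-parallelizable part takes 60 s and whose parallelizable part takes 160 s, for a total run time of 220 s. This is the time needed with scalar code.
2. Now you optimize the parallelizable part with SSE2: the original 160 s of computation drops to 40 s and the total drops to 100 s. You find that Intel's SSE2 has made your code 120% faster (2.2x)!
3. One day you find that Intel offers AVX, so you optimize the code for AVX: the 40 s drops to 20 s and the total drops to 80 s. Now the gain is only 25%, where before it was 120%.
4. You optimize once more with AVX512, cutting the remaining 20 s to 10 s, and the total drops to 70 s. The gain is now only about 14% (a 12.5% cut in run time).

And that is still the ideal case: 220 s -> 100 s -> 80 s -> 70 s. This example is exactly the current state of the x264 encoder.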
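The worked example above (220 s -> 100 s -> 80 s -> 70 s) can be reproduced in a few lines. This is only an illustration of the poster's hypothetical numbers; the variable and function names are made up. Measured as a speedup, the last step works out to about 14%, which is the same thing as a 12.5% cut in run time.

```python
SERIAL_SECONDS = 60.0  # part that cannot be vectorized

# Time spent in the vectorizable part at each optimization level.
vector_seconds = [
    ("scalar", 160.0),
    ("SSE2", 40.0),
    ("AVX", 20.0),
    ("AVX512", 10.0),
]

totals = [(name, SERIAL_SECONDS + seconds) for name, seconds in vector_seconds]


def step_gains(totals):
    """Speedup of each level relative to the previous one, as a fraction."""
    gains = []
    for (_, before), (name, after) in zip(totals, totals[1:]):
        gains.append((name, before / after - 1.0))
    return gains


gains = step_gains(totals)
for (name, total), (_, gain) in zip(totals[1:], gains):
    print(f"{name:<6}: total {total:.0f} s, {gain:.0%} faster than the previous step")
```

Each halving of the vectorized time buys less, because the fixed 60 s serial part makes up a growing share of the total.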

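With AVX2, y-cruncher spends 80% of its single-threaded run time in vectorized code. Going from SSE to AVX2 earlier even gave more than a 2x gain, which was near-perfect scaling.
For AVX512, however, the ideal case is a 40% cut in run time (half of that 80%). In practice, the single-threaded run time drops by 27% (where memory is not even the bottleneck). With 3200 MHz memory, 20 threads drop by only 17%, and with 2133 MHz memory by only 12%. The bottleneck is clearly in the cache hierarchy and memory bandwidth, which is the hardest thing to optimize.
If you compare AVX2 with 3200 MHz memory against AVX512 with 2133 MHz memory, the performance gap is a mere 4.2%... The impact of memory frequency on Skylake X/EP is probably much larger than expected.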
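The 27%, 17%, 12% and 4.2% figures above are run-time reductions computed from the benchmark table in the quoted release notes. A short sketch that recomputes them; the dictionary layout and the `reduction` helper are hypothetical, only the timings come from the table:

```python
# Seconds for 1 billion digits of Pi on the Core i9 7900X @ 3.8 GHz.
# Keys are (instruction set, threads, memory frequency in MHz).
times = {
    ("AVX2", 1, 2133): 444.434,
    ("AVX512", 1, 2133): 325.543,
    ("AVX2", 20, 2133): 51.884,
    ("AVX512", 20, 2133): 45.658,
    ("AVX2", 20, 3200): 47.672,
    ("AVX512", 20, 3200): 39.723,
}


def reduction(baseline: float, improved: float) -> float:
    """Fraction of run time saved when going from baseline to improved."""
    return 1.0 - improved / baseline


single = reduction(times[("AVX2", 1, 2133)], times[("AVX512", 1, 2133)])
multi_3200 = reduction(times[("AVX2", 20, 3200)], times[("AVX512", 20, 3200)])
multi_2133 = reduction(times[("AVX2", 20, 2133)], times[("AVX512", 20, 2133)])
# AVX2 with fast memory versus AVX512 with slow memory.
cross = reduction(times[("AVX2", 20, 3200)], times[("AVX512", 20, 2133)])

print(f"1 thread:              {single:.1%}")
print(f"20 threads, 3200 MHz:  {multi_3200:.1%}")
print(f"20 threads, 2133 MHz:  {multi_2133:.1%}")
print(f"AVX2@3200 vs AVX512@2133: {cross:.1%}")
```

The last line is the striking one: slower memory gives back most of what AVX512 gains over AVX2 in the 20-thread run.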

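Below is a comparison of the 5960X and the 7900X after optimization. So not every program has LINPACK's degree of parallelism.


Desktop Skylake was once meant to get AVX512; perhaps this is one of the reasons Intel kept delaying AVX512.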


2#
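chungexcy (original poster) | Posted 2017-7-8 17:20
gtx9 wrote on 2017-7-8 16:51:
Intel's official test, taking the difference in core count into account

LINPACK efficiency with AVX512 is about 80%

LINPACK with AVX2 is also only a little over 80%. Seen this way, Skylake SP's actual clock under LINPACK must be about 10% lower than Broadwell SP's.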

3#
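chungexcy (original poster) | Posted 2017-7-8 23:48
gtx9 wrote on 2017-7-8 17:34:
Also

the SiSoftware cache test, which already supports AVX512


http://techreport.com/review/32111/intel-core-i9-7900x-cpu-reviewed-part-one/4

I feel SiSoftware's cache bandwidth test still needs optimization, and its numbers cannot be compared directly with other test software.
For data sizes within 32K*10, the peak reaches more than 1.5x that of the 6950X.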

4#
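chungexcy (original poster) | Posted 2017-7-9 12:13
royalk wrote on 2017-7-9 11:56:
AVX512 efficiency is worrying, but this generation has a small IPC gain in rendering and encoding/decoding workloads, around 7-10%. If the gap is 2.5 to 3.4 GHz, then ryze ...

This review is based on AVX2: 50 million digits takes 1.795 s at default clocks.
According to the table posted earlier, 50 million digits with AVX512 can drop to 1.475 s, but the gain is still limited.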

5#
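chungexcy (original poster) | Posted 2017-7-9 16:18
royalk wrote on 2017-7-9 15:03:
I just ran build 9471: 1.577 s at 4G/2.4/3200

Does the official result use some special benchmarking method?
Is the difference large at 1 billion digits?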
6#
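chungexcy (original poster) | Posted 2017-7-9 16:22
Last edited by chungexcy on 2017-7-9 16:25
royalk wrote on 2017-7-9 16:14:
A correction to some errors in my earlier reply: I had assumed the low GFLOPS at default settings was due to the TDP limit, but apparently it is not.
From the figures below ...

Wow... So at 3.3 GHz with overclocked memory, would performance and power draw end up higher than at 3.6 GHz without overclocked memory?

And I just realized that Skylake Xeons cannot overclock memory, so their performance is probably in trouble...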
7#
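chungexcy (original poster) | Posted 2017-7-9 16:42
Last edited by chungexcy on 2017-7-9 16:48
royalk wrote on 2017-7-9 16:33:
50m is too fast to record an accurate value. Running 1 billion, power fluctuates a lot, between 200 and 270 W, while linpack easily reaches 340 W.
...

Oh right, this program's power draw does fluctuate. Running with the TDP locked, I can see turbo jumping around by 0.2-0.3 GHz...

As for that difference in scores, I don't know whether the author took the Pi time or the total computation time. HWBOT uses total computation time.

PS: I just checked the website; the scores use total computation time.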
8#
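chungexcy (original poster) | Posted 2017-7-9 16:52
gtx9 wrote on 2017-7-9 16:45:
Xeons cannot overclock memory, but Xeon has 6 memory channels

Perhaps at a 2.2 GHz core clock the memory frequency matters a bit less. But without the old L3 cache acting as a buffer, I wonder whether 6 channels are really enough for the AVX512 throughput of 28 cores...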
9#
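chungexcy (original poster) | Posted 2017-7-9 17:06
royalk wrote on 2017-7-9 16:54:
All defaults, only the memory overclocked to 4000
3.3G, still no downclocking recorded
Maximum power recorded: 264 W

Impressive, Intel... 3.3 GHz + 4000 MHz > 4.0 GHz + 3200 MHz...

A question about memory overclocking: these days, do you no longer have to go up in steps of 266 or 133?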
10#
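chungexcy (original poster) | Posted 2017-7-9 17:22
royalk wrote on 2017-7-9 17:14:
Since IVB, memory overclocking can use the 100/133 dividers. SKL/KBL go up to a 31x multiplier, so the 133 divider reaches 4133 and the 100 divider is 310 ...

Oh, I see now.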
11#
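chungexcy (original poster) | Posted 2017-7-10 09:26
royalk wrote on 2017-7-9 18:17:
Just compared against the 6950x platform, 50 million, single-threaded
6950x 4G/3.1ring/3200 15.072s
7900x 4G/2.4mesh/3200 10.1 ...

The gain is actually fairly limited. Skylake's AVX2 integer units were improved and are nearly 15% faster than Broadwell's.
As things look now, though, Skylake X probably lands slightly below 15% because of the memory bottleneck? The gain of AVX512 over AVX2 is then roughly the 27% the author complained about.

http://www.numberworld.org/y-cruncher/versions.html
Version v0.7.2.9469 already lets Skylake-X run the AVX2 build.

Multi-threaded, the memory bottleneck affects even AVX2... I wonder whether Cinebench R15 is in a similar situation...

On the 6950x, does memory frequency have little effect on multi-threaded runs?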

12#
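chungexcy (original poster) | Posted 2017-7-10 09:36
gtx9 wrote on 2017-7-9 20:44:
The gap has shrunk quite a lot... could it be that 50 million is not a large enough computation?

Look at the table I posted in the first post: multi-threaded, it is clearly memory-bound, and even AVX2 on Skylake X is heavily affected...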

13#
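chungexcy (original poster) | Posted 2017-7-10 10:12
ydjj wrote on 2017-7-10 10:06:
The memory frequencies in that chart are too low, only 2133 and 3200
The i9 can run quad-channel overclocked to 4000 and above; I wonder whether there is still a bottleneck
...

If 2133 or even 3200 is already a bottleneck for AVX2, then 4000 and above cannot possibly satisfy AVX512's doubled throughput demand...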
14#
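chungexcy (original poster) | Posted 2017-7-10 13:46
royalk wrote on 2017-7-10 11:38:
Right now my integer tests show essentially the same per-clock performance; the advantage of the integer units does not show at all, and the enlarged L2 does not seem to help either…
...

The 15% should refer only to the integer part of AVX2. The AVX2 build of y-cruncher, as well as x265, should still gain at the same clock, unless they are limited by memory again... Other integer and floating-point work should be about the same 3-4% as the earlier Broadwell-to-Skylake step, provided memory does not get in the way...

The ring change almost certainly cannot escape the blame here... Even if the ring's bandwidth does not scale with core count, it is still far faster than memory overclocked to 100 GB/s of read bandwidth, and it could feed 22 cores running AVX2 without much trouble... As for the L2: if the switch to mesh has left the L3 nearly useless, then the previous 2-2.5 MB per core has effectively dropped straight to 1 MB...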


15#
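chungexcy (original poster) | Posted 2017-7-12 13:44
Last edited by chungexcy on 2017-7-12 13:49
dogbear wrote on 2017-7-11 16:09:
After reading the article, it feels like the support for this is quite limited: you need an i9, and you also run into memory-bandwidth effects and overheating. A question: AVX512's current real ...

A whole range of high-performance scientific computing applications are optimized for AVX/AVX512: everything that computes through Intel MKL, such as MATLAB.

Any software that is well optimized for AVX will gain something once it is optimized for AVX512, even on an i7.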
