PCEVA (PC绝对领域): in search of real computer knowledge

y-cruncher v0.7.3 released, with some important notes on AVX512 support

1#
chungexcy posted on 2017-7-8 16:33
Views: 22540 | Replies: 70
Last edited by chungexcy on 2017-7-12 11:17

http://www.numberworld.org/y-cruncher/

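y-cruncher recently added AVX512 support for Skylake-X, and its author also shared quite a few thoughts on this CPU. The original text, which I also translated into Chinese, follows: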
---------------------------------------------------------------------------------------------
Skylake X and AVX512: (July 6, 2017)

Let's talk about Skylake X and AVX512. Because everyone's been waiting for this. Since there's currently a lack of AVX512 benchmarks and stress tests. And because of that, I've had at least half a dozen people and organizations contact me about y-cruncher's AVX512.

Okay... some AVX512 benchmarks already existed. SiSoftware Sandra had some support. And my little-known FLOPs benchmark did too. But people either weren't aware of them, or wanted more. And by advertising y-cruncher's internal AVX512 support for at least a year now, I basically brought this on myself.

So let's get to the point. Unfortunately, AVX512 will not bring the "instant massive performance gain" that a lot of people were expecting. Realistically speaking, the speedups over AVX2 seem to vary around 10 - 50% - usually on the lower end of that scale. While the investigation is on-going, there are some known factors:
  • Not all Skylake X and Skylake Purley processors will have the full AVX512 capability.
  • "Phantom throttling" of performance when certain thermal limits are exceeded.
  • Memory bandwidth is a significant bottleneck.
  • Amdahl's law and other unknown scalability issues.



Not all Skylake X and Skylake Purley processors will have the full AVX512 capability:

While this reason doesn't apply to my system, it's worth mentioning it anyway.

Architecturally, Skylake X retains Skylake desktop's architecture with 2 x 256-bit FMA units. In Skylake X, those two 256-bit FMA units can merge to form a single 512-bit FMA. On the processors with full-throughput AVX512, there is also a dedicated 512-bit FMA - thereby providing 2 x 512-bit FMA capability.

However, that dedicated 512-bit FMA is only enabled on the Core i9 parts. The 6-core and 8-core Core i7 parts are supposed to have it disabled. Therefore they only have half the AVX512 performance.

It's worth mentioning that there is a benchmark on an engineering-sample 6-core Core i7 that shows full-throughput AVX512 anyway. However, engineering sample processors are not always representative of the retail parts.

So as of this writing, I still don't know if the 6 and 8-core Skylake X Core i7's have the full AVX512. The only Skylake X processor I have at this time is the Core i9 7900X which is supposed to have the full AVX512 anyway. (and indeed it does based on my tests)



"Phantom throttling" of performance when certain thermal limits are exceeded:

Within minutes of getting my system setup, I started noticing inconsistencies in performance. And after spending a long Friday night investigating the issue, I determined that there was a sort of "Phantom throttling" of AVX512 code when certain thermal limits are exceeded.

"Phantom throttling" is the term that I used to describe the problem in my emails with the Silicon Lottery vendor. And it looks like I'm not the only one using that term anymore. Phantom throttling is when the processor gets throttled without a change in clock frequency. For many years, processors have throttled down for many reasons to protect it from damage. But when throttling happens, it has always been done by lowering the clock frequency - which is visible in a monitor like CPUz. Skylake X is the first line of processors to break from this and it makes it more difficult to detect the throttling.

Right now, the phantom throttling phenomenon is still not well understood. Overclocker der8auer has mentioned that it could be caused by CPUz not reacting fast enough to actual clock frequency changes. On the other hand, the tests that Silicon Lottery and myself have done seem to show that there really is no drop in clock frequency at all.

Initially, I observed this effect only with AVX512 code and thus hypothesized that the mechanism behind the throttling is the shutdown of the dedicated 512-bit FMA. But others have found that phantom throttling also occurs on AVX and scalar code as well. In short, much more investigation is needed. The lack of AVX512 programs out there certainly doesn't help and is partially why I'm rushing this release of y-cruncher v0.7.3.

Currently, there are no known reliable ways of stopping the throttling and results vary heavily by motherboard manufacturer. But maxing out thermal limits and disabling all thermal protections seems to help. (Don't try this at home if you don't know what you're doing or you aren't at least moderately experienced in overclocking. You can destroy your processor and/or motherboard if you aren't careful.)



Memory bandwidth is a significant bottleneck:

y-cruncher was already slightly memory-bound on Haswell-E. Now on Skylake X, it is much worse. While I had anticipated a memory bottleneck on Skylake X with AVX512, it seems that I've underestimated the severity of it:

(The CPU frequencies in this benchmark were chosen to be low enough to avoid any throttling or phantom throttling.)

1 billion digits of Pi - Core i9 7900X @ 3.8 GHz - Times in Seconds

                                       Instruction Set
Threads       Memory Frequency         AVX2       AVX512
1 thread      2133 MHz              444.434      325.543
1 thread      3200 MHz              438.432      319.737
20 threads    2133 MHz               51.884       45.658
20 threads    3200 MHz               47.672       39.723

In the single threaded benchmarks, the memory frequency has less than 2% effect for both AVX2 and AVX512. Multi-threaded, that jumps to 9% and 15% respectively. This is much more than is expected for a program that used to be completely compute-bound just a few years ago.
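The percentages in that paragraph can be rechecked from the table. The sketch below is only an illustration: it assumes "effect" means the extra run time at 2133 MHz relative to the 3200 MHz run, which is the definition that matches the quoted 9% and 15%. The names `times` and `memory_effect` are made up for this example.

```python
# Run times in seconds, copied from the table above.
# Key: (threads, instruction set); value: {memory frequency in MHz: seconds}
times = {
    (1, "AVX2"): {2133: 444.434, 3200: 438.432},
    (1, "AVX512"): {2133: 325.543, 3200: 319.737},
    (20, "AVX2"): {2133: 51.884, 3200: 47.672},
    (20, "AVX512"): {2133: 45.658, 3200: 39.723},
}


def memory_effect(slow: float, fast: float) -> float:
    """Extra run time at the slower memory speed, as a percentage of the faster run."""
    return (slow / fast - 1.0) * 100.0


# Assumed definition of "effect": (time at 2133 MHz / time at 3200 MHz) - 1.
effects = {key: memory_effect(t[2133], t[3200]) for key, t in times.items()}

for (threads, isa), pct in effects.items():
    print(f"{threads:>2} thread(s) {isa:<6} {pct:5.1f}% slower at 2133 MHz")
```

Under that definition both single-threaded rows stay below 2%, and the 20-thread rows round to 9% (AVX2) and 15% (AVX512).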



Amdahl's law and other unknown scalability issues:

In a typical y-cruncher computation, only about 80% of the CPU time is spent running vectorized code when AVX2 is used. So by Amdahl's law, even if we get perfect scaling with the AVX512, we can only cut 40% off the run-time. Right now, the single-threaded benchmarks (which are least memory-bound) are only showing 27% speedup with AVX512 over AVX2.

This remaining 13% discrepancy is currently unresolved. Microbenchmarks of y-cruncher's AVX512 code show near perfect 2x speedups over AVX2. (Some show >2x thanks to the increased register count.) But this speedup seems to drop off as the data sizes increase - even while still fitting in cache. This seems to hint at unknown bottlenecks within the L2 and L3 caches. The fact that cache sizes haven't increased along with the wider SIMD also doesn't help.
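To make the 40% versus 27% gap concrete, here is a minimal sketch of the Amdahl's law arithmetic. It takes the 80% vectorized fraction from the text and the single-threaded 2133 MHz times from the table above; the function names are hypothetical. Note that the 27% matches the reduction in run time (AVX512 time over AVX2 time), which is how it is treated here.

```python
def amdahl_time(vector_fraction: float, vector_speedup: float) -> float:
    """Relative run time when only the vectorized fraction gets faster."""
    return (1.0 - vector_fraction) + vector_fraction / vector_speedup


def implied_vector_speedup(vector_fraction: float, relative_time: float) -> float:
    """Invert Amdahl's law: speedup of the vectorized part implied by a measured run time."""
    return vector_fraction / (relative_time - (1.0 - vector_fraction))


ideal = amdahl_time(0.80, 2.0)      # perfect 2x on the vectorized 80%
measured = 325.543 / 444.434        # AVX512 time / AVX2 time, 1 thread, 2133 MHz
implied = implied_vector_speedup(0.80, measured)

print(f"ideal run-time cut:    {(1 - ideal) * 100:.0f}%")
print(f"measured run-time cut: {(1 - measured) * 100:.0f}%")
print(f"implied speedup of the vectorized code: {implied:.2f}x")
```

If the 80% figure holds, the measured times correspond to the vectorized code running roughly 1.5x faster rather than 2x, which is one way to read the unresolved discrepancy.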

For now, investigation is difficult because none of my performance profilers support Skylake X yet.



Implications for Stress-Testing:

y-cruncher's failure to achieve a decent speedup for AVX512 also means that it is unable to put a heavy load on the AVX512 computation units. Therefore it is not a great stress-test for Skylake X with full AVX512.

But there is one y-cruncher feature which seems to be unaffected - the BBP benchmark.

The BBP benchmark feature is contained entirely in cache and is thus free of the memory bottleneck. It is able to put a much higher stress than the stress-tester and the computations. So if you run the BBP benchmark (option 4) and set the offset to 100 billion, you can still put a pretty heavy load on your AVX512-capable processor.

A future version of y-cruncher will revamp the stress-tester to incorporate the BBP benchmark as well as other possible improvements.


---------------------------------------------------------------------------------------------

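My reading of what the y-cruncher author says:

1. AVX512 execution units:

I agree with the author: Skylake-X is most likely (AVX256 + AVX256) + AVX512, where the first two can be fused and used as a single AVX512 unit.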
The ports are split as in the diagram, with two load/store units, one dedicated store address unit and one dedicated store data unit. The four ALU ports support a subset of all the ALU features, which Intel states has been balanced over Haswell. The full breakdown is as follows:

    Port 0: ALU/Vec ALU, Vec Shft/Vec Add, Vec Mul/FMA, DIV, Branch2
    Port 1: ALU/Vec ALU/Fast LEA, Vec Shift/Vec Add, Vec Mul/FMA, Slow Int, Slow LEA
    Port 2: Load/Store Address
    Port 3: Load/Store Address
    Port 4: Store Data
    Port 5: ALU/Vec ALU/Fast LEA, Vec Shuffle, (FMA on 10-core SKL-X)
    Port 6: ALU/Shift, Branch1
    Port 7: Store Address

The above is AnandTech's description: ports 0 and 1 each carry a 256-bit FMA, and a 512-bit FMA sits on port 5.
Nominally the FMAs on ports 0 and 1 are 256-bit, so in order to drive towards the AVX-512-F these two ports are fused together, similar to how AVX-512-F is implemented in Knights Landing. The six-core and eight-core Skylake-X parts support one fused FMA for AVX-512-F, although the 10-core will support dual 512-bit AVX-512-F ports, which seems to be located on port 5. This means that the 10-core i9-7900X can support 64 SP or 32 DP calculations per cycle, whereas the 8-core/6-core parts can support 32 SP or 16 DP per cycle.
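The per-cycle figures in that quote follow from simple lane counting. A minimal sketch, assuming each FMA counts as two floating-point operations per lane (the function name is made up for illustration):

```python
def flops_per_cycle(fma_units: int, vector_bits: int, element_bits: int) -> int:
    """Peak floating-point operations per cycle for one core.

    Each FMA is counted as two operations (a multiply and an add) per vector lane.
    """
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2


# Two 512-bit FMA pipes (fused ports 0+1 plus port 5), as described for the 10-core part.
sp_dual = flops_per_cycle(2, 512, 32)    # single precision
dp_dual = flops_per_cycle(2, 512, 64)    # double precision

# One fused 512-bit FMA pipe, as described for the 6-core and 8-core parts.
sp_single = flops_per_cycle(1, 512, 32)
dp_single = flops_per_cycle(1, 512, 64)

print(sp_dual, dp_dual, sp_single, dp_single)
```

This gives 64 SP / 32 DP for the dual-FMA case and 32 SP / 16 DP for the single-FMA case, matching the numbers in the quote.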

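How this AVX512 capability is implemented does not really matter, in my view. Even if one unit is made of two halves fused together, you can only use it as a single AVX512 unit, not as two extra AVX256 units.
In other words, a program optimized for AVX256, however perfectly, will not see any special gain, because all it can use is still the original two AVX256 units. It has to be re-optimized for AVX512.

Update: the Skylake-SP slides have since confirmed this.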



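2. AVX512 registers: width and count both doubled:

In the AVX256 era there were only sixteen 256-bit registers, YMM0–YMM15.
AVX512 first widens the registers from 256 to 512 bits and also raises their number to 32, ZMM0-ZMM31.
Compared with AVX256, the registers can hold four times as much data. Twice as many registers means more operands are found in registers (a better "hit rate"), which in turn eases read/write pressure on the L1 cache.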
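The 4x figure is just count times width. A tiny sketch of the arithmetic (the helper name is hypothetical):

```python
def register_file_bytes(count: int, width_bits: int) -> int:
    """Total architectural vector register capacity in bytes."""
    return count * width_bits // 8


avx2_bytes = register_file_bytes(16, 256)     # YMM0-YMM15
avx512_bytes = register_file_bytes(32, 512)   # ZMM0-ZMM31
ratio = avx512_bytes // avx2_bytes

print(avx2_bytes, avx512_bytes, ratio)
```

That is 512 bytes against 2048 bytes of architecturally visible vector state, a factor of four.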


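3. L1 cache bandwidth doubled:

According to AnandTech, Intel has in fact improved the L1 cache: read bandwidth went from 2x256 bit/cycle to 2x512 bit/cycle, and write bandwidth from 256 bit/cycle to 512 bit/cycle.
A simple way to picture it: for a + b = c, reading a and b takes one cycle and writing c back takes one cycle, with each value 512 bits wide.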
The four load/store related units serve the writeback 32KB L1 Data cache with 8-way associativity and 4-cycle latency. In Skylake-S, this supported two 32-byte reads and one 32-byte store per cycle: in the new Skylake-SP core, this is doubled although Intel only stated that it was ‘128 bytes per cycle read and 64 bytes per cycle write’, which we would assume to mean 2x64B read and 1x64B write. This is backed by an L1D TLB, supporting 64x4KB entries per thread with 4-way associativity.

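See Intel's slide above. AIDA64 could not measure this; my guess is that AIDA64's current L1 test does not support 512-bit accesses. A 7900X at 4 GHz, summed over its 10 cores, should have 5120 GB/s of L1 read bandwidth (10 cores x 4 GHz x 128 bytes per cycle). None of the current tests come close to the L1 limit, so they still need optimizing. The Tech Report test does give some hints, though: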
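The 5120 GB/s figure can be reproduced from the quoted L1 numbers. A minimal sketch, assuming 2 x 64 B loads per cycle per core as AnandTech infers, and 1 GB = 10^9 bytes (the function name is made up for this example):

```python
def l1_read_bandwidth_gbs(cores: int, clock_ghz: float, bytes_per_cycle: int) -> float:
    """Aggregate peak L1 read bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return cores * clock_ghz * bytes_per_cycle


# 7900X at 4 GHz: 10 cores, assumed 2 x 64-byte loads per cycle per core.
skylake_x = l1_read_bandwidth_gbs(10, 4.0, 128)

# Same core count and clock with the older 2 x 32-byte loads, for comparison.
previous = l1_read_bandwidth_gbs(10, 4.0, 64)

print(skylake_x, previous)
```

This is a theoretical peak; a benchmark only approaches it if its inner loop really issues two 512-bit loads every cycle.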


The L1 is still 32 KB data plus 32 KB instruction, unchanged since Nehalem, the first generation of the Core i series.


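4. Phantom throttling:

@royalk mentioned this problem to me earlier in his preview; for now all I can offer is speculation.
At base clock, hitting the power limit probably results in phantom throttling; hitting the thermal limit should simply make the hot cores drop their clocks.
If the power limit is hit while in turbo, the clock should first keep dropping until power just equals TDP. If power is still too high once the clock is down to base, that is probably when phantom throttling kicks in.
All of the above is my own guesswork.

On phantom throttling I lean towards the author's guess: it may shut down some execution units to relieve power pressure, for example one of the two FMAs or one of two ALUs. If that is really the case, I think it is far less power-efficient than lowering the clock. By analogy, under a full multi-core load, lowering the clock a little beats switching cores off.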


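5. Memory bandwidth:
As I recall, y-cruncher was not memory-sensitive before this generation. This time the gap is about 1.5% single-threaded and 9-15% with 10 cores / 20 threads. My guess is that it comes from the L3 becoming exclusive (non-inclusive), so memory performance may now bear directly on the L2. Without an updated AIDA64 this is still hard to confirm or test; AIDA64 v5.92 still makes no mention of AVX512.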


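6. Diminishing returns of AVX512:
Amdahl's law:

The maximum speedup from parallel computing is set by the fraction that cannot be parallelized. For example, if only 90% of the work can run in parallel, then no matter how many cores you have, even if the parallel part finishes instantly, you can speed up by at most 10x!

So consider this:

1. A program spends 60 s in the part that cannot be parallelized and 160 s in the part that can, 220 s in total. That is the time needed with scalar code.
2. Now you optimize the parallel part with SSE2: the original 160 s drops to 40 s and the total to 100 s. You find that Intel's SSE has raised your code's performance by 1.2x (a 120% gain)!!!
3. One day you find that Intel offers AVX, so you optimize for AVX: the 40 s drops to 20 s and the total to 80 s. Now the gain is only 25%, where before it was 120%.
4. You optimize again with AVX512, cutting the remaining 20 s to 10 s and the total to 70 s. The gain is now only about 14% (a 12.5% cut in run time).

And that is the ideal case: 220 s -> 100 s -> 80 s -> 70 s. This example is exactly where the x264 encoder stands today.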
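The four steps can be recomputed in a few lines. This is only a sketch of the worked example, with the gain at each step measured as old total time over new total time, minus one; the variable names are made up.

```python
SERIAL = 60.0  # seconds that cannot be vectorized, from step 1

# Time spent in the vectorizable part at each stage of the example.
stages = [("scalar", 160.0), ("SSE2", 40.0), ("AVX", 20.0), ("AVX512", 10.0)]

totals = [SERIAL + parallel for _, parallel in stages]

# Gain of each stage over the previous one: (previous total / current total) - 1.
gains = [(prev / cur - 1.0) * 100.0 for prev, cur in zip(totals, totals[1:])]

for (name, _), total, gain in zip(stages[1:], totals[1:], gains):
    print(f"{name:<7} total {total:5.0f} s   gain over previous step {gain:4.0f}%")
```

Measured consistently as a speedup, the steps come out at 120%, 25% and roughly 14%; the last step shortens the run time by 12.5%. Each doubling of vector width buys less because the fixed 60 s takes up a growing share of the total.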

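For y-cruncher, the vectorized share of a single-threaded AVX256 run is 80%. Going from SSE to AVX2 earlier gave even more than a 2x gain, which was just about perfect.
With AVX512, however, the ideal case is to halve that 80%, a 40% cut in run time. In reality the single-threaded run time fell by 27% (with memory not yet the bottleneck). With 3200 MHz memory, 20 threads only fell by 17%, and with 2133 MHz memory by a mere 12%. The bottleneck is clearly in the cache levels and memory bandwidth, which is the hardest thing to optimize.
If you compare AVX2 with 3200 MHz memory against AVX512 with 2133 MHz memory, the performance gap is only 4.2%... The impact of memory frequency on Skylake X/EP is probably far bigger than one would imagine.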
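Those reductions come straight from the benchmark table in the quoted article. A short sketch that recomputes them, with run-time reduction defined as 1 - (AVX512 time / AVX2 time); the dictionary layout and names are just for illustration:

```python
# Seconds for 1 billion digits of Pi; key: (threads, memory frequency in MHz)
avx2 = {(1, 2133): 444.434, (1, 3200): 438.432, (20, 2133): 51.884, (20, 3200): 47.672}
avx512 = {(1, 2133): 325.543, (1, 3200): 319.737, (20, 2133): 45.658, (20, 3200): 39.723}


def reduction(before: float, after: float) -> float:
    """Run-time reduction in percent when going from `before` to `after`."""
    return (1.0 - after / before) * 100.0


# AVX512 versus AVX2 at the same thread count and memory speed.
cuts = {cfg: reduction(avx2[cfg], avx512[cfg]) for cfg in avx2}

# AVX2 on fast memory versus AVX512 on slow memory, both with 20 threads.
cross = reduction(avx2[(20, 3200)], avx512[(20, 2133)])

for (threads, mhz), pct in cuts.items():
    print(f"{threads:>2} threads @ {mhz} MHz: AVX512 cuts run time by {pct:.1f}%")
print(f"AVX2 @ 3200 MHz vs AVX512 @ 2133 MHz: {cross:.1f}%")
```

The single-threaded rows land near 27%, the 20-thread rows at about 17% (3200 MHz) and 12% (2133 MHz), and the cross comparison at about 4.2%.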

Below is a comparison of the 5960X and the optimized 7900X. So not every program has LINPACK's degree of parallelism.


Desktop Skylake was once meant to get AVX512; this may be one of the reasons Intel kept delaying AVX512.


2#
gtx9 posted on 2017-7-8 16:51
Last edited by gtx9 on 2017-7-8 17:00

Intel's official test, taking the difference in core count into account

LINPACK's AVX512 efficiency is around 80%






Through its integration of the Intel® Advanced Vector Extensions 512 (Intel® AVX-512), the platform generates 2X FLOPs/clock-cycle peak improvements, offering a boost to performance for demanding use.1 Intel AVX-512 combined with improvements in cores, cache and memory, delivers up to 2.27x more performance than today’s Intel Xeon processor E5 v4 (formerly codenamed Broadwell), and up to 8.2x more double precision GFLOPS/second when compared to a 4-year old Intel Xeon processor E5 family in the installed base.


Baseline config: 1-Node, 2 x Intel® Xeon® Processor E5-2699 v4 on Red Hat Enterprise Linux* 7.0 kernel 3.10.0-123 using Intel® Distribution for LINPACK Benchmark, score: 1446.4 GFLOPS/s vs. estimates based on Intel internal testing on 1-Node, 2x Intel Xeon Scalable processor (codename Skylake-SP) system. Score: 3295.57

Baseline config: 1-Node, 2 x Intel® Xeon® Processor E5-2690 based system on Red Hat Enterprise Linux* 6.0 kernel version 2.6.32-504.el6.x86_64 using Intel® Distribution for LINPACK Benchmark. Score: 366.0 GFLOPS/s vs. 1-Node, 2 x Intel® Xeon® Scalable process on Ubuntu 17.04 using MKL 2017 Update 2. Score: 3007.8
3#
junweb posted on 2017-7-8 17:03
Nice, learned something.
4#
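chungexcy (OP) posted on 2017-7-8 17:20
gtx9 posted on 2017-7-8 16:51
Intel's official test, taking the difference in core count into account

LINPACK's AVX512 efficiency is around 80%

LINPACK on AVX2 is also only a little over 80%. Seen that way, Skylake-SP's actual LINPACK clock must be about 10% lower than Broadwell-EP's.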
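One way to arrive at a figure like "10% lower" is to divide the measured LINPACK ratio by the expected gain from wider vectors and more cores. The sketch below is an assumption-laden illustration, not necessarily the working behind the post: it takes the two dual-socket scores from Intel's footnote in post 2#, 22 cores per E5-2699 v4, 28 cores per Skylake-SP socket (the thread mentions a 56-core dual-socket system), 2x FLOPs per clock, and equal LINPACK efficiency on both. The function name is hypothetical.

```python
def implied_clock_ratio(score_new: float, score_old: float,
                        cores_new: int, cores_old: int,
                        flops_per_clock_ratio: float) -> float:
    """Clock ratio (new / old) that explains the score ratio at equal efficiency."""
    expected_gain = flops_per_clock_ratio * cores_new / cores_old
    return (score_new / score_old) / expected_gain


ratio = implied_clock_ratio(
    score_new=3295.57,          # GFLOPS, 2 x Skylake-SP (Intel footnote)
    score_old=1446.4,           # GFLOPS, 2 x Xeon E5-2699 v4 (Intel footnote)
    cores_new=2 * 28,           # assumption: 28 cores per Skylake-SP socket
    cores_old=2 * 22,           # E5-2699 v4 has 22 cores per socket
    flops_per_clock_ratio=2.0,  # AVX-512 versus AVX2 peak per core
)

print(f"implied LINPACK clock ratio: {ratio:.3f}")
```

Under these assumptions the Skylake-SP system would be running LINPACK at roughly 0.9x the Broadwell clock, in line with the estimate above.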

5#
ydjj posted on 2017-7-8 17:21
Thanks for the great post
High-frequency memory finally affects performance out in the open
6#
gtx9 posted on 2017-7-8 17:34
Last edited by gtx9 on 2017-7-8 17:59

Also

SiSoftware's cache test, which already supports AVX512,

gives the same result as AIDA64





7#
OstCollector posted on 2017-7-8 17:40
I wonder how benchmarks for all the various parallel algorithms will be done from now on
8#
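royalk posted on 2017-7-8 18:33
The rest of the notes basically match my own test results. As for phantom throttling, we probably cannot know the details until vol. 2 of Intel's white paper comes out, but there really is an invisible loss of processor performance. This throttling shows up not only when running AVX512 but also in certain tests after overclocking. I suspect the FIVR has its own layer of throttle mechanism, and that no current monitoring software can catch it; you can only sense it from the CPU temperature suddenly dropping during a full-load test.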
9#
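royalk posted on 2017-7-8 18:37
OstCollector posted on 2017-7-8 17:40
I wonder how benchmarks for all the various parallel algorithms will be done from now on

For now, this generation's clocks are set too high, so it can only barely run AVX512 and performance is still limited. If my estimate is right, the 7980XE will run AVX512 at around 2.5-2.8 GHz at 0.85 V or lower. That would achieve the 1 TFLOPS of compute in Intel's marketing material and also avoid a power blow-up like the 7900X's.

Another day I will compare default voltage against undervolting at default clocks, to see whether AVX512 performance improves.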
10#
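gtx9 posted on 2017-7-8 19:30
Last edited by gtx9 on 2017-7-8 19:33
royalk posted on 2017-7-8 18:37
For now, this generation's clocks are set too high, so it can only barely run AVX512 and performance is still limited. If my estimate is right, the 7980XE will run AVX512 at ...

Intel's official test of a dual-socket, 56-core Skylake-SP system (205 W) gives 3295.57 GFLOPS (LINPACK)

That works out to roughly 1647 GFLOPS per socket at 205 W


A 165 W, 18-core 7980XE running at 2.5 GHz would come to roughly 1100 GFLOPS; 2.8 GHz seems unlikely (assuming the power limit is not lifted)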
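The post does not show how the 1100 GFLOPS estimate was reached. One way to land in the same range is peak double-precision throughput times LINPACK efficiency. The sketch below assumes 32 DP FLOPs per cycle per core (two 512-bit FMAs, as AnandTech describes for the 10-core part) and the roughly 80% efficiency quoted in post 2#; both are assumptions for the 7980XE, and the function name is made up.

```python
def linpack_estimate_gflops(cores: int, clock_ghz: float,
                            dp_flops_per_cycle: int = 32,
                            efficiency: float = 0.80) -> float:
    """Estimated LINPACK GFLOPS: peak DP throughput scaled by an assumed efficiency."""
    return cores * clock_ghz * dp_flops_per_cycle * efficiency


per_socket_sp = 3295.57 / 2                   # measured dual-socket score, halved
est_25 = linpack_estimate_gflops(18, 2.5)     # 18 cores at 2.5 GHz
est_28 = linpack_estimate_gflops(18, 2.8)     # 18 cores at 2.8 GHz

print(f"Skylake-SP per socket:  {per_socket_sp:.0f} GFLOPS")
print(f"7980XE at 2.5 GHz:      {est_25:.0f} GFLOPS")
print(f"7980XE at 2.8 GHz:      {est_28:.0f} GFLOPS")
```

With these assumptions 2.5 GHz gives about 1150 GFLOPS, close to the figure in the post and just over the 1 TFLOPS mentioned earlier in the thread.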



Baseline config: 1-Node, 2 x Intel® Xeon® Processor E5-2699 v4 on Red Hat Enterprise Linux* 7.0 kernel 3.10.0-123 using Intel® Distribution for LINPACK Benchmark, score: 1446.4 GFLOPS/s vs. estimates based on Intel internal testing on 1-Node, 2x Intel Xeon Scalable processor (codename Skylake-SP) system. Score: 3295.57


11#
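royalk posted on 2017-7-8 22:02
gtx9 posted on 2017-7-8 19:30
Intel's official test of a dual-socket, 56-core Skylake-SP system (205 W) gives 3295.57 GFLOPS (LINPACK)

That works out to roughly 1647 GFLOPS per socket at 205 W ...

Right, so there is not much left "TBA" about the 7980XE's clocks; that is basically the number and it will not go higher.
That raises a question: AMD's 16-core Ryzen at 3.4 GHz, running AVX256, might be faster than Intel.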
12#
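gtx9 posted on 2017-7-8 22:14
royalk posted on 2017-7-8 22:02
Right, so there is not much left "TBA" about the 7980XE's clocks; that is basically the number and it will not go higher.
That raises a question: AMD's 16-core Ryzen ...

There is still very little AVX software... they are probably tuning the non-AVX clocks

My prediction for the 7980XE:

Base clock 2.8-3.0 GHz (non-AVX)

All-core full load 3.4-3.5 GHz (non-AVX)

TB3.0 should manage 4 GHz+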
13#
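chungexcy (OP) posted on 2017-7-8 23:48
gtx9 posted on 2017-7-8 17:34
Also

SiSoftware's cache test, which already supports AVX512


http://techreport.com/review/32111/intel-core-i9-7900x-cpu-reviewed-part-one/4

I feel SiSoftware's cache bandwidth test still needs optimizing, and its numbers cannot be compared directly with other test software.
For data sizes within 32K*10, the peak reached more than 1.5x that of the 6950X.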

14#
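haomingci3 posted on 2017-7-9 10:40
royalk posted on 2017-7-8 22:02
Right, so there is not much left "TBA" about the 7980XE's clocks; that is basically the number and it will not go higher.
That raises a question: AMD's 16-core Ryzen ...

I think Ryzen still cannot keep up on AVX256: a 0.6 GHz clock advantage does not make up for natively having half the FMA units, and there is also a two-core deficit. On AVX1, though, it might overtake the 7980XE.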

15#
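royalk posted on 2017-7-9 11:54
gtx9 posted on 2017-7-8 22:14
There is still very little AVX software... they are probably tuning the non-AVX clocks

My prediction for the 7980XE

Non-AVX 3.5 GHz should be no problem; TB3.0 only covers two cores, so 4.5 GHz should be fine.
I reckon Intel is weighing whether to relax the TDP further...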

16#
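royalk posted on 2017-7-9 11:56
haomingci3 posted on 2017-7-9 10:40
I think Ryzen still cannot keep up on AVX256: a 0.6 GHz clock advantage does not make up for natively having half the FMA units, and there is also a two-core deficit. On AVX1, though, ...

AVX512 efficiency is worrying, but this generation has a small IPC gain in rendering and encode/decode workloads, around 7-10%. If the gap is 2.5 GHz against 3.4 GHz, Ryzen still has a chance; it should at least be able to take on the 7960X.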
17#
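chungexcy (OP) posted on 2017-7-9 12:13
royalk posted on 2017-7-9 11:56
AVX512 efficiency is worrying, but this generation has a small IPC gain in rendering and encode/decode workloads, around 7-10%. If the gap is 2.5 GHz against 3.4 GHz, Ryze ...

This review is based on AVX2: 50 million digits takes 1.795 s at default clocks.
Going by the table posted earlier, 50 million with AVX512 could drop to 1.475 s, but the gain is still limited.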

18#
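royalk posted on 2017-7-9 12:25
chungexcy posted on 2017-7-9 12:13
This review is based on AVX2: 50 million digits takes 1.795 s at default clocks.
Going by the table posted earlier, 50 million with AVX512 could ...

I will run it later. That gain is actually decent.. after all, this is not LINPACK running pure AVX..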
19#
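inSeek posted on 2017-7-9 13:44
Something similar in concept to phantom throttling, as I heard from friends in the media, has shown up on AMD GPUs: the clock stays the same but power and performance drop, apparently because, once it is triggered, some compute units are no longer dispatched any work.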
20#
royalk posted on 2017-7-9 15:03
chungexcy posted on 2017-7-9 12:13
This review is based on AVX2: 50 million digits takes 1.795 s at default clocks.
Going by the table posted earlier, 50 million with AVX512 could ...

Just ran build 9471; at 4G/2.4/3200 it took 1.577 s