PCEVA (PC Absolute Territory): exploring real PC knowledge

Subject: Questions about the Bulldozer architecture, please come in and discuss

Author: RaulMee    Posted: 2010-10-7 13:47
Subject: Questions about the Bulldozer architecture, please come in and discuss
This post was last edited by RaulMee on 2010-10-9 12:33

A few days ago I read an article introducing the Bulldozer architecture due out next year. It made several points:
1. Bulldozer will use a four-wide issue design, the same width as Intel's Core architecture. I don't know how AMD will implement this, but any four-wide core should no longer trail Intel in single-core performance by the obvious margin of the K10 era; at worst the two should be roughly on par.

2. Bulldozer will greatly strengthen the integer units, while the floating-point side will see no obvious improvement over K10. With GPUs growing ever more important, and as software is written to exploit them, a growing share of floating-point work is being handed to the GPU. As the only vendor with both a top-tier CPU and a top-tier GPU, AMD clearly sees little need for a major upgrade to the CPU's floating-point side.

3. Bulldozer's novel modular design: each module is effectively 1.5 cores, with the second core taking up only 12% of the module's area, which should guarantee an excellent performance-to-power ratio and good cost control.

4. The HT (HyperTransport) bus will advance to 4.0

5. For now, Anti-HT (reverse HyperThreading) does not look likely to appear in Bulldozer

-------------------------------- fancy divider ------------------------------------

I don't know whether these points are accurate, and there may be details in the article I no longer remember. The question I still have is:

What improvements will Bulldozer's memory controller bring?

I offer these rough notes to get the discussion going; I hope the experts here will come in and shed some light. Thanks.
Author: royalk    Posted: 2010-10-7 13:59
Some useful information, from John Fruehe of AMD's marketing department. It should be reliable; interested readers can translate it themselves:

On the caches:
The L1 cache is comprised of two components, an instruction cache and a data cache.  The instruction cache is 64K and is shared between the 2 integer cores in the module.  The data cache is 16K and there is one dedicated data cache for each core in the module.  The L2 cache holds a mixture of both data and instructions and is shared by the two integer cores.  If only one thread is active, it will have access to the entire L2 cache.  The L3 cache is shared at the die level, so on the “Interlagos” processor you will have 2 separate L3 caches, one for each die.
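The cache layout described above can be summarised in a small, purely illustrative Python model. The sizes come from the quoted post; the class and method names are my own invention, not any real API:

```python
from dataclasses import dataclass

@dataclass
class BulldozerModule:
    """Toy model of one Bulldozer module's L1 caches, per the post above."""
    l1i_kb: int = 64   # one L1 instruction cache, shared by both integer cores
    l1d_kb: int = 16   # one dedicated L1 data cache per integer core
    cores: int = 2

    def total_l1_kb(self) -> int:
        # shared L1I plus one private L1D per core
        return self.l1i_kb + self.cores * self.l1d_kb

module = BulldozerModule()
print(module.total_l1_kb())  # 64 + 2 * 16 = 96
```

The L2 (shared by the pair) and the per-die L3 sit below this and are omitted for brevity.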


On HyperThreading:
We will have cores, real physical cores, and that leads to better overall scalability. In heavily optimized systems, you aren’t fighting over execution pipelines because every thread has its own integer core. There is less system overhead involved in parsing out the threads because cores are all pretty much equal.

Take this scenario: a 4 core processor with HyperThreading will have all 4 physical cores actively handling threads. Now you need to execute a 5th thread. Do you put that thread on an already active core, reducing the processing of the thread already on that core because the two threads now have to share the same execution pipeline, or do you wait a cycle and hope that one of those cores frees up? There is a lot more decision making when you have “big cores and HT cores”, but in the AMD world, you could have 8 or 16 cores, so the 5th thread just goes onto the next available physical core. It is much easier and much more scalable.
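The placement rule Fruehe describes, where every new thread simply lands on the next idle physical core, can be sketched as a toy scheduler. This is a hypothetical illustration of the reasoning, not how any real OS scheduler is implemented:

```python
def next_free_core(busy, total_cores):
    """Place a new thread on the lowest-numbered idle physical core;
    return None if every core is occupied (the 'share or wait' dilemma)."""
    for core in range(total_cores):
        if core not in busy:
            return core
    return None

# Four threads already running on an 8-core (4-module) part:
fifth = next_free_core({0, 1, 2, 3}, total_cores=8)
print(fifth)  # 4 -- the 5th thread gets its own pipeline: no sharing, no waiting
```

On a 4-core SMT part the same call would return None, forcing exactly the share-or-wait decision described above.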


On software optimization for CMT:

For the majority of software, the OS will work in concert with the processor to manage the thread to core relationships. We are collaborating with Microsoft and the open source software community to ensure that future versions of Windows and Linux operating systems will understand how to enumerate and effectively schedule the Bulldozer core pairs. The OS will understand if your machine is set up for maximum performance or for maximum performance/watt, which takes advantage of Core Performance Boost.

However, let’s say you want to explore if you can get a performance advantage if your threads were scheduled on different modules.  The benefit you can gain really depends on how much sharing the two threads are going to do.

Since the two integer cores are completely separate and have their own execution clusters (pipelines) you get no sharing of data in the L1, and there are no specific optimizations needed at the software level. However, at the L2 cache level there could be some benefits. A shared L2 cache means that both cores have access to read the same cache lines, but obviously only one can write any cache line at any time. This means that if you have a workload with a main focus of querying data and your two threads are sharing a data set that fits in our L2, then having them execute in the same module could have some advantages. The main advantage we expect to see is an increase in the power efficiency of the cores that are idle. The more idle other cores are, the better chance the busy cores will have to boost.

However, there is another consideration to this which is how available other cores are.  You need to weigh the benefits of data sharing with the benefit of starting the thread on the next available core. Stacking up threads to execute in proximity means that a thread might be waiting in line while an open core is available for immediate execution.    If your multi-threaded application isn’t optimized to target the L2 (or possibly the L3 cache), or you have distinctly separate applications to run, and you don’t need to conserve power, then you’ll likely get better performance by having them scheduled on separate modules.   So it is important to weigh both options to determine the best execution.
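On Linux, the pair-or-spread choice discussed above could be experimented with via CPU affinity. The helper below assumes, hypothetically, that the OS numbers the two cores of module k as 2k and 2k+1; the real topology must be checked (e.g. in /proc/cpuinfo) before pinning anything:

```python
def module_cores(module_index, cores_per_module=2):
    """Return the set of logical CPUs belonging to one module, under the
    assumed numbering where module k owns cores 2k and 2k+1."""
    first = module_index * cores_per_module
    return set(range(first, first + cores_per_module))

print(module_cores(1))  # {2, 3}
```

Two cooperating threads could then be pinned together with `os.sched_setaffinity(pid, module_cores(1))`, or independent jobs spread across `module_cores(0)`, `module_cores(1)`, and so on, to test which scheduling wins for a given workload.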


On the difference between running two threads on two separate cores versus inside one CMT module:
Without getting too specific around actual scaling across cores on the processor, let me share with you what was in the Hot Chips presentation. Compared to CMP (chip multiprocessing, which is, in simplistic terms, building a multicore chip with each core having its own dedicated resources) two integer cores in a Bulldozer module would deliver roughly 80% of the throughput. But, because they have shared resources, they deliver that throughput at low power and low cost. Using CMP has some drawbacks, including more heat and more die space. The heat can limit performance in addition to consuming more power.
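Putting the numbers in this thread together gives a rough feel for the trade-off. Note the 80% figure comes from this post and the ~12% extra area for the second integer core from the opening post, so both are claims rather than measurements:

```python
single_core = 1.0
cmp_throughput = 2 * single_core       # CMP: two fully independent cores
cmt_throughput = 0.8 * cmp_throughput  # CMT: ~80% of that, i.e. 1.6

cmp_area = 2.0    # CMP duplicates every resource
cmt_area = 1.12   # CMT's second core adds only ~12% module area (claimed)

print(cmp_throughput / cmp_area)  # 1.0 core-equivalents per unit of area
print(cmt_throughput / cmt_area)  # ~1.43: the shared-resource win, if the claims hold
```

If both claims hold, the module trades 20% of CMP throughput for nearly half the silicon, which is the whole argument for CMT.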


On power consumption:
Even though you’ll see processors with 33% more cores and larger caches than the previous generation, we’ll still be fitting them into the same power and thermal ranges that you see with our existing 12-core processors.


On Turbo CORE:
Yes. There will be a Turbo CORE feature for “Bulldozer”, but there will be some improvements from what you see in “Thuban” (our 6-core AMD Phenom™ processor). There are some enhancements to give it more “turbo”. This will be the first introduction of the Turbo CORE technology in the server processors.  We expect that this will translate into a big boost in performance when using single threaded applications, and there should be some interesting capabilities for heavier workloads as well.  We’re pretty excited about how this will be implemented with “Bulldozer”, but the specifics of how this is implemented and the expected performance gains will not be disclosed until launch.

Author: skoe    Posted: 2010-10-7 16:53
This post was last edited by skoe on 2010-10-8 10:40

1. Bulldozer's four-wide issue takes the form 2 ALU + 2 AGU, so its single-core performance is unlikely to catch up with the i7.
2. With four or more threads, its integer performance may well beat SNB.
4. The HT bus Bulldozer actually uses is 3.1.




Welcome to PCEVA (PC Absolute Territory): exploring real PC knowledge (https://bbs.pceva.com.cn/) Powered by Discuz! X3.2