
Some questions about the Bulldozer architecture, experts please come in and discuss

1# royalk, posted 2010-10-7 13:59
Some useful information, from John Fruehe of AMD's marketing department. It should be reliable; those interested can translate it for themselves:

On the caches:
The L1 cache is comprised of two components, an instruction cache and a data cache.  The instruction cache is 64K and is shared between the 2 integer cores in the module.  The data cache is 16K and there is one dedicated data cache for each core in the module.  The L2 cache holds a mixture of both data and instructions and is shared by the two integer cores.  If only one thread is active, it will have access to the entire L2 cache.  The L3 cache is shared at the die level, so on the “Interlagos” processor you will have 2 separate L3 caches, one for each die.
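Not from Fruehe's reply, just a sketch for checking this layout yourself: on Linux the kernel exports each CPU's caches under /sys/devices/system/cpu/cpuN/cache/, so on a Bulldozer box the shared_cpu_list files should show the 64K L1 instruction cache and the L2 shared between a module's two cores, with the 16K L1 data cache private to each core.

```python
import glob, os

def read(path):
    with open(path) as f:
        return f.read().strip()

# List cpu0's caches; shared_cpu_list shows which CPUs share each one.
for index in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
    level  = read(os.path.join(index, "level"))
    ctype  = read(os.path.join(index, "type"))   # Data / Instruction / Unified
    size   = read(os.path.join(index, "size"))
    shared = read(os.path.join(index, "shared_cpu_list"))
    print("L%s %-11s %6s  shared with CPUs: %s" % (level, ctype, size, shared))
```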


On HyperThreading:
We will have cores, real physical cores, and that leads to better overall scalability. In heavily optimized systems, you aren’t fighting over execution pipelines because every thread has its own integer core. There is less system overhead involved in parsing out the threads because cores are all pretty much equal.

Take this scenario: a 4 core processor with HyperThreading will have all 4 physical cores actively handling threads. Now you need to execute a 5th thread. Do you put that thread on an already active core, reducing the processing of the thread already on that core because the two threads now have to share the same execution pipeline, or do you wait a cycle and hope that one of those cores frees up? There is a lot more decision making when you have "big cores and HT cores", but in the AMD world, you could have 8 or 16 cores, so the 5th thread just goes onto the next available physical core. It is much easier and much more scalable.
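As an illustration of that "next available physical core" idea, here is a naive dispatcher sketch. It is not AMD's or the OS scheduler's actual logic; the 8-core count and the Linux-only os.sched_setaffinity call are assumptions:

```python
import os, queue, threading

NUM_CORES = 8                    # assumed: an 8-core (4-module) Bulldozer
free_cores = queue.Queue()
for c in range(NUM_CORES):
    free_cores.put(c)

def run_on_next_free_core(work):
    core = free_cores.get()      # the 5th thread simply waits for a free core
    def body():
        os.sched_setaffinity(0, {core})   # pin this thread to that core (Linux)
        try:
            work()
        finally:
            free_cores.put(core)          # hand the core back when done
    threading.Thread(target=body).start()
```

With all-equal physical cores the dispatch rule stays this simple; with a mix of full and HT cores, the placement question Fruehe describes comes back.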


On software optimization for CMT:

For the majority of software, the OS will work in concert with the processor to manage the thread to core relationships. We are collaborating with Microsoft and the open source software community to ensure that future versions of Windows and Linux operating systems will understand how to enumerate and effectively schedule the Bulldozer core pairs. The OS will understand if your machine is set up for maximum performance or for maximum performance/watt, which takes advantage of Core Performance Boost.

However, let’s say you want to explore if you can get a performance advantage if your threads were scheduled on different modules.  The benefit you can gain really depends on how much sharing the two threads are going to do.

Since the two integer cores are completely separate and have their own execution clusters (pipelines), you get no sharing of data in the L1, and no specific optimizations are needed at the software level. However, at the L2 cache level there could be some benefits. A shared L2 cache means that both cores have access to read the same cache lines, but obviously only one can write any cache line at any time. This means that if you have a workload with a main focus of querying data and your two threads are sharing a data set that fits in our L2, then having them execute in the same module could have some advantages. The main advantage we expect to see is an increase in the power efficiency of the cores that are idle. The more idle other cores are, the better chance the busy cores will have to boost.

However, there is another consideration to this which is how available other cores are.  You need to weigh the benefits of data sharing with the benefit of starting the thread on the next available core. Stacking up threads to execute in proximity means that a thread might be waiting in line while an open core is available for immediate execution.    If your multi-threaded application isn’t optimized to target the L2 (or possibly the L3 cache), or you have distinctly separate applications to run, and you don’t need to conserve power, then you’ll likely get better performance by having them scheduled on separate modules.   So it is important to weigh both options to determine the best execution.
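If you want to experiment with this trade-off yourself, the sketch below pins a pair of threads either onto the two cores of one module (to share its L2) or onto cores in two different modules. It assumes Linux, and it assumes the OS numbers a module's two cores consecutively, i.e. (0,1), (2,3), ...; verify that against your actual topology first:

```python
import os, threading

def pin_and_run(core, work):
    def body():
        os.sched_setaffinity(0, {core})   # pin the calling thread (Linux)
        work()
    t = threading.Thread(target=body)
    t.start()
    return t

def run_pair(work_a, work_b, share_l2=True, module=0):
    if share_l2:
        # both cores of one module: the two threads share that module's L2
        cores = (2 * module, 2 * module + 1)
    else:
        # first core of two adjacent modules: private L2s, more boost headroom
        cores = (2 * module, 2 * (module + 1))
    return [pin_and_run(c, w) for c, w in zip(cores, (work_a, work_b))]
```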


On the difference between running two threads on two separate cores versus within one CMT module:
Without getting too specific around actual scaling across cores on the processor, let me share with you what was in the Hot Chips presentation. Compared to CMP (chip multiprocessing, which is, in simplistic terms, building a multicore chip with each core having its own dedicated resources), two integer cores in a Bulldozer module would deliver roughly 80% of the throughput. But, because they have shared resources, they deliver that throughput at low power and low cost. Using CMP has some drawbacks, including more heat and more die space. The heat can limit performance in addition to consuming more power.
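A back-of-envelope reading of that 80% figure: the throughput number is from the quote above, while the ~12% extra module die area over a single standalone core is the figure AMD was citing around the Hot Chips talk, used here as an assumption:

```python
cmp_throughput = 2.00                   # two full CMP cores = 2x one core
cmt_throughput = 0.80 * cmp_throughput  # one module ~ 80% of that = 1.6x
cmp_area = 2.00                         # two full cores' worth of die area
cmt_area = 1.12                         # assumed: module ~ 12% bigger than one core

print("CMP throughput per unit area: %.2f" % (cmp_throughput / cmp_area))  # 1.00
print("CMT throughput per unit area: %.2f" % (cmt_throughput / cmt_area))  # ~1.43
```

On those assumptions the module trades 20% of CMP's throughput for roughly 40% better throughput per unit of die area, which is the cost/power argument being made.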


On power consumption:
Even though you’ll see processors with 33% more cores and larger caches than the previous generation, we’ll still be fitting them into the same power and thermal ranges that you see with our existing 12-core processors.


On Turbo CORE:
Yes. There will be a Turbo CORE feature for “Bulldozer”, but there will be some improvements from what you see in “Thuban” (our 6-core AMD Phenom™ processor). There are some enhancements to give it more “turbo”. This will be the first introduction of the Turbo CORE technology in the server processors.  We expect that this will translate into a big boost in performance when using single threaded applications, and there should be some interesting capabilities for heavier workloads as well.  We’re pretty excited about how this will be implemented with “Bulldozer”, but the specifics of how this is implemented and the expected performance gains will not be disclosed until launch.