Barcelona Architecture: AMD on the Counterattack
by Anand Lal Shimpi on March 1, 2007 12:05 AM EST- Posted in
- CPUs
Getting Spendy with Transistors - L3 cache
AMD lost the cache race to Intel long ago, but that's more of a result of manufacturing capacity than anything else. AMD knew it could not compete with Intel's ability to churn out more transistors on smaller processes faster, so it did the next best thing and integrated a memory controller. With the K8's on-die memory controller, AMD reduced the need for larger caches, which is why even current Athlon 64 X2s only have a 512KB L2 cache per core - a figure that Intel introduced back in 2002 with its Northwood core.
These days two Core 2 cores share up to 4MB of L2 cache, while the fastest offerings from AMD weigh in at half that. The gap will continue to widen with Barcelona, as each of its four cores will only have a 512KB L2 cache. While a quad-core Barcelona chip will have 2MB of total L2 cache for all four cores, a quad-core Kentsfield currently has 8MB of L2 cache for all four cores. By the end of this year, Intel's Penryn is expected to have 12MB of L2 cache for all of its cores.
In order to keep die sizes manageable, AMD constructed its quad-core Barcelona out of four cores each with a 128KB L1 and 512KB L2, much like most mainstream K8 based products today. However, the era of multithreaded applications demands that multi-core CPUs should have some common pool of high speed memory to keep them running at peak efficiency.
With four cores sharing a single die, AMD didn't want to complicate its design by introducing a large unified L2 cache. Instead, it took the K8 cache hierarchy and added a third level of cache to the mix - shared among all four cores. At 65nm, a quad-core Barcelona will have a 2MB L3 cache that is shared by all four cores.
The hierarchy in Barcelona works like this: the L2 caches are filled with victims from the L1 cache. When a cache gets full, data that was not recently used is evicted to make room for new data that the cache controller determines is good to keep in the cache. In a victim cache structure, the evicted data is placed in a storage area known as a victim cache instead of being removed from cache all together. If the data should become useful again, the cache controller simply has to fetch it from the victim cache rather than much slower main memory; victims from Barcelona's L1 are kicked out to the L2 cache.
The new L3 cache, acts as a victim for the L2 cache. So when the small L2 cache fills up, evicted data is sent to the larger L3 cache where it is kept until space is needed. The algorithms that govern the L3 cache's operation are designed to accommodate data that is likely to be needed by multiple cores. If the CPU fetches a bit of code, a copy is left in the L3 cache since the code is likely to be shared among the four cores. Pure data load requests however go through a separate process. The cache controller looks at history and if the data has been shared before, a copy will be left in the L3 cache; otherwise it will be invalidated.
Associativity hasn't been changed for the L1 and L2 caches; they are still 2-way and 16-way set associative, respectively. However, the new L3 cache is 32-way set associative. It has been designed to increase the hit rate of a relatively small cache compared to its competition.
AMD lost the cache race to Intel long ago, but that's more of a result of manufacturing capacity than anything else. AMD knew it could not compete with Intel's ability to churn out more transistors on smaller processes faster, so it did the next best thing and integrated a memory controller. With the K8's on-die memory controller, AMD reduced the need for larger caches, which is why even current Athlon 64 X2s only have a 512KB L2 cache per core - a figure that Intel introduced back in 2002 with its Northwood core.
These days two Core 2 cores share up to 4MB of L2 cache, while the fastest offerings from AMD weigh in at half that. The gap will continue to widen with Barcelona, as each of its four cores will only have a 512KB L2 cache. While a quad-core Barcelona chip will have 2MB of total L2 cache for all four cores, a quad-core Kentsfield currently has 8MB of L2 cache for all four cores. By the end of this year, Intel's Penryn is expected to have 12MB of L2 cache for all of its cores.
In order to keep die sizes manageable, AMD constructed its quad-core Barcelona out of four cores each with a 128KB L1 and 512KB L2, much like most mainstream K8 based products today. However, the era of multithreaded applications demands that multi-core CPUs should have some common pool of high speed memory to keep them running at peak efficiency.
With four cores sharing a single die, AMD didn't want to complicate its design by introducing a large unified L2 cache. Instead, it took the K8 cache hierarchy and added a third level of cache to the mix - shared among all four cores. At 65nm, a quad-core Barcelona will have a 2MB L3 cache that is shared by all four cores.
The hierarchy in Barcelona works like this: the L2 caches are filled with victims from the L1 cache. When a cache gets full, data that was not recently used is evicted to make room for new data that the cache controller determines is good to keep in the cache. In a victim cache structure, the evicted data is placed in a storage area known as a victim cache instead of being removed from cache all together. If the data should become useful again, the cache controller simply has to fetch it from the victim cache rather than much slower main memory; victims from Barcelona's L1 are kicked out to the L2 cache.
The new L3 cache, acts as a victim for the L2 cache. So when the small L2 cache fills up, evicted data is sent to the larger L3 cache where it is kept until space is needed. The algorithms that govern the L3 cache's operation are designed to accommodate data that is likely to be needed by multiple cores. If the CPU fetches a bit of code, a copy is left in the L3 cache since the code is likely to be shared among the four cores. Pure data load requests however go through a separate process. The cache controller looks at history and if the data has been shared before, a copy will be left in the L3 cache; otherwise it will be invalidated.
Associativity hasn't been changed for the L1 and L2 caches; they are still 2-way and 16-way set associative, respectively. However, the new L3 cache is 32-way set associative. It has been designed to increase the hit rate of a relatively small cache compared to its competition.
83 Comments
View All Comments
agaelebe - Friday, March 2, 2007 - link
Wow! A lot of dicussion in here.And, by the way, very interesting article.
I'm a software engineer from Brazil and I'm planning to change my PC this year.
I've bem using AMD processors since the K6.
Today I've a XP Mobile 2500+(@2.2ghz), 1gb ram, 200gb and an AGP 6600GT
My PC is not very slow, but I'm thinking in going dual core to speed things up(office applications, web development and some games).
I can run some of the newest games, but not in high graphics.
I expect that my PC can run C&C 3 (Already run the demo in 1024 medium, but have some craches although it's not running it slow)
So, today I'm thinking in 3 options:
1) Stay with this computer and wait until AMD launchs it's new architecture (I pretend to go with an average price Kuma)
2) Go with Intel Core 2 Duo (e6300 or e6400). They're not expensive and for games I can easily make an overclock and gain more performance.
3) Buy a good AM2 board and a cheap Atlhon X2 (3600) and wait new AMD processors and then change only the processor.
Here in Brazil the taxes are to high, so I'm planning in buying a PC with these specs:
- CORE 2 Duo e6300/6400 or X2 3600/3800
- mid-tier motherboard (
- 2 x 1gb DDR 800 4-4-4-12
- 2 x 250 gb
- X1950pro 256 or 512
- 500watts power
So the prices are below:
e6300 box US$ 300 (same price for a X2 4200+ box)
x23800 box US$ 220
motherboard: US$ 220
ram: US$ 400
video: US$ 450
DVD: US$ 70
case: US$ 150
HDs : US$ 250
Power: us$ 180
So I plan to spent about 2000 dollars (Sadly, I can buy this same PC in US for the half of the price).
My new PC should spent not to much power so I can leave it turned onall day long(max 150watts on iddle without monitor), otherwise I'll keep my old computer turned on just for downloding stuff)
So, If someone has an opinion, I'd like to "hear" it. You can give another options to, or make some comments about the specs I'm choosing now.
I had Pentium 75 and after that only AMD CPUs... Should know I surrender to the Core 2 Duo or believe that AMD can really beat it until the end of 2008?
And thanks for the cooperation and patience.
Zebo - Saturday, March 3, 2007 - link
Athlon 64 AM2's arnt exactly slow so if you're an AMD fan just get one..like a 3800+ or 3600+ and overclock it. It will be at least 4x faster than what you have now and accept K8L Agena core later. It will be cheaper than C2D by about $50 USD and You'll also pay cheap for a GeForce 6100 Motherboard which is only $50 USD. Overall expect the the AM2 system to be about $100 USD cheaper.Keep in mind that C2D is 20% faster clock for clock in most apps so it's not exactly a quantum leap here getting a C2D.. Gap gets a lot larger when overclocking since C2D's overclcok higher like 3.2Ghz is common on air vs. only 2.8Ghz for AM2, so, at the end of the day a C2D setup is able to be about 40% faster over most benchmarks. That is getting significant and why enthusiasts are buying C2D's.
agaelebe - Friday, March 2, 2007 - link
And,as always, sorry with the errors and not so good writing...Kiijibari - Thursday, March 1, 2007 - link
Hi,never heard of of that before, does anybody know what it is ?
So far I see 2 pad areas at the DIE photo, therefore I assume that it would be also 2 interfaces, e.g. x8 PCIe like Sun uses ?
bb
Kiijibari
mino - Friday, March 2, 2007 - link
It should be some management/coodrination stuff (can-t remember the name of that bus).Every northbridge and CPU has that.
davecason - Thursday, March 1, 2007 - link
Anand,Great article! I know it took a lot of time and I wanted you to know I really appreciate your effort. It is the kind of article that keeps me coming back to your site.
-Dave
yyrkoon - Thursday, March 1, 2007 - link
Page 5, paragraph 4 'pretty significantly'. Well is it, or is it not it ?
http://www.wikihow.com/Avoid-Colloquial-%28Informa...">http://www.wikihow.com/Avoid-Colloquial-%28Informa...
Aside from my gripe concerning writing style, good article :)
trisweb2 - Friday, March 16, 2007 - link
Usually we criticize writing style based on a whole experience... obviously Anand is one of the best technical review writers on the Internet; if you bother to read his articles more fully perhaps you'd realize that. The colloquial writing sometimes brings it to a more personal level that a reader can better relate to and understand -- it works especially well in this case, where it's a future design, we really don't know how it's going to perform. That he can guess and say "pretty significantly" tells me he understands the uncertainty of the situation, and the language communicates that fact perfectly well. It would be more confusing if he said it would impact performance "significantly" as you want him to, as that would imply that he was more certain than he might actually have been.Masters are allowed to bend the rules, and Anand is one, so lay off.
yyrkoon - Thursday, March 1, 2007 - link
*Is it, or is it not*/me hangs head in shame
baronzemo78 - Thursday, March 1, 2007 - link
Any rough guess as to how Barcelona will compete with Core2 in gaming? Many articles have shown how Core2 gets you a slight FPS boost in games that aren't graphics card limited. I'm curious how Barcelona will fit in with the overall picture of DX10 cards like G80 and R600.