Barcelona Architecture: AMD on the Counterattack
by Anand Lal Shimpi on March 1, 2007 12:05 AM EST- Posted in
- CPUs
New Prefetcher
Prefetching is done in many areas of the system and by many different components. When NVIDIA introduced its nForce2 chipset, it stressed the ability of its intelligent prefetcher to make use of a very wide, at the time, 128-bit memory bus. More recently, when Intel introduced its Core 2 processor family it stressed the importance of its three prefetchers per core in drastically reducing perceived memory latency.
AMD's K8 core had two prefetchers per core - one instruction and one data. The Barcelona core still retains the same number of prefetchers, but improves on them. The biggest change is that the data prefetcher now brings data directly into the L1 data cache, as opposed to the L2 cache in the K8. AMD looked at the accuracy of its core prefetchers and realized that they were doing quite well, so it only made sense to prefetch into a low latency L1 and avoid polluting the L2 cache. AMD has also increased the flexibility of its L1 instruction cache prefetcher to handle two outstanding requests to any address.
At first glance it looks like Intel's prefetchers in Core 2 are greater, at least in quantity, than what AMD has planned even for Barcelona. Remember that Intel's Core 2 processor features two data and one instruction prefetcher per core, plus an additional two L2 cache prefetchers, all of which are well managed as to not eat into "demand" bandwidth. At the same time, we must keep in mind that Intel needs these prefetchers to help mask its longer trip to main memory. From a CPU perspective, the advantage here is for Intel, but as a platform the true winner is tough to determine.
Each Barcelona core gets its own set of data and instruction prefetchers, but the major improvement is that there's a new prefetcher in town - a DRAM prefetcher. Residing within the memory controller where AMD previously never had any such logic, the new DRAM prefetcher takes a look at overall memory requests and attempts to pull data it thinks will be used in the future. As this prefetcher has to contend with the needs of four separate cores, it really helps the entire chip improve performance and can do a good job of spotting trends that would positively impact all cores. The DRAM prefetcher doesn't pull data into the CPU's L2 or L3 caches either; instead it features its own buffer to avoid polluting the caches. The buffer is approximately 20 - 30 cache lines in size and happens to be the same buffer that is used for Barcelona's write bursting we mentioned on the previous page.
Prefetching is done in many areas of the system and by many different components. When NVIDIA introduced its nForce2 chipset, it stressed the ability of its intelligent prefetcher to make use of a very wide, at the time, 128-bit memory bus. More recently, when Intel introduced its Core 2 processor family it stressed the importance of its three prefetchers per core in drastically reducing perceived memory latency.
AMD's K8 core had two prefetchers per core - one instruction and one data. The Barcelona core still retains the same number of prefetchers, but improves on them. The biggest change is that the data prefetcher now brings data directly into the L1 data cache, as opposed to the L2 cache in the K8. AMD looked at the accuracy of its core prefetchers and realized that they were doing quite well, so it only made sense to prefetch into a low latency L1 and avoid polluting the L2 cache. AMD has also increased the flexibility of its L1 instruction cache prefetcher to handle two outstanding requests to any address.
At first glance it looks like Intel's prefetchers in Core 2 are greater, at least in quantity, than what AMD has planned even for Barcelona. Remember that Intel's Core 2 processor features two data and one instruction prefetcher per core, plus an additional two L2 cache prefetchers, all of which are well managed as to not eat into "demand" bandwidth. At the same time, we must keep in mind that Intel needs these prefetchers to help mask its longer trip to main memory. From a CPU perspective, the advantage here is for Intel, but as a platform the true winner is tough to determine.
Each Barcelona core gets its own set of data and instruction prefetchers, but the major improvement is that there's a new prefetcher in town - a DRAM prefetcher. Residing within the memory controller where AMD previously never had any such logic, the new DRAM prefetcher takes a look at overall memory requests and attempts to pull data it thinks will be used in the future. As this prefetcher has to contend with the needs of four separate cores, it really helps the entire chip improve performance and can do a good job of spotting trends that would positively impact all cores. The DRAM prefetcher doesn't pull data into the CPU's L2 or L3 caches either; instead it features its own buffer to avoid polluting the caches. The buffer is approximately 20 - 30 cache lines in size and happens to be the same buffer that is used for Barcelona's write bursting we mentioned on the previous page.
83 Comments
View All Comments
agaelebe - Friday, March 2, 2007 - link
Wow! A lot of dicussion in here.And, by the way, very interesting article.
I'm a software engineer from Brazil and I'm planning to change my PC this year.
I've bem using AMD processors since the K6.
Today I've a XP Mobile 2500+(@2.2ghz), 1gb ram, 200gb and an AGP 6600GT
My PC is not very slow, but I'm thinking in going dual core to speed things up(office applications, web development and some games).
I can run some of the newest games, but not in high graphics.
I expect that my PC can run C&C 3 (Already run the demo in 1024 medium, but have some craches although it's not running it slow)
So, today I'm thinking in 3 options:
1) Stay with this computer and wait until AMD launchs it's new architecture (I pretend to go with an average price Kuma)
2) Go with Intel Core 2 Duo (e6300 or e6400). They're not expensive and for games I can easily make an overclock and gain more performance.
3) Buy a good AM2 board and a cheap Atlhon X2 (3600) and wait new AMD processors and then change only the processor.
Here in Brazil the taxes are to high, so I'm planning in buying a PC with these specs:
- CORE 2 Duo e6300/6400 or X2 3600/3800
- mid-tier motherboard (
- 2 x 1gb DDR 800 4-4-4-12
- 2 x 250 gb
- X1950pro 256 or 512
- 500watts power
So the prices are below:
e6300 box US$ 300 (same price for a X2 4200+ box)
x23800 box US$ 220
motherboard: US$ 220
ram: US$ 400
video: US$ 450
DVD: US$ 70
case: US$ 150
HDs : US$ 250
Power: us$ 180
So I plan to spent about 2000 dollars (Sadly, I can buy this same PC in US for the half of the price).
My new PC should spent not to much power so I can leave it turned onall day long(max 150watts on iddle without monitor), otherwise I'll keep my old computer turned on just for downloding stuff)
So, If someone has an opinion, I'd like to "hear" it. You can give another options to, or make some comments about the specs I'm choosing now.
I had Pentium 75 and after that only AMD CPUs... Should know I surrender to the Core 2 Duo or believe that AMD can really beat it until the end of 2008?
And thanks for the cooperation and patience.
Zebo - Saturday, March 3, 2007 - link
Athlon 64 AM2's arnt exactly slow so if you're an AMD fan just get one..like a 3800+ or 3600+ and overclock it. It will be at least 4x faster than what you have now and accept K8L Agena core later. It will be cheaper than C2D by about $50 USD and You'll also pay cheap for a GeForce 6100 Motherboard which is only $50 USD. Overall expect the the AM2 system to be about $100 USD cheaper.Keep in mind that C2D is 20% faster clock for clock in most apps so it's not exactly a quantum leap here getting a C2D.. Gap gets a lot larger when overclocking since C2D's overclcok higher like 3.2Ghz is common on air vs. only 2.8Ghz for AM2, so, at the end of the day a C2D setup is able to be about 40% faster over most benchmarks. That is getting significant and why enthusiasts are buying C2D's.
agaelebe - Friday, March 2, 2007 - link
And,as always, sorry with the errors and not so good writing...Kiijibari - Thursday, March 1, 2007 - link
Hi,never heard of of that before, does anybody know what it is ?
So far I see 2 pad areas at the DIE photo, therefore I assume that it would be also 2 interfaces, e.g. x8 PCIe like Sun uses ?
bb
Kiijibari
mino - Friday, March 2, 2007 - link
It should be some management/coodrination stuff (can-t remember the name of that bus).Every northbridge and CPU has that.
davecason - Thursday, March 1, 2007 - link
Anand,Great article! I know it took a lot of time and I wanted you to know I really appreciate your effort. It is the kind of article that keeps me coming back to your site.
-Dave
yyrkoon - Thursday, March 1, 2007 - link
Page 5, paragraph 4 'pretty significantly'. Well is it, or is it not it ?
http://www.wikihow.com/Avoid-Colloquial-%28Informa...">http://www.wikihow.com/Avoid-Colloquial-%28Informa...
Aside from my gripe concerning writing style, good article :)
trisweb2 - Friday, March 16, 2007 - link
Usually we criticize writing style based on a whole experience... obviously Anand is one of the best technical review writers on the Internet; if you bother to read his articles more fully perhaps you'd realize that. The colloquial writing sometimes brings it to a more personal level that a reader can better relate to and understand -- it works especially well in this case, where it's a future design, we really don't know how it's going to perform. That he can guess and say "pretty significantly" tells me he understands the uncertainty of the situation, and the language communicates that fact perfectly well. It would be more confusing if he said it would impact performance "significantly" as you want him to, as that would imply that he was more certain than he might actually have been.Masters are allowed to bend the rules, and Anand is one, so lay off.
yyrkoon - Thursday, March 1, 2007 - link
*Is it, or is it not*/me hangs head in shame
baronzemo78 - Thursday, March 1, 2007 - link
Any rough guess as to how Barcelona will compete with Core2 in gaming? Many articles have shown how Core2 gets you a slight FPS boost in games that aren't graphics card limited. I'm curious how Barcelona will fit in with the overall picture of DX10 cards like G80 and R600.