http://anandtech.com/memory/showdoc.aspx?i=2223&p=6
Memory Latencies Explained
One big question that remains is latency. All the bandwidth in the world will not help if you have to wait forever to get the needed data. It is important to note, however, that higher latencies can be compensated for. The Pentium 4, for example, has improved buffering, sophisticated prefetch logic, and the ability to have many outstanding memory requests. It loves bandwidth, and performance has been helped substantially by increasing the bus speeds, even with higher memory latencies. Graphics chips also tend to be more forgiving of higher latencies. Any design can be modified to work with higher or lower latencies, of course; it is but one facet of the overall goal which needs to be addressed. Still, the question remains, how does memory latency relate to timings and bandwidth?
The simple answer is that it is directly related to the memory timings, but you cannot compare timings directly. The reason for this is that the memory timings are relative to the base clock speed of the RAM - they are the number of memory clock cycles that each operation requires. For DDR memory, this means that the cycle time is calculated from the base clock, which runs at one half of the data transfer rate. PC3200 DDR memory has a 64-bit bus that transfers up to 3200 MB/s. Converting that to a clock speed means converting bytes to bits (multiply by eight) and then dividing by the bus width, which gives the effective clock speed; the base clock speed is half the effective clock speed.
PC3200:
3200 MB/s * 8 bits/byte = 25600 Mb/s
25600 Mb/s / 64 bits = 400 MHz effective clock speed
400 MHz / 2 = 200 MHz base clock speed
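The same arithmetic can be written as a short Python sketch. The function name base_clock_mhz and its parameters are made up for this illustration, and the defaults assume a 64-bit bus and double data rate as in the PC3200 example; other memory types would need different values.

```python
def base_clock_mhz(bandwidth_mb_s, bus_width_bits=64, transfers_per_clock=2):
    """Convert peak bandwidth (MB/s) into the base memory clock (MHz).

    transfers_per_clock is 2 for DDR; 4 or 8 would model quad or
    octal data rates (an assumption for illustration).
    """
    megabits_per_second = bandwidth_mb_s * 8               # bytes -> bits
    effective_mhz = megabits_per_second / bus_width_bits   # effective clock speed
    return effective_mhz / transfers_per_clock             # base clock speed


# PC3200: 3200 MB/s over a 64-bit DDR bus
print(base_clock_mhz(3200))  # prints 200.0
```

Passing a different transfers_per_clock value is how a quad or octal data rate memory type would be brought back to its base clock for comparison.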
Other memory types may use quad or even octal data rates, but if we convert those into the base clock speed, we can compare latencies. Where timings are listed in clock cycles, latency is listed in nanoseconds (ns). A CL of 2.0 sounds better than a CL of 5.0, but depending on the memory clock, it may actually be closer than we would at first expect. By converting all of the timings into nanoseconds, we can compare performance. We will save detailed comparisons for the next installment, but as an example, suppose we have two memory types - one with a CL of 4.0 and a base clock speed of 333 MHz, and the second with a CL of 2.5 and a base clock speed of 200 MHz.
-----------------------------------------------------
CL     Clock Speed    Cycle Time    Real Latency
2.5    200 MHz        5.0 ns        12.5 ns
4.0    333 MHz        3.0 ns        12.0 ns
-----------------------------------------------------
In this specific example, we see that even with a CL that's 60% higher, the real latency can actually end up being slightly lower (12.0 ns versus 12.5 ns), because each clock cycle is shorter. This is something that we will examine further in the next article of this series.
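For readers who want to run the conversion themselves, here is a minimal Python sketch that reproduces the table's numbers. The helper names cycle_time_ns and real_latency_ns are hypothetical, and the sketch looks only at CAS latency, ignoring every other timing.

```python
def cycle_time_ns(base_clock_mhz):
    """Length of one memory clock cycle in nanoseconds."""
    return 1000.0 / base_clock_mhz


def real_latency_ns(cas_latency, base_clock_mhz):
    """CAS latency converted from clock cycles to nanoseconds."""
    return cas_latency * cycle_time_ns(base_clock_mhz)


# The two memory types from the example: (CL, base clock in MHz)
for cl, mhz in [(2.5, 200), (4.0, 333)]:
    print(f"CL {cl} @ {mhz} MHz: "
          f"{cycle_time_ns(mhz):.1f} ns cycle, "
          f"{real_latency_ns(cl, mhz):.1f} ns latency")
```

The 333 MHz figures are rounded to one decimal place, as in the table; the exact cycle time is a hair over 3.0 ns.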
An Anecdote
Getting the whole picture of how memory performance impacts system performance is still a very difficult task. If all this talk of timings and latencies has not helped, let us provide another comparison. Think of the CPU as a cook at a restaurant, busily working to keep up with customer demand. There is a process that occurs. Waiters or cashiers take the orders and send them to the cook, the cook prepares the food, and the final result is delivered to the customer. Sounds simple enough, right? Let's look at some of the details.
When an order for a dish comes in, certain common items (e.g. fries, rice, soup, salads, etc.) may already be prepared, so delivering them to the customer occurs rapidly. We can think of this as the processor finding something in the L1 cache. This is great when it occurs, but it only occurs for a very limited number of items. Most of the time, the cook will need to begin preparing the order, so he will get the items from the cupboard, freezer and refrigerator and begin cooking them. This time, the ingredients are in the L2/L3 cache. So far so good, but where does RAM come into play?
As items are pulled from the fridge, freezer, etc., the restaurant will need to restock them. The supplies have to be ordered from headquarters or whomever the restaurant uses. This restocking is akin to fetching data from system RAM (or maybe even the hard drive, but we'll leave that out of the analogy for now). If the restaurant can anticipate needs properly, it can order the supplies in advance. Sometimes, though, supplies run low - or maybe the wrong amount was ordered - and someone has to be sent off to a local store for additional ingredients. This is a cache miss, and the store now plays the role of system RAM. In a time-critical situation such as this one, the cook wants the ingredients ASAP. A closer store would be better, or perhaps a store with faster checkout lanes, but provided that the trip does not take a really long time, any store is about as good as another. Basically, system RAM with its timings and latencies can have an impact, but having a really fast memory controller (i.e. a store next door), even with slower RAM (slow checkout lanes), can matter more than having the fastest RAM in the world.
This is all well and good for smaller restaurants and chains, but a large corporation (e.g. McDonald's) cannot simply walk next door to pick up some frozen burgers. In this case, the whole supply chain needs to be highly efficient. Instead of ordering supplies once a week, inventories might be checked every night, and orders placed as necessary. Headquarters has forecasts based on past requirements and may send orders to their suppliers months in advance. This supply chain correlates loosely with the idea of outstanding memory requests, prefetch logic, deeper buffers, etc. Bandwidth also comes into play here, as a large chain might have several large trailers of supplies en route at any point in time, while a smaller chain might be able to get by with only one or two moderately-sized delivery vans.
With faster processors, faster buses, faster RAM, etc., the analogy is moving towards all processors being large corporations with huge demands. Early 8088 and 8086 processors could just wander to the local store as necessary, much as most adults do for their own cooking needs. As the amount of data being processed increases, though, everything becomes exponentially more difficult. There is a big jump from running one small restaurant that serves a few dozen people daily to serving hundreds of people daily, to running several locations, to running a corporation that has locations scattered across the world. That is essentially what we have seen in the world of computer processors. We have gone from running a local "mom-and-pop" burger joint to running McDonald's, Burger King, and several other hamburger chains.
This analogy is probably flawed at numerous levels, but hopefully it helps. If you think about it, the complexity of any one subsystem of the modern PC is probably hundreds of times greater than that of the entire original IBM PC. The change did not occur instantly, but even the largest of technology corporations are going to have a lot of trouble staying at the top of every area of computers.