Memory Error Correction

Memory Parity Errors: Causes and Suggestions
Parity Memory
Error Correcting Code Memory
Results of Mixing Parity and ECC
Error Correcting Code-Parity (ECC-P) Memory
   Requirements for Using ECC-P
   Enabling ECC-P
   Performance, ECC vs. ECC-P vs. EOS
ECC on SIMMs (EOS) Memory
   Mixing EOS and Parity?

Memory Parity Errors: Causes and Suggestions

Preface- This is written for the clone systems. Some stuff is not exactly true for microchannel systems. However, you will recognize some reasons for Traps under OS/2 and the odd memory errors that seem incomprehensible.

From M$, KB Article Q101272
Both IBM OS/2 2.x and Windows NT seem to experience problems which appear to be associated with system memory in some circumstances. It can be frustrating to have a system that is able to run DOS, Windows 3.1 or OS/2 1.x and suddenly find it cannot run Windows NT due to this problem. The first issue to clear up is that not all NMI errors are due to memory. Other boards in the system can cause this problem, and components directly on the system motherboard can be at fault. When memory is at fault, it is usually for the following reasons:

1. Memory not functioning at the specified access rate as required by system board.
If the system specification calls for 80 ns access rate, Windows NT most likely fails if memory is accessing at a slower rate such as 90 ns. Even though the chips may be marked as 80 ns, in testing, some fail to meet this access rate. Quite often memory chips run at a slower speed when they reach operating temperature. This produces an effect called "speed drift." The symptoms are a system which runs Windows NT when first turned on; however, after 15 minutes or so, the system starts having memory errors. A high quality SIMM tester can cycle the chips through various voltage and heat cycles, so this is fairly easy to see.

2. Memory meets specs, but speeds are different between SIMMs.
The average access rate may be 70 ns on one SIMM module while the next is running at 60 ns. We have found SIMMs stamped at the factory to be rated at a 70 ns average access rate to actually be running as fast as 50 ns. Although the SIMMs are obviously well under the system required access specification, the difference of 10 ns or more between them can often cause problems on some systems. An interesting note here is that you can move these to a different system board which is using a different BIOS and chip set, and it may not have any memory problems. This is because each BIOS and chip set regulate the "refresh wait states" used for timing, and this difference often allows for variance in speed to be acceptable. If your system's BIOS allows you to adjust the "wait states" for memory refresh, this often will allow the system to run with SIMMs or DRAM memory chips which are running at different access rates. The downside to increasing the number of wait states is a slower system.

3. Individual chips on SIMM module run at different access rates.
A difference of 10 ns or more between bits has been known to cause problems. This once again can be regulated somewhat by the BIOS and chip set of the system board if it allows you to lengthen the refresh wait states for memory access.

4. One of the memory chips is being affected by "cell leakage."
This is a true parity error and is also known as a "soft error." This occurs when the change in the state of an individual cell (a zero or one) electrically leaks into a neighboring cell, changing its state. When the memory is read back, it no longer matches the parity bit's checksum value and an NMI is issued to the processor signaling a parity error has occurred. This memory SIMM must be replaced. If problems persist with replacement chips, there is quite possibly a voltage or heat anomaly occurring with the socket or circuitry which is damaging the chips.

5. Cache memory is another thing to suspect.
In some instances the Cache memory access rates were too slow and caused enormous problems. On most Intel-based 486 computers, a cache access rate of 15 ns to 25 ns is normal. You will most likely have problems if it is slower than 25 ns. The system manufacturer can provide the specifications and locations of these chips.

In general, you should first carefully clean the system of dust. This includes the areas allowing ventilation so that heat does not build up abnormally. The contacts of all boards and SIMMs should be cleaned. You can use the eraser of a pencil to do this, thus ensuring good contacts.
Ed. Uh, keep your fingers off the contacts in the first place. Never saw a super grungy PS/2 SIMM yet. Dusty, hell yes...

From Dr. Jim
> I think I would try to clean the contacts on all the removable memory with a pencil eraser, corrosion may be causing some flaky electrical connections.

Not a good plan, if the contacts are gold plated. A pencil eraser will strip away some of the gold. The edges of a dollar bill, or a chunk of good quality bond typing paper, folded over and rubbed briskly over the contacts seems to work very well.

Continued...
   Be certain that all boards are firmly seated in their slots or sockets. It may be necessary to replace old cabling which may degrade over time and under high temperatures. Power supplies can also cause many problems, thus, if possible, have the output voltages checked. Monitors can cause strange behaviors on your system as well. It is also highly recommended that computers be placed on some type of Surge Suppression power strip since after a power outage occurs, the return of power back on is usually a fairly high surge and can permanently damage sensitive electrical components of your system.
   Ed. PS/2 systems do not need to be placed on surge suppressors. The peripheral equipment (monitors, printers, scanners, etc) does benefit from the surge suppressor, however.
   A UPS needs to be TRUE SINE WAVE! Modified sine wave is NOT acceptable. A modified sine wave UPS will give you random PS/2 power supply power-down and power-up cycles. Since the incoming AC power is OK, the UPS never trips, but the modified sine wave fed to the PS/2 power supply triggers the random power cycling.

Parity Memory
Parity memory is standard IBM memory with 32 bits of data space and 4 bits of parity information (one check bit per byte of data). The 4 bits of parity information can tell you that an error has occurred, but they do not carry enough information to locate which bit is in error. In the event of a parity error, the system generates a non-maskable interrupt (NMI) which halts the system. Double-bit errors within the same byte go undetected with parity memory.
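
The per-byte check described above can be sketched in a few lines of Python. This is an illustration only: it assumes even parity (the source does not say which polarity the hardware uses), and the function names are made up for the example. It shows why a single flipped bit is caught while two flipped bits in the same byte slip through.

```python
from typing import List


def byte_parity(byte: int) -> int:
    """Return the even-parity check bit for one 8-bit value (assumed polarity)."""
    return bin(byte & 0xFF).count("1") % 2


def parity_bits(word: int) -> List[int]:
    """Return the 4 check bits, one per byte, for a 32-bit word."""
    return [byte_parity(word >> shift) for shift in (0, 8, 16, 24)]


def parity_error(word: int, stored: List[int]) -> bool:
    """True if any byte no longer matches the check bit stored with it."""
    return parity_bits(word) != stored


word = 0x12345678
stored = parity_bits(word)          # check bits written alongside the data

single_flip = word ^ 0x00000010     # one bit flipped in byte 0
double_same = word ^ 0x00000030     # two bits flipped in byte 0
double_diff = word ^ 0x00010010     # one bit flipped in byte 0, one in byte 2

print(parity_error(single_flip, stored))   # True: detected, but not locatable
print(parity_error(double_same, stored))   # False: the two flips cancel out
print(parity_error(double_diff, stored))   # True: each byte has its own check bit
```

Note that the check only reports that some byte is bad; nothing in the 4 check bits says which data bit flipped, which is why the system can only halt.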

Error Correcting Code Memory
   Traditionally, systems which implement only parity memory halt on single-bit errors, and can miss double-bit errors entirely. Clearly, as memory is increased, better techniques are required.
   One technique to deal with double-bit errors is Error Correcting Code (or sometimes Error Checking and Correcting). ECC can detect and correct single-bit errors, detect double-bit errors, and detect some triple-bit errors.
   ECC works like parity by generating extra check bits with the data as it is stored in memory. However, while parity uses only 1 check bit per byte of data, ECC uses 7 check bits for a 32-bit word and 8 bits for a 64-bit word. These extra check bits along with a special hardware algorithm allow for single-bit errors to be detected and corrected in real time as the data is read from memory.

The data is scanned as it is written to memory. This scan generates a unique 7-bit pattern which represents the data stored. This pattern is then stored in the 7-bit check space.
As the data is read from memory, the ECC circuit again performs a scan and compares the resulting pattern to the pattern which was stored in the check bits.

   If a single-bit error has occurred (the most common form of error), the scan will always detect it, automatically correct it and record its occurrence. In this case, system operation will not be affected.
   The scan will also detect all double-bit errors, though they are much less common. With double-bit errors, the ECC unit will detect the error and record its occurrence in NVRAM; it will then halt the system to avoid data corruption. The data in NVRAM can then be used to isolate the defective component.
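
The write-time and read-time scans described above can be illustrated with a textbook Hamming code plus one overall parity bit (SEC-DED). This is only a sketch: the source does not say which algorithm the IBM controller actually uses, and the bit layout, the `encode`/`decode` names, and the status strings are assumptions made for the example. It does use 7 check bits for a 32-bit word, matching the figure given above.

```python
from typing import List, Optional, Tuple

CODE_LEN = 39                      # 32 data bits + 7 check bits
CHECK_POS = [1, 2, 4, 8, 16, 32]   # Hamming check bits; position 0 is overall parity
DATA_POS = [p for p in range(1, CODE_LEN) if p not in CHECK_POS]  # 32 positions


def encode(data: int) -> List[int]:
    """Scan a 32-bit word on write and return the 39-bit codeword."""
    code = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    for c in CHECK_POS:
        # Check bit c gives even parity over every position whose index has bit c set.
        code[c] = sum(code[p] for p in range(1, CODE_LEN) if p & c) % 2
    code[0] = sum(code) % 2        # overall parity over the other 38 bits
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Rescan on read; return (status, data), with data None if uncorrectable."""
    syndrome = 0
    for p in range(1, CODE_LEN):
        if code[p]:
            syndrome ^= p          # XOR of the positions of all set bits
    odd_overall = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not odd_overall:
        status = "ok"
    elif odd_overall and syndrome < CODE_LEN:
        fixed[syndrome] ^= 1       # syndrome names the flipped bit (0 = overall parity bit)
        status = "corrected"
    else:
        return "uncorrectable", None   # double-bit (or some multi-bit) error
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data


word = 0xDEADBEEF
stored = encode(word)

one_bad = list(stored)
one_bad[5] ^= 1                    # single-bit error
two_bad = list(one_bad)
two_bad[20] ^= 1                   # second error in the same word

print(decode(stored))              # status "ok"
print(decode(one_bad))             # status "corrected", original word recovered
print(decode(two_bad))             # status "uncorrectable"
```

In this sketch the "uncorrectable" branch is where a real system would log the failure and halt rather than hand back corrupted data.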
   In order to implement an ECC memory system, you need an ECC memory controller and ECC SIMMs. ECC SIMMs differ from standard memory SIMMs in that they have additional storage space to hold the check bits. The Server 95 ECC support views memory in 1MB segments and has the ability to deallocate a failing segment.

Results of Mixing Parity and ECC Memory
From Stephan Goll
My box (95A) showed the expected memory error. I didn't know that I had mixed ECC and parity (bankwise), so I ran the memory tests. This procedure told me what I had been doing, disabled the ECC-equipped banks, and the box ran fine after that with reduced memory. I believe the reason is that the first bank determines the type of RAM the box expects to see.
Btw, I realized that the memory in the first bank is tested more intensively than in the other banks, because I have failing memory modules that work very well in one of the other banks, even in the memory tests and under Linux.

Error Correcting Code-Parity (ECC-P) Memory
Previous IBM servers such as the 9585 were able to use standard memory to implement what is known as ECC-P. ECC-P takes advantage of the fact that a 64-bit word needs 8 bits of parity in order to detect single-bit errors (one bit/byte of data). Since it is also possible to use an ECC algorithm on 64 bits of data with 8 check bits, IBM designed a memory controller which implements the ECC algorithm using the standard memory SIMMs.
The following shows the implementation of ECC-P. When ECC-P is enabled via the reference diskette, the controller reads/writes two 32-bit words and 8 bits of check information to standard parity memory. Since 8 check bits are available on a 64-bit word, the system is able to correct single-bit errors and detect double-bit errors just like ECC memory.
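
The arithmetic behind ECC-P can be checked with a short calculation. The sketch below assumes a standard Hamming SEC-DED code (the source does not name the algorithm), and the function names are invented for the example. It compares the check bits that parity memory already provides with the check bits such a code needs.

```python
def parity_check_bits(data_bits: int) -> int:
    """Parity memory stores one check bit per byte of data."""
    return data_bits // 8


def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming code over data_bits, plus one overall parity bit."""
    r = 1
    while 2 ** r < data_bits + r + 1:   # r bits must address every bit position
        r += 1
    return r + 1                        # extra bit adds double-error detection


for width in (32, 64):
    print(width, parity_check_bits(width), secded_check_bits(width))
# prints:
# 32 4 7
# 64 8 8
```

At 32 bits the 4 parity bits fall short of the 7 needed, but at 64 bits the 8 parity bits of two paired 36-bit SIMMs are exactly enough, which is why ECC-P works on pairs.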

While ECC-P uses standard inexpensive memory, it needs a specific memory controller that is able to read/write the two memory blocks and to check and generate the check bits. Also, the additional logic necessary to implement the ECC circuitry makes it slightly slower than true ECC memory. With the Server 85 ECC-P implementation, the system views memory as matched pairs of SIMMs and, in case of a double-bit failure, will deallocate both SIMMs in a matched pair. Now that the price gap between standard memory and ECC has narrowed, IBM no longer implements ECC-P.

NOTE! Parity and ECC-on-SIMM memory can not be installed within the same system. [ed. untested]

Requirements to use ECC-P
You have to use matched pairs of memory SIMMs in order to use ECC-P. The 9585 is the only PS/2 that supports ECC-P. With matched pairs, the 9585 supports ECC-P for the entire amount of supported system memory (256MB max on K and N models, 64MB on X models). Unmatched SIMMs can be installed in the Server 85-xXx ONLY; however, ECC-P can be turned on for matched-pair SIMMs only. On the Model 85-xXx, any unmatched SIMMs that are installed will be run as normal parity memory.

Enabling ECC-P
ECC capability can be turned on or off without changing any hardware, memory, or switches, and without opening the cover; it is enabled or disabled via menus on the System Partition (Ref Diskette). You select the memory support from the memory checking method item on the system configuration menu screens. This option allows the user to choose between ECC-P and normal parity operation.

Performance Degradation
ECC-P detection and correction takes place in the memory controller rather than in the memory SIMM as on the Base 3 and 4 Processor Complex of the Model 95. If ECC-P is enabled, it will cause up to a 14% performance degradation compared to the more efficient Base 3 and 4 Processor Complex (Model 95) which is only 3%. This performance degradation is only for the memory subsystem, not for the total throughput. (Ed. I have heard a few reports (retorts) that it's a little more noticeable than that...)

As previously discussed, systems which employ ECC memory have slightly longer memory access times depending on where the checking is done. It should be stressed that this affects only the access time of external system memory, not L1 or L2 caches. The following table shows the performance impacts as a percentage of system memory access times of the different ECC memory solutions.

Again, these numbers represent only the impact to accessing external memory. They do not represent the impact to overall system performance which is harder to measure but will be substantially less.

          SIMM    MEMORY CONTROLLER    IMPACT TO ACCESS TIME

ECC        X             X                     3%

ECC-P                    X                     14%

EOS        X                                   None
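
To see why the overall impact is smaller than the figures in the table, consider a simple average-access-time model. Everything numeric in this sketch except the 3% and 14% penalties is hypothetical: the cache hit rate and the cache and memory timings are made-up inputs chosen for illustration, not measurements of any PS/2, and the model ignores everything besides memory access.

```python
def average_access_ns(hit_rate: float, cache_ns: float,
                      memory_ns: float, ecc_penalty: float) -> float:
    """Average access time when only cache misses pay the ECC penalty."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * cache_ns + miss_rate * memory_ns * (1.0 + ecc_penalty)


def overall_slowdown(hit_rate: float, cache_ns: float,
                     memory_ns: float, ecc_penalty: float) -> float:
    """Fractional increase in average access time caused by the ECC penalty."""
    base = average_access_ns(hit_rate, cache_ns, memory_ns, 0.0)
    with_ecc = average_access_ns(hit_rate, cache_ns, memory_ns, ecc_penalty)
    return with_ecc / base - 1.0


# Hypothetical inputs: 90% cache hit rate, 20 ns cache, 70 ns main memory.
for name, penalty in (("ECC", 0.03), ("ECC-P", 0.14), ("EOS", 0.0)):
    slowdown = overall_slowdown(0.90, 20.0, 70.0, penalty)
    print(f"{name}: {slowdown:.1%} slower on average")
```

Under these assumed inputs the average slowdown comes out well below the raw memory penalty, because most accesses are served by the cache and never touch external memory.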

ECC on SIMMs (EOS) Memory
    ECC On SIMM (EOS) is memory with the ECC logic function completely contained on the SIMM. This differs from true ECC, where the planar or complex memory controller provides the ECC logic. EOS provides detection and correction of any single-bit error in each byte of SIMM data before the data leaves the SIMM. EOS can upgrade parity-based systems to a fully functional single-error-correct (SEC) ECC system.
   EOS appears to a system like normal, 36 bit wide, 72 pin FPM. 4MB EOS SIMMs use IBM presence detect, and the 8MB (Tall and Wide), 16MB and 32MB EOS SIMMs have industry standard presence detects. All EOS uses gold tabs.
   The only PS/2 systems that can use the 16 and 32MB EOS are the 9585 K/N. With these systems, leave the memory checking in System Programs as Parity. Do not enable ECC-P (software based ECC) because it's redundant and slows the system down.

Mixing EOS and Parity?
Not so, according to the Wunderkind Peter Wendt
>Can I use both EOS and FastPage SIMMs with parity together in a PC-Server 320 ? Has anyone tested it ?

   Ahem ... not that I knew of. The EOS is ECC-On-SIMM ... basically a workaround to use some ECC sort of error detection on a systemboard that is originally designed for Parity only (like the crappy Micronics board in the 320 and 520) and the technical basis of it all is still Parity ....
   But: the BIOS differs among the two. EOS is detected as ECC and Parity ... well ... as Parity. This however triggers two different routines for the error handling. The EOS memory is capable to catch single bit failures on its own and only signals a corrected bit failure to the systemboard logic (and then further to the POST code at next power on, respectively to the failure report and to e.g. Netfinity Manager), while Parity just dumps into a blue screen or NMI error routine of any kind.
   I have a 520 - an 8641-MZV with 128MB of EOS memory. I tried using the 8 32MB Parity modules I got for my Server 85 9585-0NG - and they did not work (very well - wonder why). Trying to "mix and match" 2 sets of each caused a power on error 200-something right from the start.
