Mitigation of Memory Errors on Commodity Workstations
Date
2023-06
Publisher
Addis Ababa University
Abstract
Bits stored in Dynamic Random Access Memory (DRAM) can flip at random
for various reasons, such as cosmic ray incidence, electrical noise, and temperature fluctuations.
To handle these bit flips, Error Correcting Code (ECC) is integrated into many
DRAM modules, which are then referred to as ECC-DRAM. One algorithm commonly used
in ECC is Single Error Correction Double Error Detection (SECDED), which corrects
single bit flips and detects double bit flips. However, SECDED is only
available on ECC-DRAM, so we implemented an optimized version of SECDED that is
suitable for non-ECC devices. In addition, to increase the number
of bit flips that can be detected, we proposed a novel approach called hash-based software
ECC, which uses hash functions. Hash functions provide a robust means of ensuring
data integrity because of their deterministic nature and avalanche effect. After a bit
flip is detected by our method, a brute-force approach is used to correct the flipped
bit or bits. Our implementation of SECDED is up to 6x faster than a direct implementation
of SECDED for 1 KB of data. The proposed hash-based software ECC can detect
any number of bit flips and correct an adjustable number of them. In this work,
hash-based software ECC is set to correct up to three bit flips, though it can be tuned to correct
any number of flips at the cost of additional performance overhead. We integrated our approach into an
in-memory database and found the overhead of bit-flip detection to be less than 3%.
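
The hash-based detection and brute-force correction described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the choice of SHA-256, the function names (digest, flip_bits, check_and_correct), and the assumption that the stored digest itself is never corrupted are all hypothetical and chosen only for illustration.

```python
import hashlib
from itertools import combinations


def digest(data: bytes) -> bytes:
    """Reference hash of a data block (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).digest()


def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of data with the bits at the given bit indices inverted."""
    buf = bytearray(data)
    for p in positions:
        buf[p // 8] ^= 1 << (p % 8)
    return bytes(buf)


def check_and_correct(data: bytes, stored_digest: bytes, max_flips: int = 3):
    """Detect corruption by hash mismatch, then brute-force the flipped bits.

    Returns (corrected_data, flipped_positions), or (None, None) when no
    candidate with at most max_flips flipped bits matches the stored digest.
    Assumes stored_digest is intact.
    """
    if digest(data) == stored_digest:
        return data, ()  # no flip detected
    nbits = len(data) * 8
    # Try every combination of 1, 2, ..., max_flips bit positions.
    for k in range(1, max_flips + 1):
        for positions in combinations(range(nbits), k):
            candidate = flip_bits(data, positions)
            if digest(candidate) == stored_digest:
                return candidate, positions
    return None, None  # detected, but not correctable within max_flips


# Usage: protect a small block, inject two bit flips, then recover.
original = b"ECC!"
reference = digest(original)
corrupted = flip_bits(original, (3, 17))
fixed, where = check_and_correct(corrupted, reference, max_flips=3)
```

The search space grows combinatorially with both the block size and max_flips, which is consistent with the abstract's remark that correcting more flips comes at a cost in performance; detection alone needs only one hash computation per block.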