Mitigation of Memory Errors on Commodity Workstations

Date

2023-06

Publisher

Addis Ababa University

Abstract

Bits stored in Dynamic Random Access Memory (DRAM) can flip at random for various reasons, such as cosmic ray incidence, electrical noise, and temperature fluctuations. To handle these bit flips, Error Correcting Code (ECC) is integrated into many DRAM modules; such modules are referred to as ECC-DRAM. One commonly used ECC algorithm is Single Error Correction Double Error Detection (SECDED), which corrects single bit flips and detects double bit flips. However, SECDED is only available on ECC-DRAM, so we implemented an optimized version of SECDED that is suitable for non-ECC devices. In addition, to increase the number of bit flips that can be detected, we propose a novel approach called hash-based software ECC, which uses hash functions. Hash functions provide a robust means of ensuring data integrity because of their deterministic nature and avalanche effect. Once a bit flip is detected by our method, a brute-force approach is used to correct the flipped bit or bits. Our implementation of SECDED is up to 6x faster than a direct implementation of SECDED for 1 KB of data. The proposed hash-based software ECC can detect any number of bit flips, and the number of bit flips it corrects is adjustable. In this work, it is set to correct up to 3 bit flips, though it can be tuned to correct more at the cost of additional performance overhead. We integrated our approach into an in-memory database and found that the overhead introduced for bit-flip detection was less than 3%.
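
The hash-based idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the choice of SHA-256, the function names (make_tag, flip_bits, check_and_correct), and the assumption that the stored hash itself is never corrupted are all illustrative assumptions. Detection compares a recomputed hash with the stored one; correction tries every combination of up to max_flips bit positions until the hash matches again. Detection here is probabilistic in the sense that it relies on hash collisions being negligibly unlikely.

```python
import hashlib
from itertools import combinations


def make_tag(data: bytes) -> bytes:
    """Compute the integrity tag stored alongside the data (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of data with the given bit positions inverted."""
    buf = bytearray(data)
    for p in positions:
        buf[p // 8] ^= 1 << (p % 8)
    return bytes(buf)


def check_and_correct(data: bytes, tag: bytes, max_flips: int = 3):
    """Return the verified (possibly repaired) data, or None if uncorrectable.

    Assumes the tag itself is intact. The search cost grows roughly as
    C(n, k) hash evaluations for n data bits and k flipped bits.
    """
    if make_tag(data) == tag:
        return data  # no bit flip detected
    nbits = len(data) * 8
    # Brute force: try 1 flipped bit, then 2, ... up to max_flips.
    for k in range(1, max_flips + 1):
        for positions in combinations(range(nbits), k):
            candidate = flip_bits(data, positions)
            if make_tag(candidate) == tag:
                return candidate
    return None


if __name__ == "__main__":
    original = b"ECC demo"
    tag = make_tag(original)
    damaged = flip_bits(original, [3, 40])  # simulate a 2-bit upset
    repaired = check_and_correct(damaged, tag, max_flips=3)
    print("repaired:", repaired == original)
```

The sketch makes the trade-off mentioned in the abstract visible: raising max_flips widens what can be corrected, but the number of candidates to hash grows combinatorially with the block size, so protecting small blocks keeps correction tractable.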
