Crash tolerance is a new (as of release 1.21) feature that can be
enabled at compile time, and used in environments with appropriate
support from the OS and the filesystem. As of version
1.24, this means a Linux kernel 5.12.12 or later and
a filesystem that supports reflink copying, such as XFS, BtrFS, or
OCFS2. If these prerequisites are met, crash tolerance code will
be enabled automatically by the configure script when building the
package.
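
The two prerequisites above can also be probed by hand on a target machine. The following is a minimal sketch, not the check performed by the configure script: the function names release_at_least, running_kernel_at_least and dir_supports_reflink are hypothetical helpers introduced here for illustration. It assumes a Linux system whose headers define the FICLONE ioctl in linux/fs.h, and it treats a successful clone of a small temporary file as evidence that the directory's filesystem supports reflink copying.

```c
#include <fcntl.h>
#include <linux/fs.h>      /* FICLONE */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if RELEASE (e.g. "5.15.0-91-generic")
   is at least MAJOR.MINOR.PATCH, 0 if it is older, -1 if it cannot
   be parsed. */
int
release_at_least (const char *release, int major, int minor, int patch)
{
  int a = 0, b = 0, c = 0;

  if (sscanf (release, "%d.%d.%d", &a, &b, &c) < 2)
    return -1;
  if (a != major)
    return a > major;
  if (b != minor)
    return b > minor;
  return c >= patch;
}

/* Hypothetical helper: apply the comparison above to the running kernel. */
int
running_kernel_at_least (int major, int minor, int patch)
{
  struct utsname u;

  if (uname (&u) != 0)
    return -1;
  return release_at_least (u.release, major, minor, patch);
}

/* Hypothetical helper: return 1 if a file created in DIR can be
   reflink-cloned, 0 if the filesystem refused the clone, -1 if the
   probe files could not be created or written. */
int
dir_supports_reflink (const char *dir)
{
  char src[4096], dst[4096];
  int sfd, dfd, rc;

  snprintf (src, sizeof src, "%s/reflink-src-XXXXXX", dir);
  snprintf (dst, sizeof dst, "%s/reflink-dst-XXXXXX", dir);

  sfd = mkstemp (src);
  if (sfd < 0)
    return -1;
  dfd = mkstemp (dst);
  if (dfd < 0)
    {
      close (sfd);
      unlink (src);
      return -1;
    }

  if (write (sfd, "probe", 5) != 5)
    rc = -1;
  else
    /* FICLONE makes DST share the data blocks of SRC. */
    rc = ioctl (dfd, FICLONE, sfd) == 0;

  close (sfd);
  close (dfd);
  unlink (src);
  unlink (dst);
  return rc;
}
```

Under these assumptions, running_kernel_at_least (5, 12, 12) and dir_supports_reflink applied to the directory that will hold the database together indicate whether the environment is likely to qualify. A result of 0 from the reflink probe is expected on filesystems such as ext4 or tmpfs.
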
The crash-tolerance mechanism, when used correctly, guarantees that a
logically consistent (see Database consistency) recent state of
application data can be recovered following a crash. Specifically, it
guarantees that the state of the database file corresponding to the
most recent successful gdbm_sync call can be recovered.
If the new mechanism is used correctly, crashes such as power
outages, OS kernel panics, and (some) application process crashes
will be tolerated. Non-tolerated failures include physical
destruction of storage devices and corruption due to bugs in
application logic. For example, the new mechanism won't help if a
pointer bug in your application corrupts GDBM's private in-memory
data, which in turn corrupts the database file.
In the following sections we will describe how to enable crash tolerance in your application and what to do if a crash occurs.
The design rationale of the crash tolerance mechanism is described in detail in the article, Crashproofing the Original NoSQL Key-Value Store, by Terence Kelly, ACM Queue magazine, July/August 2021, available from the ACM Digital Library. If you have difficulty retrieving this paper, please contact the author at tpkelly@acm.org, tpkelly@cs.princeton.edu, or tpkelly@eecs.umich.edu.