Techniques to help an embedded system tolerate a power failure

One of the most important things to take into account when we decide to develop a new embedded system is the full range of operating conditions in which customers will use our device. A desktop PC user normally expects to run a shutdown procedure before powering off the system, but for the user of an embedded system this is not true at all! In fact embedded systems such as ADSL modems, Wi-Fi access points or repeaters, printers (even if some printers do require a proper shutdown), etc. are simply disconnected from the power supply by their users without any shutdown procedure.

However, these devices are all GNU/Linux based just like our desktop PC (I know, Windows rules on PCs! But this is true at least for my PC), so how can we ensure that the file system of our embedded device and its applications will keep working at the next boot? What can prevent a system failure after a power failure?

Well, to do so we have to consider several techniques that help the embedded system we’re developing tolerate a catastrophic event such as a power failure. We can work at four major levels.

Hardware solutions

Hardware solutions are quite expensive compared with a pure software solution, but in some circumstances we cannot do without them at all! We must consider them when our application is so important that it must continue running even during a power failure.

A first solution is to use a battery (or a supercapacitor). With a battery our system will keep working even without the main power supply but, in order to have everything under control, we should have a dedicated signal that informs the CPU about the power failure so that the system has time to do a safe and controlled power off. The advantage of this solution is that we can implement it even if our system has no dedicated low-power operating modes (such as reducing the main CPU clock, which is required when the system must keep running on battery power): it simply needs to manage an interrupt line that can be used to signal a userland process to start the shutdown procedure when power gets low. A drawback of this approach is that batteries require periodic testing, and replacement as they age, in order to guarantee enough time to complete the shutdown procedure safely.
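As a rough illustration of the userland side of this scheme, the sketch below assumes that the kernel driver (or the init system) turns the power-fail interrupt into a signal delivered to our process. On Linux SIGPWR is conventionally used for this, but the actual mapping is board specific, so the signal number is left as a parameter. The names power_fail_watch() and power_fail_pending() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Set from the signal handler; sig_atomic_t is safe to write there. */
static volatile sig_atomic_t power_fail_flag = 0;

static void on_power_fail(int signo)
{
    (void)signo;
    power_fail_flag = 1;    /* the real work is done outside the handler */
}

/* Install the handler for the signal chosen to report a power failure
 * (an assumption of this sketch: the platform delivers such a signal).
 * Returns 0 on success, -1 on error. */
int power_fail_watch(int signo)
{
    struct sigaction sa;

    sa.sa_handler = on_power_fail;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* Non-zero once a power failure has been reported. */
int power_fail_pending(void)
{
    return power_fail_flag != 0;
}
```

The main loop of the application would poll power_fail_pending() and, once it returns non-zero, save its state, flush pending writes (for instance with sync()) and start the shutdown procedure while the battery still holds.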

Another important thing to do is to learn the characteristics of the storage devices we’re going to use in the system. For instance, if we decide to use hard disks we should consider that some of them ignore cache flush commands from the OS. We have also had cases where some models of SD cards or USB mass storage devices would corrupt themselves during a power failure, while industrial models (such as eMMC) did not have this problem (unluckily this information was not always published in the datasheet and had to be gathered by experimental testing). That’s why some developers decide not to use devices such as SD cards or eMMC chips (or similar) and use bare flash chips directly instead. Even if all these devices are (NAND) flash based, SD cards and eMMC chips have a dedicated, closed source controller, while on top of bare NAND flash chips we can put a well-known, open source controller implemented in software. The final result is the same, a Linux block device on which we can mount a file system, but the way we get there is completely different!

From the developer’s point of view both are seen as block devices based on NAND flash, but the former is closed source while the latter is open source!

Bootloader solutions

The bootloader can also come into play to help us with system recovery. It is usually a piece of software complex enough to detect when the system is unable to complete a correct boot procedure. In this case we can use the bootloader to perform some automatic or manually activated steps that restore a working configuration.

The first thing to do is to set up our bootloader so that it can detect the presence of a USB mass storage disk or a microSD card and, if one is found, boot from it. In this manner we can easily start an alternate system which in turn can reset our system to a well-defined factory configuration (note that this procedure can also be triggered by pressing a key or by other techniques).

Another thing we can do is to program a watchdog that resets our board if it is not continuously refreshed because of a failure condition (e.g. an unmountable root file system or a main application crash). The bootloader can then detect this condition during the following reboot and start a recovery procedure as above to resolve the problem.
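The refresh side of the watchdog can be sketched as follows. This assumes a Linux watchdog driver exposed as a character device (commonly /dev/watchdog), where opening the device arms the timer and any write restarts the countdown. The function names are hypothetical, and the 'V' written before closing only disarms the timer on drivers that support the "magic close" feature.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open the watchdog device (assumed path: "/dev/watchdog").
 * Returns a file descriptor, or -1 on error. */
int watchdog_open(const char *path)
{
    return open(path, O_WRONLY);
}

/* Refresh ("kick") the watchdog: any write restarts the countdown.
 * Returns 0 on success, -1 on error. */
int watchdog_kick(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* Orderly stop: write the magic character 'V', then close the device.
 * Returns 0 on success, -1 on error. */
int watchdog_close(int fd)
{
    int ret = write(fd, "V", 1) == 1 ? 0 : -1;

    if (close(fd) != 0)
        ret = -1;
    return ret;
}
```

The main application would call watchdog_kick() from its main loop, ideally only after its own health checks pass. If the application crashes, or never starts because the root file system cannot be mounted, the refreshes stop and the board resets.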

Kernel solutions

When we must keep system cost low and cannot afford the expense of a battery, we can still have a good fail-safe system simply by taking some precautions in the kernel.

First of all we have to use a journaling file system, that is, a file system that can tolerate incomplete writes due to power failure, OS crash, etc. Most modern filesystems are journaled, but we have to choose the right one according to the storage device we have on our system. For block devices like hard disks, SD cards, eMMC chips and USB mass storage devices we can use Linux’s Ext4 file system, while for NAND flash chips we should use something like JFFS2 or (better) UBIFS.

In any case, unless our application needs the write performance, we should disable all write caching (check the disk drivers for caching options) and consider mounting the filesystem in sync mode.

Another step that can increase system fault tolerance is to keep, as far as possible, application executables and operating system files on their own read-only partition(s), while read/write data lives on its own writable partition. This way, even if our application data gets corrupted, the system should still be able to boot (albeit with a fail-safe default configuration).

However, in some circumstances keeping the root file system read-only is not easy (for instance when we decide to use a standard distribution). That’s why we can use some tricks to keep the root file system writable while still protecting our important files:

  • We can open individual files as read-only (e.g. by using something like fp = fopen("configuration.ini", "r")).
  • We can use file mode bits to set important files as read-only (e.g. with the chmod command used as chmod a=r configuration.ini).
  • We can initially mount a partition as read-only and then remount it as read-write only when we need to write data (e.g. by using mount -o remount,rw /).
  • We can use a file system overlay: the whole file system is mounted read-only and a transparent read/write layer, which holds our modifications, is placed on top of it.

This last solution is very clever since it gives us a read-only file system with a transparent read/write layer on top, which in turn lets us see every file on the lower read-only layer as if it were writable!

In the above figure we have a physical disk split into two partitions: the first one holds all the files our system needs to work, while the second one can be completely empty. Once we mount partition 2 in read/write mode as an overlay on top of the read-only partition 1, what we get is a logical disk mounted in read/write mode. When we write a file it goes into partition 2; when we read it we get its contents from partition 2 if it exists there, otherwise we go one level deeper and get the contents from partition 1.

The advantage of this solution is that we can wipe out all modifications and return to factory settings at once, just by erasing partition 2!
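On Linux this layering can be obtained with overlayfs. The following is a minimal sketch, assuming a kernel with overlayfs support and root privileges. The function names and all paths are hypothetical. Note that overlayfs also needs a work directory on the same file system as the upper layer, so both would live on partition 2.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build the overlayfs option string into buf.
 * Returns 0 on success, -1 on invalid input or if buf is too small. */
int overlay_options(char *buf, size_t size, const char *lower,
                    const char *upper, const char *work)
{
    int n;

    /* A comma separates options, so paths containing one are rejected
     * (a simplification made by this sketch). */
    if (strchr(lower, ',') || strchr(upper, ',') || strchr(work, ','))
        return -1;
    n = snprintf(buf, size, "lowerdir=%s,upperdir=%s,workdir=%s",
                 lower, upper, work);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Mount the merged read/write view on target (needs root privileges).
 * lower is the read-only tree (partition 1), upper and work are
 * directories on the writable partition 2.
 * Returns 0 on success, -1 on error (see errno). */
int overlay_mount(const char *lower, const char *upper, const char *work,
                  const char *target)
{
    char opts[512];

    if (overlay_options(opts, sizeof(opts), lower, upper, work) != 0)
        return -1;
    return mount("overlay", target, "overlay", 0, opts);
}
```

A call such as overlay_mount("/ro", "/rw/upper", "/rw/work", "/merged") would expose the merged tree under /merged, and a factory reset then amounts to emptying the upper directory.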

Another solution is to use a RAM disk as the second partition of the above example, so that temporary files are stored in RAM. If we keep those writes off-disk we eliminate them as a potential source of corruption and we also reduce flash wear and tear. The disadvantage is that each time we reboot the system all modifications (if not saved elsewhere) will vanish.

Application solutions

Even if we apply everything we have seen so far, we must still take into account that our custom application should avoid bad operations that could nullify any hardware or kernel effort.

Let’s suppose we are using a journaling filesystem on a system which has no batteries. In this situation we might think that we are safe since, even without battery support, our file system is protected. Well, this is certainly true, but having a journaling file system does NOT mean that we cannot lose our data! In fact, if we periodically update an important file (for example a configuration file or a database file) and the power fails during one of these updates, we can be sure that the file system will not be corrupted, but our file may well end up truncated!

To avoid this possibility, the first trick we can use is to perform write operations in a well-defined order so that our data cannot be lost. An example of bad code is the following:

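/* BAD: "w+" truncates the file as soon as it is opened */
fp = fopen("configuration.ini", "w+");
ret = fread(buf, rsize, 1, fp);      /* the old contents are already gone */

ret = fwrite(buf, wsize, 1, fp);     /* a power failure before this completes loses the data */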

While good code looks like the following:

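/* Needs <stdio.h> and, for fsync(), <unistd.h> */
fpr = fopen("configuration.ini", "r");
ret = fread(buf, rsize, 1, fpr);
fclose(fpr);

fpw = fopen("configuration.ini.new", "w");
ret = fwrite(buf, wsize, 1, fpw);    /* returns the number of items written */
if (ret != 1) {
     /* write failed: drop the partial copy, the original is untouched */
     fclose(fpw);
     remove("configuration.ini.new");
} else {
     fflush(fpw);
     fsync(fileno(fpw));             /* push the new data to the storage device */
     fclose(fpw);
     /* rename() returns 0 on success */
     ret = rename("configuration.ini", "configuration.ini.old");
     if (ret == 0)
         ret = rename("configuration.ini.new", "configuration.ini");
}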

In this manner we are sure that in any case we have a copy of our data (either the new modified version or the old not-yet-modified one) that we can recover after a power failure.
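The same write ordering can be wrapped into a reusable helper. The sketch below is only one possible arrangement: the name save_config() and the ".new" and ".old" suffixes are assumptions, and it relies on POSIX fsync() and rename() being available.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace the file at path with len bytes taken from data, keeping the
 * previous version as "<path>.old".
 * Returns 0 on success, -1 on error. */
int save_config(const char *path, const void *data, size_t len)
{
    char newp[256], oldp[256];
    FILE *fp;
    int ok;

    if (strlen(path) + sizeof(".new") > sizeof(newp))
        return -1;                        /* path too long for the buffers */
    snprintf(newp, sizeof(newp), "%s.new", path);
    snprintf(oldp, sizeof(oldp), "%s.old", path);

    fp = fopen(newp, "w");
    if (fp == NULL)
        return -1;
    ok = fwrite(data, 1, len, fp) == len;
    ok = ok && fflush(fp) == 0;           /* stdio buffer -> kernel */
    ok = ok && fsync(fileno(fp)) == 0;    /* kernel -> storage device */
    if (fclose(fp) != 0)
        ok = 0;
    if (!ok) {
        remove(newp);                     /* the original is still intact */
        return -1;
    }

    /* On the very first save there is no previous version to keep,
     * so a failure of this rename is deliberately ignored. */
    rename(path, oldp);
    return rename(newp, path) == 0 ? 0 : -1;
}
```

If power fails between the two rename() calls, the main file is missing but both the ".new" and the ".old" copies are complete, so a boot-time check can promote one of them. A stricter variant could also call fsync() on the containing directory so that the renames themselves reach the storage device.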

In order to recover our system as gracefully as possible we should also maintain at least two copies of its configuration settings, a primary and a backup. If the primary fails for some reason, switch to the backup. We should also consider mechanisms for making backups whenever the configuration is changed or after a configuration has been declared good by the user. It is also worth asking whether we have a bootloader or another method to restore the OS and the application after a failure. Finally, make sure the system will beep, flash an LED, or do something else to tell the user what happened.
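The primary/backup fallback can be sketched as below. The function name load_config() is hypothetical, and "readable and non-empty" is used as a placeholder validity check: a real system would more likely verify a checksum or parse the contents before trusting a copy.

```c
#include <stdio.h>

/* Read at most size bytes from the first usable copy, trying primary
 * first and backup second. A copy is considered usable here if it can
 * be opened and is not empty (a simplification of this sketch).
 * Returns the number of bytes read, or -1 if both copies fail.
 * If used_backup is not NULL it is set to 0 (primary) or 1 (backup). */
long load_config(const char *primary, const char *backup,
                 void *buf, size_t size, int *used_backup)
{
    const char *candidates[2] = { primary, backup };
    int i;

    for (i = 0; i < 2; i++) {
        FILE *fp = fopen(candidates[i], "r");
        size_t n;

        if (fp == NULL)
            continue;               /* missing copy: try the next one */
        n = fread(buf, 1, size, fp);
        fclose(fp);
        if (n > 0) {
            if (used_backup != NULL)
                *used_backup = i;
            return (long)n;
        }
    }
    return -1;
}
```

When used_backup comes back as 1, or when the function returns -1, the application knows that something went wrong with the primary copy and can report it to the user, for example through the LED or beep mentioned above.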

Conclusions

There are several things we can do to resolve or (better) to avoid a system failure due to a power loss, and they can be combined in different ways. The best thing to do is to use our experience to judge when enhancing the hardware is mandatory and when we can solve all our problems with a software-only solution.