Software will fail
We want to build bug-free systems, but bugs will creep in, especially as our products grow in size. What can we do to handle those bugs in a customer-friendly manner? And, secondarily, how can we trap the problems early, leaving debugging breadcrumbs behind? Here’s a summary of ideas. But before that, please don’t complain to me about the length and breadth of this post. It was logged for my own future reference. If you don’t understand something, never mind; we’ll discuss something else.
Original title: SW will fail.
Source: Ganssle.com
1. Use a safe language! C and C++ are messes that allow us far too much freedom to create flawed products. Ada eliminated most of these problems… but few projects use Ada anymore. Cyclone, an x86 compiler hosted under Linux, is a variant of C that promotes inherently safe programs. As far as I know, it’s not yet appropriate for most embedded systems due to a lack of ports. But what a great idea! These alternative languages and dialects promote safety in a variety of ways, not the least of which is built-in runtime exception checking.
2. Consider a layer of “middleware” to capture dumb stuff the program does. One of the intriguing aspects of Java is the virtual machine, which does runtime error checking (among other things). Ed Sutter’s free MicroMonitor is an execution environment that includes a flash file system, networking and more. It also captures crashes and has a scripting language that lets you program appropriate actions in case of a failure. Also see his excellent book (Embedded Systems Firmware Demystified).
3. Use a dynamite watchdog timer (WDT) to reset crashed programs, and to signal to the user, or the developer, that a problem was found. Experts warn against kicking the dog in an ISR. There are also stories of too much debug time spent chasing code that crashed simply because it did not kick the WDT often enough. That’s an indictment of the system design rather than a problem with watchdogs, but it’s something to look out for. A neat trick lets you service a WDT safely from inside an ISR: always use a two-level scheme. An interrupt service routine services the WDT at a high enough rate, but only if a flag has been set by the foreground code. The ISR clears the flag whenever it detects that the flag is set. If the flag is not set, the ISR will continue to service the WDT for only about 10 seconds (or whatever time the designer chooses), after which it quits servicing the WDT and allows it to reset the CPU. So all you need to do is set the flag periodically in your foreground code, usually at only one place in the main executive loop, or in other long loops such as polling for serial input. This way you don’t need to sprinkle watchdog accesses all over the code, and you don’t have to count cycles to make sure you haven’t inadvertently added too many lines between WDT services. If the foreground code gets stupid, it will stop setting the flag, and the background will eventually give up and let the WDT do its thing to bring the system back on track. Several other ideas are collected here: http://embedded.com/2003/0301/, http://embedded.com/2003/0302/, and http://embedded.com/2003/0303/.
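The two-level scheme might be sketched roughly as follows. This is only an illustration under assumptions: the tick rate, the grace period, and all of the names (WDT_GRACE_TICKS, foreground_alive, wdt_timer_isr, wdt_kick_hardware) are made up here, and the real hardware kick is replaced by a counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: a 10 ms timer tick, so 1000 ticks is about 10 seconds. */
#define WDT_GRACE_TICKS 1000u

static volatile bool alive_flag;    /* set by foreground, cleared by the ISR */
static uint32_t ticks_since_alive;  /* ISR ticks since the flag was last seen set */
static uint32_t wdt_kicks;          /* stand-in for the real hardware access */

/* Placeholder for the hardware-specific watchdog service sequence. */
static void wdt_kick_hardware(void) { wdt_kicks++; }

/* Foreground code calls this once per pass of the main executive loop. */
void foreground_alive(void) { alive_flag = true; }

/* Body of the periodic timer ISR. Returns false once it has given up. */
bool wdt_timer_isr(void)
{
    if (alive_flag) {
        alive_flag = false;          /* foreground is healthy: restart the count */
        ticks_since_alive = 0;
    } else if (ticks_since_alive < WDT_GRACE_TICKS) {
        ticks_since_alive++;
    }

    if (ticks_since_alive >= WDT_GRACE_TICKS) {
        return false;                /* stop kicking; the WDT will reset the CPU */
    }
    wdt_kick_hardware();
    return true;
}
```

The foreground only ever touches one flag, so the watchdog logic stays in one place and is easy to review.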
4. Implement exception handlers… and test each one extensively. Something like 2/3 of all system crashes can be traced to unimplemented or poorly tested exception handlers.
5. Use the memory management unit, if your CPU has one, to isolate tasks from each other. A task crash will throw an exception; the code can recover from the problem or at least log debugging info.
6. Reset output ports frequently. Some hardware will die horribly if a port assumes an incorrect value for even a microsecond. It’s hard to generalize about embedded systems as they come in so many flavors; use these ideas only if appropriate for your system. ESD-generated port transients can trouble you for some time. It seems that an ESD transient sometimes just resets the output latch in the processor, and a read-back will most often show the latch in a bad state.
7. Apply sanity checks on any input which might possibly be incorrect. In the embedded world, don’t forget to check hardware inputs, like those from an A/D, for values that are “impossible”. (Rule of thumb: When we say “impossible” we often really mean “unexpected”. Expect the unexpected.)
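As a rough sketch of what such a check could look like for an A/D reading: the limits below are invented for a hypothetical 12-bit converter, and adc_sample_ok is an illustrative name, not something from the original article.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed limits: a sensor whose valid output spans counts 200..3900
   and cannot physically move more than 150 counts between samples. */
#define ADC_MIN_VALID 200u
#define ADC_MAX_VALID 3900u
#define ADC_MAX_STEP  150u

/* Returns true and updates *last_good only when the sample is plausible. */
bool adc_sample_ok(uint16_t raw, uint16_t *last_good)
{
    if (raw < ADC_MIN_VALID || raw > ADC_MAX_VALID) {
        return false;                 /* outside the physically possible range */
    }
    uint16_t delta = (raw > *last_good) ? (uint16_t)(raw - *last_good)
                                        : (uint16_t)(*last_good - raw);
    if (delta > ADC_MAX_STEP) {
        return false;                 /* changed faster than the sensor can */
    }
    *last_good = raw;
    return true;
}
```

A rejected sample is a good candidate for a logged breadcrumb, since it usually points at wiring, noise, or a wrong assumption about the sensor.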
8. Measuring the execution time of tasks can keep some bugs away. If a task that should have taken, say, 30 usec actually burned more like a msec, that indicates either a bug or a lack of understanding about the environment. Either is a problem.
9. Fill unused memory with single byte or one-word software interrupts. Crashes often cause the code to wander off; the interrupt, plus exception handler, can capture the crash and take action.
An alternative: fill unused memory with NOPs, and put a jump to the diagnostic routine at the end of memory.
10. Have built-in tests, and implement loop-back checking wherever possible. Doing a comm link? Why not have a software-controlled loop-back that lets the code test the hardware?
11. Set all unused interrupt vectors to point to an exception handler. All sorts of problems (hardware glitch, misprogrammed vector or peripheral, software crash) can create an interrupt one did not expect. This is an easy and cheap way to capture such an event.
12. Always think through possible overflows in arithmetic calculations, especially with integer calcs. Adding two large positive numbers can overflow, resulting in a negative. C, of course, is perfectly happy to return this meaningless computation. One remedy is a set of assembly macros that return, for a 16-bit computation, 32767 when there’s an overflow. The number is not a correct result, but it’s more correct than a negative value.
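The same saturating idea can be sketched in portable C rather than assembly. This is a swapped-in equivalent, not the macros mentioned above; sat_add16 is a hypothetical name, and the sketch simply does the sum in a wider type and clamps it.

```c
#include <stdint.h>

/* Adds two 16-bit signed values, clamping the result to the
   representable range instead of letting it wrap around. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */

    if (sum > INT16_MAX) {
        return INT16_MAX;                    /* 32767 on positive overflow */
    }
    if (sum < INT16_MIN) {
        return INT16_MIN;                    /* -32768 on negative overflow */
    }
    return (int16_t)sum;
}
```

For example, sat_add16(30000, 10000) yields 32767 where a wrapped 16-bit result would have been negative.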
13. Put constants on the left side of a comparison. Mix up = and ==, and the compiler will report an error, because a constant cannot be assigned to.
14. Always look at the return values from functions, so the caller can tell if the function worked. Good point! For example, I *never* see malloc’s return value tested to see if the allocation actually worked. We just assume everything will be OK.
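A minimal sketch of the habit, using malloc() as the example. The helper name dup_string is made up for illustration; the point is only that the failure is detected and passed up to the caller instead of being assumed away.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of src, or NULL if src is NULL or the
   allocation fails. The caller must check the result and free() it. */
char *dup_string(const char *src)
{
    if (src == NULL) {
        return NULL;
    }
    size_t n = strlen(src) + 1;      /* include the terminating NUL */
    char *copy = malloc(n);
    if (copy == NULL) {
        return NULL;                 /* report the failure; never write through NULL */
    }
    memcpy(copy, src, n);
    return copy;
}
```

The caller then has to test the returned pointer in turn, which is exactly the discipline this item asks for.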
15. Check out http://citeseer.nj.nec.com/maxion98improving.html; it’s quite thought-provoking.
16. Use assert()s. The assert() macro is a powerful way to detect all sorts of errors. A step further: create an “ensure” routine like:
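#include <setjmp.h>

void Ensure(bool_t condition, jmp_buf jb)
{
    if (condition == FALSE) {
        MarkInternalError();   /* record that an internal check failed */
        longjmp(jb, -1);       /* unwind to the matching setjmp() */
    }
}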
This is preferred over ASSERT(), since ASSERT() can be switched off. The jmp_buf is a local variable instead of a global one, because you can then specify more than one return point in the code.
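Here is one way Ensure() might be used with a local jmp_buf. The definitions of bool_t, FALSE and MarkInternalError() are stand-ins assumed for this sketch, and process_sample() with its limits is a hypothetical caller.

```c
#include <setjmp.h>

typedef int bool_t;            /* assumed project typedef */
#define FALSE 0

static int internal_errors;    /* hypothetical error log: just a counter */
static void MarkInternalError(void) { internal_errors++; }

void Ensure(bool_t condition, jmp_buf jb)
{
    if (condition == FALSE) {
        MarkInternalError();
        longjmp(jb, -1);
    }
}

/* Returns 0 on success, -1 if any Ensure() check failed. */
int process_sample(int sample)
{
    jmp_buf recover;           /* local, so this function owns its return point */

    if (setjmp(recover) != 0) {
        return -1;             /* reached via longjmp() from Ensure() */
    }
    Ensure(sample >= 0, recover);
    Ensure(sample < 1024, recover);
    return 0;
}
```

Because the jmp_buf lives in the caller, another function can set up its own recovery point without interfering with this one.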
17. It’s important to check the stack size. The traditional method is to fill each stack with 0x55AA, run the program for a while, and then use a debugger to see how many of these are left. Max has an alternative: create a chunk of code you examine in each ISR for stack growth, something like: U16_t get_stack_bytes_left() { _asm mov ax, sp; }. (That’s clearly x86 and perhaps compiler-specific, but you get the idea). This really appeals to me: it costs a few microseconds and some memory, but can be left in production versions to monitor for “unexpected events”.
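The traditional fill-and-count method can be sketched in C as below, so the check can run in the product instead of in a debugger. The function names are hypothetical, and the sketch assumes a stack that grows downward toward the lowest address of its region; on a real target the region would come from the linker map or the RTOS.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0x55AAu

/* Paint the whole stack region before the task starts running. */
void stack_paint(uint16_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count the words never overwritten, scanning up from the lowest
   address. Assumes the stack grows down toward stack[0]. */
size_t stack_words_unused(const uint16_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A result near zero means the stack has almost certainly been too small at some point, which is worth logging even in production.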
18. Never use checksums; CRCs are better and almost as easy to code.
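To show how little code a CRC takes, here is a bitwise sketch of one common variant, CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF). The choice of variant is an assumption, as are the function names; a simple additive checksum is included only for comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF,
   no bit reflection, no final XOR. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Simple 8-bit additive checksum, shown only for comparison. */
uint8_t sum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}
```

Swapping two bytes of a message leaves the additive checksum unchanged, while the CRC changes, which is one concrete reason to prefer it.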
19. Have the software confirm hardware actions. Example: after closing a relay, use an A/D to monitor something to be sure the contacts are really closed.
20. Back up data into nonvolatile memory? Use a CRC or duplicate memory area to guarantee the data is not corrupt.
21. This is about managing data types. By placing a type as the first field in the struct, you have self-describing data. This has a few implications:
- If you send the struct as a message, you know how to crack the rest of the message – i.e., it is an explicit message type. I had a gig where we took the RPC tool that lets you create byte-order independent structs and modified it to automatically create these structs with the types auto-generated.
- If you have the case:
typedef struct
{
    UINT16 type;
} S;

union
{
    S s;
    FOO foo;
    BAR bar;
    BLETCH bletch;
    UINT8 data[ MAX_DATA_SIZE ];
} u;
- When you read in the message data (or pick it up from some source), all you need to do is read the first 2 bytes into u.data[] (byte order is assumed known), or the whole message if there is a size associated with the read. Then your code can look like:
switch( u.s.type )
{
    case TYPE_FOO:
        … crack u.foo …
        break;
    case TYPE_BAR:
        … crack u.bar …
        break;
    case TYPE_BLETCH:
        … crack u.bletch …
        break;
}
- If you want to have “persistent objects” where the object is your data, then it is easy to walk the data structs, with one function pair (write/read) per struct, and using some form of non-volatile storage (e.g., a text file) for persistence.
- Buffer overrun is always an issue. A simple way to deal with this is to create:
typedef struct
{
    UINT16 type;   // TYPE_STR
    int len;       // number of bytes in the buffer
    int used;      // needed if a data buff and not ASCII string
    UINT8 *buf;    // pointer to a fixed buffer
} STR;
- Create a buffer pool that *buf points to, and set the pointer when you “instantiate” the STR variable. You can even do bounds checking on *buf to make sure it points into the buffer pool. You would need to create a replacement set of functions for sprintf(), memcpy(), strlen(), strcpy(), etc. However, these are fairly simple to do.
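A rough sketch of the buffer pool and one bounded replacement function, built on the STR struct above. The pool size, the TYPE_STR value, and the function names (str_init, str_valid, str_copy_in) are all assumptions made for illustration, and the pool is a simple bump allocator with no free.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t UINT16;       /* assumed project typedefs */
typedef uint8_t  UINT8;
#define TYPE_STR  1u           /* hypothetical type tag */
#define POOL_SIZE 256          /* hypothetical pool size in bytes */

typedef struct {
    UINT16 type;               /* TYPE_STR */
    int len;                   /* number of bytes in the buffer */
    int used;                  /* bytes currently holding data */
    UINT8 *buf;                /* pointer into the fixed pool */
} STR;

static UINT8 pool[POOL_SIZE];
static int pool_next;          /* offset of the next free byte in the pool */

/* "Instantiate" a STR with len bytes from the pool; 0 on success, -1 if full. */
int str_init(STR *s, int len)
{
    if (s == NULL || len <= 0 || len > POOL_SIZE - pool_next) {
        return -1;
    }
    s->type = TYPE_STR;
    s->len = len;
    s->used = 0;
    s->buf = &pool[pool_next];
    pool_next += len;
    return 0;
}

/* Bounds check: the buffer must lie entirely inside the pool. */
int str_valid(const STR *s)
{
    if (s == NULL || s->type != TYPE_STR || s->buf == NULL) return 0;
    if (s->len <= 0 || s->len > POOL_SIZE) return 0;
    if (s->used < 0 || s->used > s->len) return 0;
    uintptr_t p = (uintptr_t)s->buf;
    uintptr_t lo = (uintptr_t)pool;
    return p >= lo && (p - lo) <= (uintptr_t)(POOL_SIZE - s->len);
}

/* Bounded replacement for memcpy(): copies n bytes into dst, or returns
   -1 without writing anything if the copy would overrun the buffer. */
int str_copy_in(STR *dst, const void *src, int n)
{
    if (!str_valid(dst) || src == NULL || n < 0 || n > dst->len) {
        return -1;
    }
    memcpy(dst->buf, src, (size_t)n);
    dst->used = n;
    return 0;
}
```

Replacements for strcpy() and sprintf() would follow the same pattern: validate the STR first, then refuse any write longer than len.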