How did this ever work: SPI and interrupts
Over the years, I have posted about things that fall into the category of “how did this ever work?” I think I need to make an actual WordPress Category on this blog for these items, and I will do so with this post.
Every embedded programming job I have had came with a bunch of pre-existing code that “worked.” There are always new features to be added, and bugs to be fixed, but overall things were already at an “it works, ship it” state.
And every embedded programming job I have had came with code that “works” but decided to not work later, leading me down a rabbit hole of code exploring to understand what it did and why it suddenly did not.
This was often followed by finding a problem in the code that made me wonder how it ever worked in the first place.
SPI and interrupts on the PIC24 processor
This post is not specifically about SPI and interrupts on the PIC24 processor. I just happened to run into this particular issue in that environment.
On the seven hardware systems I maintain code for, there are various devices such as ADC, DAC, EEPROM, attenuators, phase shifters, frequency synthesizers, etc. that are hooked up to the CPU using I/O lines, I2C, or an SPI bus.
In the code main loop, the firmware reads or writes to these devices as needed, such as sampling RF detectors or making adjustments to power control attenuators.
Recently, we were testing a pulse modulation mode (PWM) where the RF signal is turned on and off. When it is on, the power is measured (a detector read) inside an interrupt that is tied to the PWM signal. This code has been in place since before I joined 6+ years ago and, other than some bugs and enhancements along the way, “it works.”
However, this SPI code ended up being the cause of some odd problems that initially looked like I2C communications were failing. Our Windows-based host program that communicated over I2C would start having communication faults with the boards running firmware.
After a few days of Whac-A-Mole(tm) trying to rework things that could be problematic, I finally learned the root cause of the issue:
SPI and interrupts
The main loop was doing SPI operations (in our case, using the CCS PCD compiler and its API calls such as spi_init, spi_xfer, etc.). One of those SPI operations was for reading RF detectors. When PWM mode was enabled, an interrupt service routine would be enabled, the main loop RF detector reads would shut off, and we would begin reading the RF detectors inside the interrupt routine as each pulse happened.
When a SPI operation happens, the code may need to reconfigure the SPI hardware between MODE 0 and MODE 1 transfers, for example, or to change the baud rate.
Suppose the main code configured the SPI hardware for a specific mode and baud rate and then began doing some SPI transfers. If a pulse occurred partway through, execution would jump into the interrupt routine, which might have to reconfigure the hardware differently before reading the detectors. Upon completion, execution returned to the main loop, and the SPI hardware could now be in the wrong mode to complete that transaction.
Bad things would happen.
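To make that sequence concrete, here is a minimal host-side sketch in plain C. It is not the firmware and does not use the CCS API: spi_config_t, sim_spi_init, sim_pwm_isr, and sim_main_transaction are hypothetical names, the mode and baud values are made up, and the "hardware" is just a shared struct. It only models the ordering problem described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the SPI peripheral's current configuration. */
typedef struct {
    int  mode;  /* SPI mode, 0 or 1 in this sketch */
    long baud;  /* clock rate in Hz (illustrative values only) */
} spi_config_t;

static spi_config_t g_spi_hw;  /* the single, shared "hardware" state */

/* Reconfigure the simulated peripheral. */
static void sim_spi_init(int mode, long baud)
{
    assert(mode == 0 || mode == 1);
    g_spi_hw.mode = mode;
    g_spi_hw.baud = baud;
}

/* Simulated PWM ISR: sets the bus up for the detector and never restores it. */
static void sim_pwm_isr(void)
{
    sim_spi_init(1, 1000000L);
    /* ... the detector read would happen here ... */
}

/*
 * Simulated main-loop transaction that needs mode 0 for both bytes.
 * When interrupt_mid_transfer is true, the ISR fires between the bytes.
 * Returns true only if both bytes saw the configuration main asked for.
 */
bool sim_main_transaction(bool interrupt_mid_transfer)
{
    bool ok = true;

    sim_spi_init(0, 4000000L);
    ok = ok && (g_spi_hw.mode == 0);  /* first byte goes out */

    if (interrupt_mid_transfer) {
        sim_pwm_isr();                /* pulse arrives mid-transaction */
    }

    ok = ok && (g_spi_hw.mode == 0);  /* second byte goes out */
    return ok;
}
```

Calling sim_main_transaction(false) returns true, while sim_main_transaction(true) returns false, because the second byte is clocked with the ISR's settings instead of the main loop's.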
But why now?
The original code allowed pulsing at 100 Hz to 1000 Hz. This meant that 100 to 1000 times a second there was a chance that the SPI hardware could get messed up by the interrupt code. Yet, if we saw this happen, it was infrequent enough to go unnoticed.
At some point, I modified the code to support 10,000 Hz. This meant there were now 10,000 chances a second for the problem to happen.
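As a rough back-of-the-envelope model (my assumption, not a measurement): if the pulses are not synchronized to the main loop, the chance that any one pulse lands inside a main-loop SPI transaction is about the fraction of time the main loop spends in SPI transactions. The function name and the busy fraction below are hypothetical, and not every such hit necessarily corrupts a transfer.

```c
/*
 * Expected number of interrupts per second that arrive while the main
 * loop is in the middle of an SPI transaction.
 *
 * pulse_hz          - pulse (interrupt) rate in Hz
 * spi_busy_fraction - fraction of time (0.0 to 1.0) the main loop spends
 *                     inside SPI transactions; a hypothetical input
 */
double expected_collisions_per_sec(double pulse_hz, double spi_busy_fraction)
{
    return pulse_hz * spi_busy_fraction;
}
```

With a made-up busy fraction of 0.1% (0.001), this gives about 0.1 hits per second at 100 Hz, 1 per second at 1000 Hz, and 10 per second at 10,000 Hz. Whatever the real fraction is, the exposure scales linearly with the pulse rate.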
Over the years we had seen some issues at 10,000 Hz, including what we thought was RF interference causing communication problems (and solved through a hardware modification). Since this mode was rarely used, the true depth of the issue was never experienced.
“It works.”
Simple solutions…
A very simple solution was added which appears to have eliminated this issue completely. Some functions were added that could be called from the main loop code around any access to the SPI hardware. Here is a pseudo-code example of what I added:
volatile int g_spi_lock = 1; // 1 = SPI in use by main loop code, 0 = available.

void spi_claim (void)
{
    interrupts_disable (INT_EXT0); // Disable PWM interrupt.
    g_spi_lock = 1; // Flag SPI in use.
    interrupts_enable (INT_EXT0); // Re-enable PWM interrupt.
}

void spi_release (void)
{
    interrupts_disable (INT_EXT0); // Disable PWM interrupt.
    g_spi_lock = 0; // Flag SPI available.
    interrupts_enable (INT_EXT0); // Re-enable PWM interrupt.
}
Then, inside the interrupt service routine, that code could simply check that flag and skip reading at that pulse, knowing it would just catch the next one (I said this was the simple fix, not the best fix):
void pwm_isr (void)
{
    if (0 == g_spi_lock) // SPI is free.
    {
        // Do SPI stuff…
    }
}
Then all I had to do was add the claim and release around all the non-ISR SPI code…
spi_claim ();
output_high (chip_select_pin);
spi_init (xxx);
msb = spi_xfer (xxx);
lsb = spi_xfer (xxx);
output_low (chip_select_pin);
spi_release ();
“And just like that,” the problems all went away. We had been unaware of many of them, since some of the triggers, like doing an EEPROM read, were typically not done while a system was running with RF enabled and pulsing active. Once I learned what the problem was, I could recreate it within seconds just by doing various things that caused SPI activity while in PWM mode. ;-)
That was my quick-and-dirty fix, but you may know a better one. If you feel like sharing it in a comment, please do so.
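One small variation on the claim/release idea, offered as a sketch rather than what the firmware actually does: have the ISR count the pulses it skips, so the cost of the quick fix is at least visible. This is a host-side model in plain C. The sim_ prefixed functions, the counters, and the no-op interrupt stubs are all hypothetical stand-ins, not CCS calls.

```c
#include <stdbool.h>

static volatile int      g_spi_lock       = 0; /* 1 while main-loop code owns SPI */
static volatile unsigned g_isr_reads      = 0; /* pulses where the ISR used SPI */
static volatile unsigned g_skipped_pulses = 0; /* pulses skipped because SPI was busy */

/* Stand-ins for disabling/enabling the PWM interrupt; no-ops on a host PC. */
static void sim_irq_disable(void) {}
static void sim_irq_enable(void) {}

/* Main-loop side: mark the SPI bus as in use. */
void sim_spi_claim(void)
{
    sim_irq_disable();
    g_spi_lock = 1;
    sim_irq_enable();
}

/* Main-loop side: mark the SPI bus as available again. */
void sim_spi_release(void)
{
    sim_irq_disable();
    g_spi_lock = 0;
    sim_irq_enable();
}

/* Interrupt side: read the detector only when the bus is free. */
void sim_pwm_isr(void)
{
    if (0 == g_spi_lock) {
        g_isr_reads++;        /* the SPI detector read would go here */
    } else {
        g_skipped_pulses++;   /* record that this pulse was not measured */
    }
}

/* Accessors so main-loop code (or a test) can inspect the counters. */
unsigned sim_isr_reads(void)      { return g_isr_reads; }
unsigned sim_skipped_pulses(void) { return g_skipped_pulses; }
```

If the skipped count climbs quickly at high pulse rates, that would be a hint that the main loop holds the bus too long and that a sturdier fix is worth the effort.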
But the question remains…
How did this ever work?
Once the bug was understood, recreating it was simple. Yet, this code has been in use for years and “it worked.”
My next task is going to be reviewing all the other firmware projects and seeing if any of them do any type of SPI or I2C stuff from an interrupt that could mess up similar code happening from the main loop.
And I bet I find some.
In code that “just works” and has been working for years.
Until next time…








