My first computer in 1982 was a Commodore VIC-20. In 1982, this was the “first color computer for under $300” with a price of $299.99. It had 4K of memory and a 1mhz 6502 CPU (the same chip used in the Apple 2). Captain Kirk himself was the original spokesperson:
I wrote full programs in the 3583 bytes of memory that were available to BASIC on startup. One of these years, I hope to retrieve some of the VIC-20 programs I wrote off a cassette tape and see them again via a modern emulator. But I digress…
Today, few seem to care about generating optimized code. There was a time when programmers had to worry about every last byte of memory, and every last cycle of CPU time, depending on what they were doing. But today, with gigabytes of RAM and multi-core CPUs, none of this really matters for most programmers.
But what about something like an Arduino? With only 2K of RAM and 32K flash for program storage, optimization can matter for all but the simplest programs.
Recently, I found myself trying to create a Pac-Man game for Arduino using the TVout video library. Along the way I have been documenting my efforts, including much trial and error as I come up with better ways to do things. One of the simple things I recently addressed was a way to determine the opposite direction that a game character was moving. Consider these four directions:
#define UP 0
#define LEFT 1
#define DOWN 2
#define RIGHT 3
If the player were currently going DOWN (direction 2), the reverse would be UP (direction 0). In a previous post, I detailed several different ways this could be achieved: if/then, switch/case, array table lookup, and bit flipping.
I created some test code to see what the size difference was when compiled for the Arduino, and the differences were small.
So if I really wanted to save every last byte of flash space, I might go for the version that compiled to the smallest code. However, this was a video game, and speed might be more important in some situations. I decided to write a quick program to test the differences in speed for each approach. The results might surprise you.
Here are three simple functions:
// if/then
uint8_t getReverseDir_ifthen(uint8_t dir)
{
if (dir==UP)
{
return DOWN;
}
else if (dir==LEFT)
{
return RIGHT;
}
else if (dir==DOWN)
{
return UP;
}
else if (dir==RIGHT)
{
return LEFT;
}
else
{
return dir;
}
}
// switch/case
uint8_t getReverseDir_switchcase(uint8_t dir)
{
switch(dir)
{
case UP:
return DOWN;
case LEFT:
return RIGHT;
case DOWN:
return UP;
case RIGHT:
return LEFT;
default:
return dir;
}
}
// bit flipping
// 00-UP, 01-LEFT, 10-DOWN, 11-RIGHT
uint8_t getReverseDir_bitflip(uint8_t dir)
{
if (dir & (1<<1))
{
return dir & ~(1<<1);
}
else
{
return dir | (1<<1);
}
}
Each one takes a different approach to solving the same problem. Another solution could be done without writing a function: a lookup table.
// Directions are: 0-UP, 1-LEFT, 2-DOWN, and 3-RIGHT
const uint8_t revDir[DIRS] = { DOWN, RIGHT, UP, LEFT };
reverseDir = revDir(currentDir);
Above, if the current direction is RIGHT (3), it would get the 3rd entry in the table, which is LEFT. Simple.
So which would you choose? I’d normally argue that the most straight forward and easy to understand code should win if there isn’t any significant savings to doing it in a more convoluted way. Making something confusing to save 8 bytes might not be a good idea unless you really need those 8 bytes.
So which method would you choose? Before you answer, let’s throw one more thing in the mix. On the Arduino, constant variable like strings are stored in RAM. This means printing a long string is actually using up RAM to store that string:
Serial.println("This is a really long message that will take up much RAM.");
If you have a ton of prints in your code, you may run out of that 2K quite easily. This is due to the nature of the Harvard Architecture that the Arduino uses. Read this:
http://en.wikipedia.org/wiki/Harvard_architecture
Basically, RAM and program space are in separate memory spaces and cannot be accessed the same way. Normally, constant strings like the message above would be compiled in to code space, and when it came time to use that string, it would be retrieved from code space. But, this is not how Arduino works, so strings all go in to RAM.
To get around this, there are extensions that allow storing strings and other constant values in flash, and then retrieving them through special access calls. A nice macro exists that puts a string in flash: F()
All you have to do is wrap your string with that macro, and then it goes in flash:
Serial.println(F("This is a really long message that will take up flash."));
Here is an example, with a bit of code to show free RAM:
void setup()
{
Serial.begin(9600);
Serial.println("This is a really long message. I wonder where it goes?");
Serial.println(freeRam());
}
void loop()
{
}
unsigned int freeRam()
{
extern int __heap_start, *__brkval;
int v;
return (int) &v - (__brkval == 0 ? (int) &__heap_start : (int) __brkval);
}
This compiles to 2304 bytes, and when it runs, it prints the following:
This is a really long message. I wonder where it goes?
1783
Now, by putting F() around the string, it compiles to 2432 bytes and prints:
This is a really long message. I wonder where it goes?
1839
Just using the F() macro saved 56 bytes of RAM! But, the code size grew by 128, so there is some overhead.
You can specify constant data to be stored in flash by using the keyword PROGMEM like this:
PROGMEM const char msg[] = "This is a really long message. I wonder where it goes?";
PROGMEM const uint8_t revDir[DIRS] = { DOWN, RIGHT, UP, LEFT };
Do you see where I am going with this? The small directional table of DOWN, RIGHT, UP and LEFT was being stored in RAM. Those four bytes may not matter, but what if it were a larger table? By storing it in flash (since it will never change), that RAM can be saved. BUT, you cannot directly access flash storage. You have to use special functions. You can find them defined in this header file:
http://www.nongnu.org/avr-libc/user-manual/group__avr__pgmspace.html
If the table is in RAM, you can access it like you would expect:
#define UP 0
#define LEFT 1
#define DOWN 2
#define RIGHT 3
#define DIRS 4
const uint8_t revDir[DIRS] = { DOWN, RIGHT, UP, LEFT };
void setup()
{
Serial.begin(9600);
for (int i=0; i<DIRS; i++)
{
Serial.print(i);
Serial.print(" - ");
Serial.println(revDir[i]);
}
Serial.println(freeRam());
}
But if they are stored in flash, you have to get to the uint8_t bytes using “pgm_read_byte_near()” and give it the address of the byte you want to read. The change looks like this:
#define UP 0
#define LEFT 1
#define DOWN 2
#define RIGHT 3
#define DIRS 4
PROGMEM const uint8_t revDir[DIRS] = { DOWN, RIGHT, UP, LEFT };
void setup()
{
Serial.begin(9600);
for (int i=0; i<DIRS; i++)
{
Serial.print(i);
Serial.print(" - ");
Serial.println(pgm_read_byte_near(&revDir[i]));
}
Serial.println(freeRam());
}
“PROGMEM” is added before the array, and “pgm_read_byte()” is used (note the “&” to get the address) to get the actual byte.
If your program is small enough that RAM isn’t an issue, there’s really no reason to bother with this. But, if RAM is running out, perhaps putting everything you can in flash is the way to go.
Or is it?
There is a speed penalty to accessing data stored in flash. Getting back to the start of this article, I was trying to test several ways of doing the reverse direction lookup to see which one was faster. During this, I created two versions of the lookup array test — one using RAM, and the other using PROGMEM storage. I was surprised at the difference in speed. Here is what I did.
I used the millis() call to get milliseconds since the program started. Since an individual check would be too quick to detect, I run each test MAX times (a #define number, like 10000). For the actual test, I query all four directions to get the reverse of them. Since the compiler is smart enough to know when something isn’t being used, I used the keyword “volatile” over my direction variable so the compiler leaves it alone and doesn’t try to optimize this “unused” variable out.
Here is the final test code, using three functions, and the array lookup done out of RAM and flash:
#define UP 0
#define LEFT 1
#define DOWN 2
#define RIGHT 3
#define DIRS 4
#define TESTS 5
#define MAX 10000
// Directions are: 0-UP, 1-LEFT, 2-DOWN, and 3-RIGHT
const uint8_t revDir[DIRS] = { DOWN, RIGHT, UP, LEFT };
// Same, but stored in flash.
PROGMEM const uint8_t revDirPROGMEM[DIRS] = { DOWN, RIGHT, UP, LEFT };
void setup()
{
unsigned long i;
uint8_t j;
volatile uint8_t dir; // volatile, so its not optimized out
unsigned long timeStart, timeEnd;
unsigned long timeTotal[TESTS];
unsigned long timeLongest;
Serial.begin(9600);
Serial.println();
Serial.println();
Serial.print("Running each test ");
Serial.print(MAX);
Serial.println(" times...");
Serial.println();
Serial.println("0. array test");
timeStart = micros();
for (i=0; i<MAX; i++)
{
for (j=0; j<DIRS; j++)
{
dir = revDir[j];
}
}
timeEnd = micros();
timeTotal[0] = calcTimeTaken(timeStart, timeEnd);
Serial.println();
delay(1000);
/*---------------------------------------------------------------------------*/
Serial.println("1. array test using PROGMEM");
timeStart = micros();
for (i=0; i<MAX; i++)
{
for (j=0; j<DIRS; j++)
{
dir = pgm_read_byte_near(&revDirPROGMEM[j]);
}
}
timeEnd = micros();
timeTotal[1] = calcTimeTaken(timeStart, timeEnd);
Serial.println();
delay(1000);
/*---------------------------------------------------------------------------*/
Serial.println("2. if/then test");
timeStart = micros();
for (i=0; i<MAX; i++)
{
for (j=0; j<DIRS; j++)
{
dir = getReverseDir_ifthen(j);
}
}
timeEnd = micros();
timeTotal[2] = calcTimeTaken(timeStart, timeEnd);
Serial.println();
delay(1000);
/*---------------------------------------------------------------------------*/
Serial.println("3. switch/case test");
timeStart = micros();
for (i=0; i<MAX; i++)
{
for (j=0; j<DIRS; j++)
{
dir = getReverseDir_switchcase(j);
}
}
timeEnd = micros();
timeTotal[3] = calcTimeTaken(timeStart, timeEnd);
Serial.println();
delay(1000);
/*---------------------------------------------------------------------------*/
Serial.println("4. bit flip test");
timeStart = micros();
for (i=0; i<MAX; i++)
{
for (j=0; j<DIRS; j++)
{
dir = getReverseDir_bitflip(j);
}
}
timeEnd = micros();
timeTotal[4] = calcTimeTaken(timeStart, timeEnd);
Serial.println();
delay(1000);
/*---------------------------------------------------------------------------*/
// Gather some results. First, find the longest time...
timeLongest = 0;
for (i=0; i<6; i++)
{
if (timeTotal[i] > timeLongest)
{
timeLongest = timeTotal[i];
}
}
Serial.print("Longest time was : ");
Serial.println(timeLongest);
// Now see what percentage each was...
for (i=0; i<TESTS; i++)
{
Serial.print(i);
Serial.print(". ");
Serial.print(timeTotal[i]);
Serial.print(" = ");
Serial.println((float)((float)timeTotal[i]/(float)timeLongest));
}
}
/*---------------------------------------------------------------------------*/
void loop()
{
}
// Time taken!
unsigned long calcTimeTaken(unsigned long timeStart, unsigned long timeEnd)
{
unsigned long timeTaken = timeEnd-timeStart;
Serial.print("Time was: ");
Serial.print(timeStart);
Serial.print(" to ");
Serial.print(timeEnd);
Serial.print(". Total: ");
Serial.print(timeTaken);
Serial.println();
return timeTaken;
}
// if/then
uint8_t getReverseDir_ifthen(uint8_t dir)
{
if (dir==UP)
{
return DOWN;
}
else if (dir==LEFT)
{
return RIGHT;
}
else if (dir==DOWN)
{
return UP;
}
else if (dir==RIGHT)
{
return LEFT;
}
else
{
return dir;
}
}
// switch/case
uint8_t getReverseDir_switchcase(uint8_t dir)
{
switch(dir)
{
case UP:
return DOWN;
case LEFT:
return RIGHT;
case DOWN:
return UP;
case RIGHT:
return LEFT;
default:
return dir;
}
}
// bit flip
// Bits: 00-UP, 01-LEFT, 10-DOWN, 11-RIGHT
uint8_t getReverseDir_bitflip(uint8_t dir)
{
if (dir & (1<<1))
{
return dir & ~(1<<1);
}
else
{
return dir | (1<<1);
}
}
This produced the following results:
Running each test 10000 times…
0. array test
Time was: 836 to 10324. Total: 94881. array test using PROGMEM
Time was: 1030552 to 1071568. Total: 410162. if/then test
Time was: 2073064 to 2105848. Total: 327843. switch/case test
Time was: 3107384 to 3150876. Total: 434924. bit flip test
Time was: 4152396 to 4192096. Total: 39700Longest time was : 43492
0. 9488 = 0.22
1. 41016 = 0.94
2. 32784 = 0.75
3. 43492 = 1.00
4. 39700 = 0.91
As you can see, the slowest version was switch/case. The bit flip test was used 91% as much CPU time (9% faster), and the if/then test used 75% as much CPU time (25% faster). Clearly, the brute force if/then appears to be the best function.
When it comes to just using a lookup table, that uses 22% as much CPU time (78% faster!) Clearly, that is the best approach. But, that four byte table is in RAM. Four bytes isn’t very much, but if it was, we’d want to store it in PROGMEM and access those bytes using “pgm_read_byte_near()”. Doing that was the second slowest approach taking up 94% as long (only 6% faster than the slowest).
So there you go — from the fastest approach to next to the slowest just by access PROGMEM.
I think I will be using four bytes of RAM for mine.
By the way, I did some extensive research in to this last year, and created the following macros I use on most of my big programs:
#ifdef USE_FLASH
#define FLASHMEM PROGMEM
#define FLASHSTR(x) (const __FlashStringHelper*)(x)
#define FLASHPTR(x) (const __FlashStringHelper*)pgm_read_word(&x)
#else
// If not using FLASH, no special casting or keywords.
#define FLASHMEM
#define FLASHSTR(x) (x) //(const char *)(x)
#define FLASHPTR(x) (x) //(const char *)(x)
#endif
Then, by using these macros, I can create code that will compile using RAM or flash and only have to change one thing (“#define USE_FLASH”) based on how I want it to compile.
Instead of using PROGMEM, I use FLASHMEM. If USE_FLASH is not defined, this will be empty and thus the variable will go in to RAM:
byte mac[] FLASHMEM = { 0x2A, 0xA0, 0xD8, 0xFC, 0x8B, 0xEF };
And for strings, instead of having to use the F() macro, I wrap all strings in FLASHSTR():
Serial.println(FLASHSTR("This may be in RAM, or may be in flash. Who knows!"));
Again, this is so if I decide not to compile for flash, I won’t have to go back and remove all the F() macros.
And lastly, for arrays of pointers:
// Store these strings in Flash to save RAM.
const char SEstr[] FLASHMEM = "SE";
const char NOPstr[] FLASHMEM = "NO";
const char DMstr[] FLASHMEM = "DM";
...
// Create an array of pointers to Flash strings, in Flash.
const char *telnetCmd[] FLASHMEM = // 240-255
{
SEstr, NOPstr, DMstr, ...
};
Serial.print(FLASHPTR(telnetCmd[...]));
I will have to do a full article on this sometime (if I haven’t already done so – I know it was included as part of my article on writing a fully functional telnet server for Arduino).
Enjoy.
Always measure — premature ‘optimization’ is the root of all evil! :-)
Is it bad of me to have spent so much time checking out something that only saves 20 bytes? :)
And by the way, I started doing this with BASIC. I used every trick I knew of to compress and speed up BASIC, and then ran it through Carl England’s BASIC compressor which did even more cool stuff. Fun times!
Pingback: Modern ports of CoCo games. | Sub-Etha Software