This came up at my day job when two programmers were trying to get a block of data to be the size both expected it to be. Consider this example:
typedef struct
{
    uint8_t  byte1;  // 1
    uint16_t word1;  // 2
    uint8_t  byte2;  // 1
    uint16_t word2;  // 2
    uint8_t  byte3;  // 1
                     // = 7 bytes
} MyStruct1;
The above structure represents three 8-bit byte values and two 16-bit word values for a total of 7 bytes.
However, if you were to compile this code with GCC for Windows and print the sizeof() of that structure, you would see it returns 10:
sizeof(MyStruct1) = 10
This is due to the compiler padding the structure so each multi-byte value starts on a 16-bit (two-byte) boundary.
The expected data storage in memory feels like it should be:
[byte1|word1|word1|byte2|word2|word2|byte3] = 7 bytes
But, using GCC on a Windows 10 machine shows that each value is stored on a 16-bit boundary, leaving unused padding bytes after the 8-bit values:
[byte1| xx |word1|word1|byte2| xx |word2|word2|byte3| xx ] = 10 bytes
As you can see, three extra bytes were added to the “blob” of memory that contains this structure. This is done so each multi-byte element starts on an even-byte address (0, 2, 4, etc.). Some processors require this, but if you were using one that allowed odd-byte access, you would likely get a sizeof() of 7.
Do not rely on processor architecture
To create portable C, you must not rely on how things happen to behave in your environment. The same code can and will produce different results in a different environment.
See also: sizeof() matters, where I demonstrated a simple example of using “int” and how it was quite different on a 16-bit Arduino versus a 32/64-bit PC.
Make it smaller
One easy way to reduce wasted memory in structures is to group the 8-bit values together. Using the earlier structure example, by simply changing the order of the values, we can reduce the amount of memory it uses:
typedef struct
{
    uint8_t  byte1;  // 1
    uint8_t  byte2;  // 1
    uint8_t  byte3;  // 1
    uint16_t word1;  // 2
    uint16_t word2;  // 2
                     // = 7 bytes
} MyStruct2;
On a Windows 10 GCC compiler, this will produce:
sizeof(MyStruct2) = 8
It is still not the 7 bytes we might expect, but at least the waste is less. In memory, it looks like this:
[byte1|byte2|byte3| xx |word1|word1|word2|word2] = 8 bytes
You can see an extra byte of padding being added after the third 8-bit value. Just out of curiosity, I moved the third byte to the end of the structure like this:
typedef struct
{
    uint8_t  byte1;  // 1
    uint8_t  byte2;  // 1
    uint16_t word1;  // 2
    uint16_t word2;  // 2
    uint8_t  byte3;  // 1
                     // = 7 bytes
} MyStruct3;
…but that also produced 8. I believe it is just adding an extra byte of padding at the end (which doesn’t seem necessary, but perhaps memory must be reserved on even byte boundaries and this just marks that byte as used so the next bit of memory would start after it).
[byte1|byte2|word1|word1|word2|word2|byte3| xx ] = 8 bytes
Because you cannot ensure how a structure ends up in memory without knowing how the compiler works, it is best not to rely on or expect a structure to be “packed” with all the bytes aligned as they appear in the code. You also cannot expect the memory usage to be just the values contained in the structure.
I do frequently see programmers attempt to massage the structure by adding in padding values, such as:
typedef struct
{
    uint8_t  byte1;     // 1
    uint8_t  padding1;  // 1
    uint16_t word1;     // 2
    uint8_t  byte2;     // 1
    uint8_t  padding2;  // 1
    uint16_t word2;     // 2
    uint8_t  byte3;     // 1
    uint8_t  padding3;  // 1
                        // = 10 bytes
} MyPaddedStruct1;
At least on a system that aligns values to 16-bits, the structure now matches what we actually get. But what if you used a processor where everything was aligned to 32-bits?
It is always best to not assume. Code written for an Arduino one day (with 16-bit integers) may be ported to a 32-bit Raspberry Pi Pico at some point, and not work as intended.
Here’s some sample code to try. You would have to change the printfs to Serial.println() and change how it prints the sizeof() values, but then you could see what it does on a 16-bit Arduino UNO versus a 32-bit PC or other system.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct
{
    uint8_t  byte1;  // 1
    uint16_t word1;  // 2
    uint8_t  byte2;  // 1
    uint16_t word2;  // 2
    uint8_t  byte3;  // 1
                     // = 7 bytes
} MyStruct1;

typedef struct
{
    uint8_t  byte1;  // 1
    uint8_t  byte2;  // 1
    uint8_t  byte3;  // 1
    uint16_t word1;  // 2
    uint16_t word2;  // 2
                     // = 7 bytes
} MyStruct2;

typedef struct
{
    uint8_t  byte1;  // 1
    uint8_t  byte2;  // 1
    uint16_t word1;  // 2
    uint16_t word2;  // 2
    uint8_t  byte3;  // 1
                     // = 7 bytes
} MyStruct3;

int main()
{
    printf ("sizeof(MyStruct1) = %u\n", (unsigned int)sizeof(MyStruct1));
    printf ("sizeof(MyStruct2) = %u\n", (unsigned int)sizeof(MyStruct2));
    printf ("sizeof(MyStruct3) = %u\n", (unsigned int)sizeof(MyStruct3));

    return EXIT_SUCCESS;
}
Until next time…
The padding at the end is to ensure an array of structures will have all elements aligned properly since offsets in arrays are basically “arraypointer + index * sizeof(element)”. (Calculating the offset is more efficient for some sizes as well.) If it’s not obvious why that needs to be the case, consider arrays allocated with malloc(). They have the same alignment requirements but malloc() has no idea that it’s allocating an array of structures. It just gets a number of bytes, which will be wrong if the alignment padding isn’t included in sizeof.
Memory allocation itself may also benefit from certain sizes, but sizeof and structs don’t need to care about that. Only the allocator needs to care about that.
I suppose it is similar to how malloc() appears to work on even bytes, so it always rounds allocations to the next even byte?
I’m not actually sure why the C compiler adds the padding, and I may be wrong on some of what I’m about to say, but this is my understanding. Some CPUs can retrieve a word (2 bytes) in one “cycle” as long as it’s from an even memory address. If it’s from an odd memory address, the CPU must retrieve the first half (the byte before it and the first byte), then the second half (the second byte and the byte after) and piece the parts together. This is a lot more inefficient (requires more CPU cycles), so it could be that the compilers try to align multi-byte data (such as words) on even addresses for performance gains.
We have a 16-bit C compiler at work that supports int32s, floats and doubles. It puts global int8s side by side, and we found a bug that happened when accessing a value from an ISR at the same time another side-by-side value was being modified. It would corrupt. I seem to recall that the architecture allowed direct access in lower RAM, but after a certain point it had to generate extra code to retrieve, modify and store the variable, so it only happened to variables that existed there. A weird bug to track down and figure out, but I found it.