Followers of my ramblings know that I enjoy benchmarking BASIC. It is interesting to see how minor changes can produce major speed differences. And while BASIC is supposedly “completely predictable,” there are so many items that must be considered — line numbers, line length, number of variables, amount of strings, etc. — you can’t really look at any bit of code and know how fast it will run unless it’s something self contained like this:
10 FOR A=1 TO 1000 20 NEXT
Beyond things like that, there’s a lot of trail-and-error needed. Code like this:
... 100 FOR A=1 TO 100 110 Z=Z+1 120 NEXT ...
…can have dramatically different speeds depending on how many other variables there are, and where Z was declared in the list of them.
6809 assembly language is far more predictable. Every machine language instruction has a known amount of CPU cycles it takes to operate — based on variations of that code. For example, loading register A with a value:
…should be 100% predictable simply by looking up the cycle counts for “load A” in some 6809 reference guide. The Motorola data sheet for the 6809 tells me that “LDA” takes 2 cycles.
So, really, there’s no benchmarking needed. You just have to look up the instructions (and their specific type — direct, indirect, etc.) and add up some numbers.
But where’s the fun in that?
William “Lost Wizard” Astle‘s LWTOOLS provides us with the lwasm assembler for Mac, Windows, Linux, etc. One of its features is cycle counting. It can generate a list of all the machine language bytes that the assembly source code turns in to and, optionally, include the cycle count for each line of assembly code.
I just learned about this and had to experiment…
Here is a simple assembly loop that clears the 32-column screen. I’ll add some comments that explain what it does, as if it were BASIC…
clear lda #96 * A=96 (green space) clearwitha ldx #1024 * X=1024 (top left of screen) loop sta ,x+ * POKE X,A:X=X+1 cmpx #1536 * Compare X to 1536 bne loop * If X<>1536, GOTO loop rts * RETURN
To clear the screen to spaces (character 96), it is called with:
To clear the screen with a different value, such as 128 for black, it can be called like this:
lda #128 bsr clearwitha
LWASM is able to tell me how many CPU cycles each instruction will take. To generate this, you have to include a special pragma command in the source code, or pass it in on the command line. In source code, it is done by using the special “opt” keyword followed by the pragma. The ones we are interested in are listed in the manual:
opt c - enable cycle counts:  opt cd - enable detailed cycle counts breaking down addressing modes: [5+3] opt ct - show a running subtotal of cycles opt cc - clear the running subtotal
Adding ” opt c” at the top of the source code will enable it, and then you would use the “-l” command line option to generate the list file which will not contain cycle counts. (You can also send the list output to a file using -lfilename if you prefer.)
You can also pass in this pragma using a command line “–pragma=c”, like this:
lwasm clear.asm -fbasic -oclear.bas --pragma=c -l
Above, I am assembling the program in to a BASIC loader which I can load from BASIC, and then RUN to load the machine language program in to memory. Here is what that command displays for me:
allenh@alsmacbookpro asm % lwasm clear.asm -fbasic -oclear.bas --pragma=c -l 0000 ( clear.asm):00001 clear 0000 8660 ( clear.asm):00002 (2) lda #96 * A=96 (green space) 0002 ( clear.asm):00003 clearwitha 0002 8E0400 ( clear.asm):00004 (3) ldx #1024 * X=1024 (top left of screen) 0005 ( clear.asm):00005 loop 0005 A780 ( clear.asm):00006 (5) sta ,x+ * POKE X,A:X=X+1 0007 8C0600 ( clear.asm):00007 (3) cmpx #1536 * Compare X to 1536 000A 26F9 ( clear.asm):00008 (3) bne loop * If X<>1536, GOTO loop 000C 39 ( clear.asm):00009 (4) rts * RETURN ( clear.asm):00010 ( clear.asm):00011 END
That’s a bit too wide of a listing for my comfort, so from now on I’ll just include the right portion of it — starting with the (number) in parenthesis. That is the cycle count. If you look at the “lda #96” line you will see it confirms “lda” takes two cycles:
(2) lda #96 * A=96 (green space)
Another pragma of interest is one that will start counting the total number of cycles code takes.
opt ct - show a running subtotal of cycles
If you just turned it on, it would be not be very useful since it would just be adding up all the instructions from top to bottom of the source, not taking in to consideration branching or subroutines or loops. But, we can clear that counter and start it at any point by including the “opt” keyword in the code around routines we are interested in.
opt cc - clear the running subtotal
And we can turn them off by putting “no” in front:
opt noct - STOP showing a running subtotal of cycles
In the case of my clear.asm program, I would want to clear the counter and turn it on right at the start of loop, and turn it off at the end of the loop. This would show me a running count of how many cycles that loop takes:
clear (2) lda #96 clearA (3) ldx #1024 opt ct,cc loop (5) 5 sta ,x+ (3) 8 cmpx #1536 (3) 11 bne loop opt noct (4) rts
The numbers to the right of the (cycle) count numbers are the sum of all instructions from the moment the counter was enabled.
The code from “loop” to “bne loop” takes 11 cycles. Since each loop sets one byte on the screen, and since there are 512 bytes on the screen, clearing the screen this way will take 11 * 512 = 5632 cycles (plus a few extra before the loop, setting up X and A).
Instead of clearing the screen 8-bits at a time, I learned that using a 16-bit register would be faster. I changed the code to use a 16-bit D register instead of the 8-bit A register, like this:
clear16 lda #96 * A=96 tfr a,b * B=A (D=A*256+B) clearA16 ldx #1024 * X=1024 (top left of screen) opt ct,cc * Clear counter, turn it on. loop16 std ,x++ * POKE X,A:POKE X+1,B:X=X+1 cmpx #1536 * Compare X to 1536. bne loop16 * If X<>1536, GOTO loop16 opt noct * Turn off counter. rts * RETURN
Since 16-bit register D is made up of 8-bit registers A and B, I simply transfer whatever is in A to B and that makes both bytes of D the same as A. Then in the loop, I store D at X, and increment it by 2 (to get to the next two bytes). Looking at cycles again…
clear16 (2) lda #96 (4) tfr a,b clearA16 (3) ldx #1024 opt ct,cc loop16 (7) 7 std ,x++ (3) 10 cmpx #1536 (3) 13 bne loop16
The code from “loop16” to “bne loop16” takes 13 cycles, which is longer than the original. But, each loop does two bytes instead of one. Instead of needing 512 times through the loop, it only needs 256. 13 * 256 = 3328 cycles. Progress!
And, if we can do 16-bits at a time, why not 32? Currently, D is the value to store, and X is where to store it. We could just store D twice in a row…
clear32 lda #96 * A=96 tfr a,b * B=A (D=A*256+B) clearA32 ldx #1024 * X=1024 (top left of screen) opt ct,cc * Clear counter, turn it on. loop32 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 cmpx #1536 * Compare X to 1536. bne loop32 * If X<>1536, GOTO loop32 opt noct * Turn off counter. rts * RETURN
Let’s see what that does…
clear32 (2) lda #96 * A=96 (4) tfr a,b * B=A (D=A*256+B) clearA32 (3) ldx #1024 * X=1024 (top left of screen) opt ct,cc * Clear counter, turn it on. loop32 (7) 7 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (7) 14 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (3) 17 cmpx #1536 * Compare X to 1536. (3) 20 bne loop32 * If X<>1536, GOTO loop32 opt noct * Turn off counter. (4) rts * RETURN
Above, the “loop32” to “bne loop32” takes 20 cycles. Each loop does four bytes, so only 128 times through the loop to clear all 512 bytes of the screen. 20 * 128 = 2560 cycles. More than double the speed of the original one byte version.
We could do 48-bits at a time by storing three times, but that math doesn’t work out since 512 is not divisible by 6 (I get 85.33333333). Perhaps we could do the loop 85 times to clear the first 510 bytes (6 * 85 = 510), then manually do one last 16-bit store to complete it. Maybe like this:
clear48 lda #96 * A=96 tfr a,b * B=A (D=A*256+B) clearA48 ldx #1024 * X=1024 (top left of screen) opt ct,cc * Clear counter, turn it on. loop48 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 cmpx #1536 * Compare X to 1536. bne loop48 * If X<>1536, GOTO loop32 opt noct * Turn off counter. std ,x * POKE X,A:POKE X+1,B:X=X+2 rts * RETURN
And LWASM shows me:
clear48 (2) lda #96 * A=96 (4) tfr a,b * B=A (D=A*256+B) clearA48 (3) ldx #1024 * X=1024 (top left of screen) opt ct,cc * Clear counter, turn it on. loop48 (7) 7 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (7) 14 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (7) 21 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (3) 24 cmpx #1536 * Compare X to 1536. (3) 27 bne loop48 * If X<>1536, GOTO loop32 opt noct * Turn off counter. (5) std ,x * POKE X,A:POKE X+1,B:X=X+2 (4) rts * RETURN
We have jumped to 27 cycles per loop. Each loop stores 6 bytes, and it takes 85 times to get 510 bytes, plus 5 extra after it is over for the last two bytes. 27 * 85 = 2295 cycles + 5 = 2300 cycles! We are still moving in the right direction.
Just for fun, what if we did four stores, 8 bytes at a time?
clear64 lda #96 * A=96 tfr a,b * B=A (D=A*256+B) clearA64 ldx #1024 * X=1024 (top left of screen) opt ct,cc * Clear counter, turn it on. loop64 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 cmpx #1536 * Compare X to 1536. bne loop64 * If X<>1536, GOTO loop32 opt noct * Turn off counter. rts * RETURN
And that gives us:
clear64 (2) lda #96 * A=96 (4) tfr a,b * B=A (D=A*256+B) clearA64 (3) ldx #1024 * X=1024 (top left of screen) opt ct,cc * Clear counter, turn it on. loop64 (7) 7 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (7) 14 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (7) 21 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (7) 28 std ,x++ * POKE X,A:POKE X+1,B:X=X+2 (3) 31 cmpx #1536 * Compare X to 1536. (3) 34 bne loop64 * If X<>1536, GOTO loop32 opt noct * Turn off counter. (4) rts * RETURN
34 cycles stores 8 bytes. 64 times through the loop to do all 512 screen bytes, so 64 * 34 = 2176 cycles.
By now, I think you can see where this is going. I believe this is called “loop unrolling”, since, if you wanted the fewest cycles, you could just code 256 “std ,x++” in a row (7 * 256) for 1792 cycles which would be fast but bulky code (each std ,x++ takes two bytes, so 512 bytes just for this copy routine).
There is always some balance between code size and speed. Larger programs took longer to load from tape or disk. But, if you didn’t mind load time, and you had extra memory available, tricks like this could really speed things up.
I have also read about “stack blasting” where you load values in to registers and then, instead of storing each register, you set a stack pointer to the destination and just push the registers on the stack. I’ve never done that before. Let’s see if we can figure it out.
There are two stacks in the 6809 — one is the normal one used by the program (SP, I believe is the register?), and the other is the User Stack (register U). If we aren’t using it for a stack, we can use it as a 16-bit register, too.
The stack grows “up”, so if the stack pointer is 5000, and you push an 8-bit register, the pointer will move to 4999 (pointing to the most recent register pushed). If you then push a 16-bit register, it will move to 4997. This means it will have to work in reverse from our previous examples. By pointing the stack register to the end of the screen, we should be able to push registers on to the stack causing it to grow “up” to the top of the screen.
At first glance, it doesn’t look promising, since pushing D on to the user stack (U) takes more cycles than storing D at U:
(5) std ,u (6) pshu d
But, it seems we make that up when pushing multiple registers since the cycle count does not grow as much as multiple stores do:
(7) std ,U++ (7) stx ,U++ (8) sty ,U++ (10) pshu d,x,y
I also I see that STY is one cycle longer than STD or STX. This tells me to maybe avoid using Y like this…?
It looks good, though. 22 cycles compared to 10 seems quite the win. Let me see if I can do a clear routine using the User stack pointer and three 16-bit registers. We’ll compare this to the 48-bit clear shown earlier.
clear48s lda #96 * A=96 clearA48s tfr a,b * B=A (D=A*256+B) tfr d,x * X=D tfr d,y * Y=D ldu #1536 * U=1536 (1 past end of screen) opt ct,cc * Clear counter, turn it on. loop48s pshu d,x,y cmpu #1026 * Compare U to 1026 (two bytes from start). bgt loop48s * If X<>1026, GOTO loop48s. opt noct * Turn off counter. pshu d * Final 2 bytes. rts * RETURN
And the results are…
clear48s (2) lda #96 * A=96 clearA48s (4) tfr a,b * B=A (D=A*256+B) (4) tfr d,x * X=D (4) tfr d,y * Y=D (3) ldu #1536 * U=1536 (1 past end of screen) opt ct,cc * Clear counter, turn it on. loop48s (10) 10 pshu d,x,y (4) 14 cmpu #1026 * Compare U to 1026 (two bytes from start). (3) 17 bgt loop48s * If X>1026, GOTO loop48s opt noct * Turn off counter. (6) pshu d * Final 2 bytes. (4) rts * RETURN
From “loop48s” to “bgt loop48s” we end up with 17 cycles compared to 27 using the std method. 85 * 17 = 1445 cycles + 6 final cycles = 1551 cycles. It looks like using stack push/pulls might be a real nice way to do this type of thing, provided the user stack is available, of course.
Side Note: here is a fantastic writeup of this and the techniques on the 6809, as used in some unnamed CoCo 3 game back in the day: https://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html
The fastest way to zero
But wait! There’s more…
When setting a register to zero, I have been told to use “CLR” instead of “LDx #0”. Let’s see what that is all about…
(2) lda #0 (1) clra (3) ldd #0 (2) clrd
Ah, now know a CLRA is twice as fast as LDA #0, and CLRD is one cycle faster than LDD #0. Nice.
Other 16-bit registers such as X, Y, and U do not have a CLR op code, so LDx will be have to be used there, I suppose.
I then wondered if it made more sense to CLR a memory location, or clear a register then store that register there.
(6) clr 1024 (1) clra (4) sta 1024
It appears in this case, it is less cycles to clear a register then store it in memory. Interesting. And using a 16-bit value:
(3) ldd #0 (5) std 1024
That is one cycle faster than doing a “clra / sta 1024 / sta 1025” it seems. It is also one byte less in size, so win win.
There is a lot to learn here, and from these experiments, I’m already seeing some things are not like I would have guessed.
I hope this inspires you to play with these LWASM options and see what your code is doing. During the writing of this article, I learned how to use that User Stack, and I expect that will come in handy if I decide to do any updates to my Invaders09 game some day…
Until next time…
Those cycle counts look like they’re 6309 native mode counts. You’ll want to give lwasm the –6809 option to make sure it’s counting 6809 counts. CLRA should be 2 cycles for the 6809, the same as LDA #0, but has the advantage of being a byte shorter. For D, LDD #0 is faster than a CLRA/CLRB sequence (3 cycles vs 4 on a 6809) but it’s a byte longer as well. CLRD is 6309 only.
Also, for the unrolled loop, it may be faster to do something like:
There’s a point where you start saving cycles by skipping the double increment using constant offsets and then adjusting the pointer in one go.
Does it not default to building 6809? Have I been getting lucky just using my default command line, running on XRoar?
Another thing to keep in mind with an instruction like CLRA is that the CARRY flag will also be cleared whereas LDA #0 does not affect the flag. It’s not usually an issue but there are some cases where I have found this to be important (such as multi-precision arithmetic or when the flag has been set to indicate an error condition).
Regarding CLR on memory locations: It wondered me all the time why it take the same behavior like other read-modify-write opcodes like ASL, COM, INC, DEC and so on. I think it was a “cheap” implementation. I guess the CPU does really a Read, Set-value-to-zero, Write sequence. The CLR operation doesn’t seem to be conceptually designed as a pseudo-register (containing 0) write.