Counting 6809 cycles with LWASM

Followers of my ramblings know that I enjoy benchmarking BASIC. It is interesting to see how minor changes can produce major speed differences. And while BASIC is supposedly “completely predictable,” there are so many items that must be considered (line numbers, line length, number of variables, amount of strings, etc.) that you can’t really look at any bit of code and know how fast it will run unless it’s something self-contained like this:

10 FOR A=1 TO 1000
20 NEXT

Beyond things like that, there’s a lot of trial-and-error needed. Code like this:

...
100 FOR A=1 TO 100
110 Z=Z+1
120 NEXT
...

…can have dramatically different speeds depending on how many other variables there are, and where Z was declared in the list of them.

6809 assembly language is far more predictable. Every machine language instruction takes a known number of CPU cycles, which varies only with the form of the instruction (its addressing mode). For example, loading register A with a value:

lda #42

…should be 100% predictable simply by looking up the cycle counts for “load A” in some 6809 reference guide. The Motorola data sheet for the 6809 tells me that “LDA” takes 2 cycles.

So, really, there’s no benchmarking needed. You just have to look up the instructions (and their specific type — direct, indirect, etc.) and add up some numbers.

But where’s the fun in that?

William “Lost Wizard” Astle’s LWTOOLS provides us with the lwasm assembler for Mac, Windows, Linux, etc. One of its features is cycle counting. It can generate a listing of all the machine language bytes that the assembly source code turns in to and, optionally, include the cycle count for each line of assembly code.

I just learned about this and had to experiment…

Here is a simple assembly loop that clears the 32-column screen. I’ll add some comments that explain what it does, as if it were BASIC…

clear
    lda #96     * A=96 (green space)
clearwitha
    ldx #1024   * X=1024 (top left of screen)
loop
    sta ,x+     * POKE X,A:X=X+1
    cmpx #1536  * Compare X to 1536 
    bne loop    * If X<>1536, GOTO loop
    rts         * RETURN

To clear the screen to spaces (character 96), it is called with:

 bsr clear

To clear the screen with a different value, such as 128 for black, it can be called like this:

 lda #128
 bsr clearwitha

LWASM is able to tell me how many CPU cycles each instruction will take. To generate this, you have to include a special pragma command in the source code, or pass it in on the command line. In source code, it is done by using the special “opt” keyword followed by the pragma. The ones we are interested in are listed in the manual:

opt c  - enable cycle counts: [8]
opt cd - enable detailed cycle counts breaking down addressing modes: [5+3]
opt ct - show a running subtotal of cycles
opt cc - clear the running subtotal

Adding “ opt c” at the top of the source code will enable it, and then you would use the “-l” command line option to generate the list file, which will now contain cycle counts. (You can also send the list output to a file using -lfilename if you prefer.)

You can also pass in this pragma on the command line using “--pragma=c”, like this:

lwasm clear.asm -fbasic -oclear.bas --pragma=c -l

Above, I am assembling the program in to a BASIC loader which I can load from BASIC, and then RUN to load the machine language program in to memory. Here is what that command displays for me:

allenh@alsmacbookpro asm % lwasm clear.asm -fbasic -oclear.bas --pragma=c -l
0000                  (        clear.asm):00001         clear
0000 8660             (        clear.asm):00002 (2)         lda #96     * A=96 (green space)
0002                  (        clear.asm):00003         clearwitha
0002 8E0400           (        clear.asm):00004 (3)         ldx #1024   * X=1024 (top left of screen)
0005                  (        clear.asm):00005         loop
0005 A780             (        clear.asm):00006 (5)         sta ,x+     * POKE X,A:X=X+1
0007 8C0600           (        clear.asm):00007 (3)         cmpx #1536  * Compare X to 1536 
000A 26F9             (        clear.asm):00008 (3)         bne loop    * If X<>1536, GOTO loop
000C 39               (        clear.asm):00009 (4)         rts         * RETURN
                      (        clear.asm):00010         
                      (        clear.asm):00011             END

That’s a bit too wide of a listing for my comfort, so from now on I’ll just include the right portion of it, starting with the (number) in parentheses. That is the cycle count. If you look at the “lda #96” line you will see it confirms “lda” takes two cycles:

(2)         lda #96     * A=96 (green space)

Another pragma of interest is one that will start counting the total number of cycles code takes.

opt ct - show a running subtotal of cycles

If you just turned it on, it would not be very useful since it would just add up all the instructions from top to bottom of the source, not taking in to consideration branching or subroutines or loops. But, we can clear that counter and start it at any point by including the “opt” keyword in the code around routines we are interested in.

opt cc - clear the running subtotal

And we can turn them off by putting “no” in front:

opt noct - STOP showing a running subtotal of cycles

In the case of my clear.asm program, I would want to clear the counter and turn it on right at the start of loop, and turn it off at the end of the loop. This would show me a running count of how many cycles that loop takes:

        clear
(2)         lda #96
        clearA
(3)         ldx #1024
            opt ct,cc
                loop
(5)     5           sta ,x+
(3)     8           cmpx #1536
(3)     11          bne loop
                    opt noct
(4)         rts

The numbers to the right of the (cycle) count numbers are the sum of all instructions from the moment the counter was enabled.

The code from “loop” to “bne loop” takes 11 cycles. Since each loop sets one byte on the screen, and since there are 512 bytes on the screen, clearing the screen this way will take 11 * 512 = 5632 cycles (plus a few extra before the loop, setting up X and A).
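That arithmetic works the same way for any fixed-size loop. Here is a minimal sketch of it, using Python purely as a calculator. The function name `loop_cycles` is made up for illustration, and the few setup cycles before the loop are ignored:

```python
def loop_cycles(cycles_per_pass, bytes_per_pass, total_bytes=512):
    """Total cycles for a loop that writes bytes_per_pass bytes each pass."""
    passes, leftover = divmod(total_bytes, bytes_per_pass)
    if leftover:
        raise ValueError("total_bytes must divide evenly by bytes_per_pass")
    return cycles_per_pass * passes

# 8-bit loop from the listing above: 11 cycles per pass, 1 byte per pass.
print(loop_cycles(11, 1))  # 5632
```

The per-pass cycle count comes straight from the running subtotal that “opt ct” prints on the last line of the loop.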

Instead of clearing the screen 8-bits at a time, I learned that using a 16-bit register would be faster. I changed the code to use a 16-bit D register instead of the 8-bit A register, like this:

clear16
    lda #96     * A=96
    tfr a,b     * B=A (D=A*256+B)
clearA16
    ldx #1024   * X=1024 (top left of screen)
    opt ct,cc   * Clear counter, turn it on.
loop16
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    cmpx #1536  * Compare X to 1536.
    bne loop16  * If X<>1536, GOTO loop16
    opt noct    * Turn off counter.
    rts         * RETURN

Since 16-bit register D is made up of 8-bit registers A and B, I simply transfer whatever is in A to B and that makes both bytes of D the same as A. Then in the loop, I store D at X, and increment it by 2 (to get to the next two bytes). Looking at cycles again…

        clear16
(2)         lda #96
(4)         tfr a,b
        clearA16
(3)         ldx #1024
            opt ct,cc
                loop16
(7)     7           std ,x++
(3)     10          cmpx #1536
(3)     13          bne loop16

The code from “loop16” to “bne loop16” takes 13 cycles, which is longer than the original. But, each loop does two bytes instead of one. Instead of needing 512 times through the loop, it only needs 256. 13 * 256 = 3328 cycles. Progress!

And, if we can do 16-bits at a time, why not 32? Currently, D is the value to store, and X is where to store it. We could just store D twice in a row…

clear32
    lda #96     * A=96
    tfr a,b     * B=A (D=A*256+B)
clearA32
    ldx #1024   * X=1024 (top left of screen)
    opt ct,cc   * Clear counter, turn it on.
loop32
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    cmpx #1536  * Compare X to 1536.
    bne loop32  * If X<>1536, GOTO loop32
    opt noct    * Turn off counter.
    rts         * RETURN

Let’s see what that does…

        clear32
(2)         lda #96     * A=96
(4)         tfr a,b     * B=A (D=A*256+B)
        clearA32
(3)         ldx #1024   * X=1024 (top left of screen)
            opt ct,cc   * Clear counter, turn it on.
                loop32
(7)     7           std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(7)     14          std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(3)     17          cmpx #1536  * Compare X to 1536.
(3)     20          bne loop32  * If X<>1536, GOTO loop32
                    opt noct    * Turn off counter.
(4)         rts         * RETURN

Above, the “loop32” to “bne loop32” takes 20 cycles. Each loop does four bytes, so only 128 times through the loop to clear all 512 bytes of the screen. 20 * 128 = 2560 cycles. More than double the speed of the original one byte version.

We could do 48-bits at a time by storing three times, but that math doesn’t work out since 512 is not divisible by 6 (I get 85.33333333). Perhaps we could do the loop 85 times to clear the first 510 bytes (6 * 85 = 510), then manually do one last 16-bit store to complete it. Maybe like this:

clear48
    lda #96     * A=96
    tfr a,b     * B=A (D=A*256+B)
clearA48
    ldx #1024   * X=1024 (top left of screen)
    opt ct,cc   * Clear counter, turn it on.
loop48
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    cmpx #1534  * Compare X to 1534 (1024 + 85*6).
    bne loop48  * If X<>1534, GOTO loop48
    opt noct    * Turn off counter.
    std ,x      * POKE X,A:POKE X+1,B (final 2 bytes)
    rts         * RETURN

And LWASM shows me:

        clear48
(2)         lda #96     * A=96
(4)         tfr a,b     * B=A (D=A*256+B)
        clearA48
(3)         ldx #1024   * X=1024 (top left of screen)
            opt ct,cc   * Clear counter, turn it on.
                loop48
(7)     7           std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(7)     14          std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(7)     21          std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(3)     24          cmpx #1534  * Compare X to 1534 (1024 + 85*6).
(3)     27          bne loop48  * If X<>1534, GOTO loop48
                    opt noct    * Turn off counter.
(5)                 std ,x      * POKE X,A:POKE X+1,B (final 2 bytes)
(4)         rts         * RETURN

We have jumped to 27 cycles per loop. Each loop stores 6 bytes, and it takes 85 times to get 510 bytes, plus 5 extra after it is over for the last two bytes. 27 * 85 = 2295 cycles + 5 = 2300 cycles! We are still moving in the right direction.
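Because X advances by 6 on each pass, it is worth checking which addresses it actually lands on. The “bne” at the bottom of the loop only stops on an exact match, so the compare value has to be one of them. A small sketch (the helper name `x_after_each_pass` is hypothetical):

```python
def x_after_each_pass(step, passes, start=1024):
    """Value of X after each pass through a loop that advances X by step."""
    return [start + step * n for n in range(1, passes + 1)]

visited = x_after_each_pass(step=6, passes=86)
print(visited[84])       # X after pass 85: 1534
print(1536 in visited)   # False: stepping by 6 skips right over 1536
print(27 * 85 + 5)       # 2300: loop cycles plus the final std ,x
```

After 85 passes X sits at 1534, which leaves exactly the last two screen bytes (1534 and 1535) for the final store.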

Just for fun, what if we did four stores, 8 bytes at a time?

clear64
    lda #96     * A=96
    tfr a,b     * B=A (D=A*256+B)
clearA64
    ldx #1024   * X=1024 (top left of screen)
    opt ct,cc   * Clear counter, turn it on.
loop64
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    std ,x++    * POKE X,A:POKE X+1,B:X=X+2
    cmpx #1536  * Compare X to 1536.
    bne loop64  * If X<>1536, GOTO loop64
    opt noct    * Turn off counter.
    rts         * RETURN

And that gives us:

        clear64
(2)         lda #96     * A=96
(4)         tfr a,b     * B=A (D=A*256+B)
        clearA64
(3)         ldx #1024   * X=1024 (top left of screen)
            opt ct,cc   * Clear counter, turn it on.
                loop64
(7)     7           std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(7)     14          std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(7)     21          std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(7)     28          std ,x++    * POKE X,A:POKE X+1,B:X=X+2
(3)     31          cmpx #1536  * Compare X to 1536.
(3)     34          bne loop64  * If X<>1536, GOTO loop64
                    opt noct    * Turn off counter.
(4)         rts         * RETURN

34 cycles stores 8 bytes. 64 times through the loop to do all 512 screen bytes, so 64 * 34 = 2176 cycles.

By now, I think you can see where this is going. I believe this is called “loop unrolling”. If you wanted the fewest cycles, you could just code 256 “std ,x++” in a row (7 * 256) for 1792 cycles, which would be fast but bulky code (each std ,x++ takes two bytes, so 512 bytes just for this clear routine).
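To see how quickly the returns diminish, here is a rough model of the unrolled loops. It assumes the per-instruction counts shown in the listings above (7 for “std ,x++”, 3 each for “cmpx” and “bne”) and only handles store counts that divide the 512-byte screen evenly. The names are invented for this sketch:

```python
STD_XPP = 7   # std ,x++ as reported in the listings above
CMPX = 3      # cmpx #1536
BNE = 3       # bne loop

def unrolled_cycles(stores_per_pass, total_bytes=512):
    """Cycles to clear total_bytes with stores_per_pass 16-bit stores per loop."""
    bytes_per_pass = 2 * stores_per_pass
    if total_bytes % bytes_per_pass:
        raise ValueError("pick a store count that divides the screen evenly")
    passes = total_bytes // bytes_per_pass
    return passes * (stores_per_pass * STD_XPP + CMPX + BNE)

for stores in (1, 2, 4, 8, 16):
    print(stores, unrolled_cycles(stores))

print("fully unrolled:", 256 * STD_XPP)
```

With these numbers, one, two and four stores per pass give 3328, 2560 and 2176 cycles, matching the listings. Eight stores would give 1984 and sixteen 1888, creeping toward the 1792 floor while the code keeps growing.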

There is always some balance between code size and speed. Larger programs took longer to load from tape or disk. But, if you didn’t mind load time, and you had extra memory available, tricks like this could really speed things up.

Blast it…

I have also read about “stack blasting” where you load values in to registers and then, instead of storing each register, you set a stack pointer to the destination and just push the registers on the stack. I’ve never done that before. Let’s see if we can figure it out.

There are two stacks in the 6809: one is the normal hardware stack used by the program and by subroutine calls (register S), and the other is the User Stack (register U). If we aren’t using it for a stack, we can use it as a 16-bit register, too.

The stack grows “up” toward lower memory addresses, so if the stack pointer is 5000, and you push an 8-bit register, the pointer will move to 4999 (pointing to the most recent register pushed). If you then push a 16-bit register, it will move to 4997. This means it will have to work in reverse from our previous examples. By pointing the stack register just past the end of the screen, we should be able to push registers on to the stack causing it to grow “up” to the top of the screen.
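The pointer movement is easy to model. This is only a sketch of the decrement-then-store behavior described above, with a Python dict standing in for memory and a made-up `push` helper:

```python
def push(memory, pointer, data):
    """Push bytes the way a 6809-style stack does: decrement, then store."""
    for byte in reversed(data):   # last byte goes in first, so data[0] ends up lowest
        pointer -= 1
        memory[pointer] = byte
    return pointer

memory = {}
u = 5000
u = push(memory, u, [0x60])         # an 8-bit register
print(u)                            # 4999
u = push(memory, u, [0x12, 0x34])   # a 16-bit register
print(u)                            # 4997
```

The pointer always ends up at the address of the most recently stored byte, which is why the starting value needs to be one past the last byte you want written.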

At first glance, it doesn’t look promising, since pushing D on to the user stack (U) takes more cycles than storing D at U:

(5)         std ,u

(6)         pshu d

But, it seems we make that up when pushing multiple registers since the cycle count does not grow as much as multiple stores do:

(7)         std ,U++
(7)         stx ,U++
(8)         sty ,U++
        
(10)        pshu d,x,y

I also see that STY is one cycle longer than STD or STX. This tells me to maybe avoid using Y like this…?

It looks good, though. 22 cycles compared to 10 seems quite the win. Let me see if I can do a clear routine using the User stack pointer and three 16-bit registers. We’ll compare this to the 48-bit clear shown earlier.

clear48s
    lda #96     * A=96
clearA48s
    tfr a,b     * B=A (D=A*256+B)
    tfr d,x     * X=D
    tfr d,y     * Y=D
    ldu #1536   * U=1536 (1 past end of screen)
    opt ct,cc   * Clear counter, turn it on.
loop48s
    pshu d,x,y
    cmpu #1026  * Compare U to 1026 (two bytes from start).
    bgt loop48s * If U>1026, GOTO loop48s.
    opt noct    * Turn off counter.
    pshu d      * Final 2 bytes.
    rts         * RETURN

And the results are…

        clear48s
(2)         lda #96     * A=96
        clearA48s
(4)         tfr a,b     * B=A (D=A*256+B)
(4)         tfr d,x     * X=D
(4)         tfr d,y     * Y=D
(3)         ldu #1536   * U=1536 (1 past end of screen)
            opt ct,cc   * Clear counter, turn it on.
                loop48s
(10)    10          pshu d,x,y
(4)     14          cmpu #1026  * Compare U to 1026 (two bytes from start).
(3)     17          bgt loop48s * If U>1026, GOTO loop48s
                    opt noct    * Turn off counter.
(6)                 pshu d      * Final 2 bytes.
(4)         rts         * RETURN

From “loop48s” to “bgt loop48s” we end up with 17 cycles compared to 27 using the std method. 85 * 17 = 1445 cycles + 6 final cycles = 1451 cycles. It looks like using stack push/pulls might be a real nice way to do this type of thing, provided the user stack is available, of course.
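Since this version counts down and stops on a “greater than” test, it is worth double-checking that it touches every screen byte exactly once and nothing outside the screen. Here is a small model of the routine above. It is a sketch only, tracking which addresses get written rather than what is stored:

```python
SCREEN_START, SCREEN_END = 1024, 1536  # SCREEN_END is one past the last byte

def stack_blast():
    """Model clear48s: returns (passes, final U, set of addresses written)."""
    u = SCREEN_END                      # ldu #1536
    written = set()
    passes = 0
    while True:
        for _ in range(6):              # pshu d,x,y stores six bytes below U
            u -= 1
            written.add(u)
        passes += 1
        if not u > SCREEN_START + 2:    # cmpu #1026 / bgt loop48s
            break
    for _ in range(2):                  # final pshu d
        u -= 1
        written.add(u)
    return passes, u, written

passes, u, written = stack_blast()
print(passes, u, written == set(range(SCREEN_START, SCREEN_END)))  # 85 1024 True
print(passes * 17 + 6)                                             # 1451
```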

Side Note: here is a fantastic writeup of this and the techniques on the 6809, as used in some unnamed CoCo 3 game back in the day: https://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html

The fastest way to zero

But wait! There’s more…

When setting a register to zero, I have been told to use “CLR” instead of “LDx #0”. Let’s see what that is all about…

(2)         lda #0
(1)         clra

(3)         ldd #0
(2)         clrd

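Ah, now I know a CLRA is twice as fast as LDA #0, and CLRD is one cycle faster than LDD #0. Nice. (One caveat: CLRD is a Hitachi 6309 instruction that a stock Motorola 6809 does not have, and a one-cycle CLRA matches 6309 native-mode timing. On a real 6809, CLRA takes 2 cycles, the same as LDA #0, though it is still one byte shorter.)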

Other 16-bit registers such as X, Y, and U do not have a CLR op code, so LDx will have to be used there, I suppose.

I then wondered if it made more sense to CLR a memory location, or clear a register then store that register there.

(6)         clr 1024
        
(1)         clra
(4)         sta 1024

It appears in this case, it takes fewer cycles to clear a register then store it in memory. Interesting. And using a 16-bit value:

(3)         ldd #0
(5)         std 1024

That is one cycle faster than doing a “clra / sta 1024 / sta 1025” it seems. It is also one byte less in size, so win win.
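Here is that comparison laid out as a small table. The cycle figures are the ones from the listings above. The byte sizes are an assumption on my part, based on the usual encodings (one byte for an inherent instruction, three for an extended-address or 16-bit immediate one):

```python
# (cycles, bytes) per instruction. Cycles come from the listings above;
# byte sizes are assumed from the standard encodings for these modes.
COST = {
    "clra":     (1, 1),
    "sta 1024": (4, 3),
    "sta 1025": (4, 3),
    "ldd #0":   (3, 3),
    "std 1024": (5, 3),
    "clr 1024": (6, 3),
}

def total(sequence):
    """Sum (cycles, bytes) over a list of instruction names."""
    cycles = sum(COST[name][0] for name in sequence)
    size = sum(COST[name][1] for name in sequence)
    return cycles, size

print(total(["clr 1024"]))                      # (6, 3)
print(total(["clra", "sta 1024"]))              # (5, 4)
print(total(["clra", "sta 1024", "sta 1025"]))  # (9, 7)
print(total(["ldd #0", "std 1024"]))            # (8, 6)
```

The 8-bit case is a trade: the register version saves a cycle but costs a byte. The 16-bit version wins on both counts.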

There is a lot to learn here, and from these experiments, I’m already seeing some things are not like I would have guessed.

I hope this inspires you to play with these LWASM options and see what your code is doing. During the writing of this article, I learned how to use that User Stack, and I expect that will come in handy if I decide to do any updates to my Invaders09 game some day…

Until next time…

CoCo 6809 assembly save/restore screen routine.

Occasionally I see a really “nice little touch” that a programmer took the time to add. For instance, some programs will restore the screen to what it looked like before the program ran. I decided I would do this for a project I was working on, and thought I’d share the super simple routine:

* Save/Restore screen.
* lwasm savescreen.asm -fbasic -osavescreen.bas --map

    org $3f00

* Test function.
start
    * Save the screen.
    bsr savescreensub   * GOSUB savescreensub

    * Fill screen.
    ldx #SCREENSTART    * X=Start of screen.
    lda #255            * A=255 (orange block).
loop
    sta ,x+             * Store A at X, X=X+1.
    cmpx #SCREENEND     * Compare X to SCREENEND.
    ble loop            * IF X<=SCREENEND, GOTO loop.

    * Wait for keypress.
getkey    
    jsr [$a000]         * Call POLCAT ROM routine.
    beq getkey          * If no key, GOTO getkey.

    * Restore screen.
    bsr restorescreensub * GOSUB restorescreensub

    rts                 * RETURN

* Subroutine
SCREENSTART equ 1024    * Start of screen memory.
SCREENEND   equ 1535    * Last byte of screen memory.

savescreensub
    pshs x,y,d          * Save registers we will use.
    ldx #SCREENSTART    * X=Start of screen.   
    ldy #screenbuf      * Y=Start of buffer.
saveloop
    ldd ,x++            * Load D with 2 bytes at X, X=X+2.
    std ,y++            * Store D at Y, Y=Y+2.
    cmpx #SCREENEND     * Compare X to SCREENEND.
    blt saveloop        * If X<SCREENEND, GOTO saveloop.
    puls x,y,d,pc       * Restore used registers and return.
    *rts

restorescreensub
    pshs x,y,d          * Save registers we will use.
    ldx #screenbuf      * X=Start of buffer.
    ldy #SCREENSTART    * Y=Start of screen.
restoreloop
    ldd ,x++            * Load D with 2 bytes at X, X=X+2.
    std ,y++            * Store D at Y, Y=Y+2.
    cmpy #SCREENEND     * Compare Y to SCREENEND.
    blt restoreloop     * If Y<SCREENEND, GOTO restoreloop.
    puls x,y,d,pc       * Restore used registers and return.
    *rts

* This would go in your data area.
screenbuf rmb SCREENEND-SCREENSTART+1

    end

There are two routines – savescreensub and restorescreensub – named that way just so I would know they are subroutines designed to be called by bsr/lbsr/jsr.

They make use of a 512-byte buffer (in the case of the CoCo’s 32×16 screen).

savescreensub will copy all the bytes currently on the text screen over to the buffer. restorescreensub will copy all the saved bytes in the buffer back to the screen.
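The loop bounds are the easy part to get wrong here, since the copy moves two bytes per pass but compares against the address of the last byte. A quick model of the save loop, assuming SCREENEND is the address of the last screen byte (1535) and using a hypothetical buffer address:

```python
SCREENSTART = 1024
SCREENEND = 1535        # assumed: address of the LAST screen byte
BUFFER = 0x4000         # hypothetical buffer address, for illustration only

def copy_words(memory, src, dst, last_src):
    """Copy two bytes per pass until the source pointer passes last_src."""
    x, y = src, dst
    while True:
        memory[y:y + 2] = memory[x:x + 2]   # ldd ,x++ / std ,y++
        x += 2
        y += 2
        if not x < last_src:                # cmpx #SCREENEND / blt saveloop
            return x - src                  # bytes copied

memory = bytearray(65536)
memory[SCREENSTART:SCREENEND + 1] = bytes(range(256)) * 2   # fake screen contents
copied = copy_words(memory, SCREENSTART, BUFFER, SCREENEND)
print(copied)   # 512
```

The pointer goes 1024, 1026, and so on up to 1534, then lands on 1536 and the “less than” test fails, so exactly 512 bytes are copied.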

Some example code is provided.

What would you change?

Until next time…

GOTO, GOSUB, Stack Overflows and 6809 stack jumping.

While wandering through the Color/Extended/Disk BASIC Unraveled books trying to figure out how the RAM hooks worked, I came across a technique that I had never used.

So of course I’m going to digress with a bunch of other stuff first.

GOTO and GOSUB

In BASIC, you can run code using GOTO or GOSUB. GOTO jumps to a specific line number and runs from there. If that code needs to get back to the main loop, it has to do so with another GOTO.

10 REM MAIN LOOP
20 A$=INKEY$:IF A$="" THEN 20
30 IF A$="L" THEN GOTO 100
40 IF A$="R" THEN GOTO 200
50 GOTO 10

100 REM MOVE LEFT
...
190 GOTO 10

200 REM MOVE RIGHT
...
290 GOTO 10

This is fine for code that does one specific thing at one specific place, but the routines at 100 and 200 could not be used anywhere else in the program unless after such use they always resumed running at line 10.

GOSUB is often a better option, since it eliminates the need for the subroutine to know where it must GOTO at the end:

10 REM MAIN LOOP
20 A$=INKEY$:IF A$="" THEN 20
30 IF A$="L" THEN GOSUB 100
40 IF A$="R" THEN GOSUB 200
50 GOTO 10

100 REM MOVE LEFT
...
190 RETURN

200 REM MOVE RIGHT
...
290 RETURN

There are inefficiencies to the above code, as well as some potential problems, but it’s good enough for an example.

When GOSUB is seen, BASIC remembers the exact spot after the line number and saves it somewhere. It then jumps to that line number, and when a RETURN is seen, it retrieves the saved location and jumps back there to continue executing.

The location is saved on a stack, so you can GOSUB from a GOSUB from a GOSUB, as long as there is enough memory to remember all those locations.

Stack Notes

Think of the stack like a stack of POST-IT(tm) notes. When a GOSUB happens, the return location is written on a piece of paper, then that paper is placed somewhere. If another GOSUB is seen, that location is written on paper and then stuck on top of the previous one, and so on. You end up with a stack of locations. When a RETURN is seen, it grabs the top piece of paper and returns to that location, then that paper is discarded.

10 PRINT "TEST START"
20 GOSUB 100
30 PRINT "TEST END"
40 END

100 REM FIRST
110 PRINT "  FIRST START"
120 GOSUB 200
130 PRINT "  FIRST END"
140 RETURN

200 REM SECOND
210 PRINT "    SECOND START"
220 PRINT "    SECOND END"
230 RETURN

Running that program prints:

TEST START
  FIRST START
    SECOND START
    SECOND END
  FIRST END
TEST END

Test calls First which calls Second. When Second returns, it returns back to First. When First returns, it returns back to Test.
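The stack-of-notes idea can be sketched as a toy interpreter. This is not how the BASIC ROM is implemented; it is just a Python list used as the stack of return locations, running the same program:

```python
PROGRAM = {
    10: ("PRINT", "TEST START"),
    20: ("GOSUB", 100),
    30: ("PRINT", "TEST END"),
    40: ("END",),
    100: ("REM",),
    110: ("PRINT", "  FIRST START"),
    120: ("GOSUB", 200),
    130: ("PRINT", "  FIRST END"),
    140: ("RETURN",),
    200: ("REM",),
    210: ("PRINT", "    SECOND START"),
    220: ("PRINT", "    SECOND END"),
    230: ("RETURN",),
}

def run(program):
    """Run the toy program and return the list of printed lines."""
    lines = sorted(program)
    stack, output = [], []      # stack holds the index of the line to resume at
    pc = 0
    while True:
        op = program[lines[pc]]
        if op[0] == "PRINT":
            output.append(op[1])
            pc += 1
        elif op[0] == "GOSUB":
            stack.append(pc + 1)        # write the return spot on a note
            pc = lines.index(op[1])
        elif op[0] == "RETURN":
            pc = stack.pop()            # grab the top note and go back there
        elif op[0] == "REM":
            pc += 1
        else:                           # END
            return output

print("\n".join(run(PROGRAM)))
```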

If you ever leave a GOSUB with a GOTO, that return location is still there, saved, and that memory is never returned to the BASIC program. This will crash a program:

10 PRINT X
20 X=X+1
30 GOSUB 10

Each GOSUB adds a return location to the BASIC stack, and since the program is recursively calling itself without ever RETURNing, it will eventually run out of BASIC stack space. In the test I just did, I received an ?OM ERROR (out of memory) at count 3247. On a system with less RAM available (smaller RAM, larger program, etc.) that will happen sooner.

This is a STACK OVERFLOW, and languages like C, assembly, etc. can all have them. (I assume that’s where the Q&A site www.stackoverflow.com got its name from.)

Some environments have stack checking, and they will terminate the offending program with an error message when this happens. This is what happened with the ?OM ERROR. Beyond BASIC, operating systems generally take care of this stack checking. Programs written in C or 6809 assembly running under OS-9 most certainly will get terminated with a stack overflow if they try to use more than the OS reserved for them. (Ah, if I only understood this way back then. I just knew to keep adding more memory to a command until it ran without crashing…)

Assembly GOTO and GOSUB

In 6809 assembly, a GOTO equivalent would be like a BRx branch instruction or a JMP jump instruction. The earlier BASIC example might look like this in CoCo assembly:

mainloop
  jsr [$a000]   * Call ROM POLCAT routine, key comes back in A.
  beq mainloop  * If A="", GOTO mainloop.
  cmpa #'L      * Compare A to character "L".
  beq moveleft  * If A="L", GOTO moveleft.
  cmpa #'R      * Compare A to character "R".
  beq moveright * If A="R", GOTO moveright.
  bra mainloop  * GOTO mainloop.

moveleft
  ...
  bra mainloop  * GOTO mainloop.

moveright
  ...
   bra mainloop * GOTO mainloop.

For very simple logic, assembly can be quite similar to BASIC.

GOSUB would be BSR branch subroutine or JSR jump subroutine operation. Here is what the second BASIC example might look like in assembly:

mainloop
  jsr [$a000]     * Call ROM POLCAT routine, key comes back in A.
  beq mainloop    * If A="", GOTO mainloop.
  cmpa #'L        * Compare A to character "L".
  bne checkright  * If A<>"L", skip the GOSUB.
  bsr moveleft    * GOSUB moveleft.
checkright
  cmpa #'R        * Compare A to character "R".
  bne mainloop    * If A<>"R", GOTO mainloop.
  bsr moveright   * GOSUB moveright.
  bra mainloop    * GOTO mainloop.

moveleft
  ...
  rts             * RETURN.

moveright
  ...
  rts             * RETURN.

Very simple code like this would be a good way for a BASIC programmer to tip-toe in to the land of assembly language. It’s quite fun, until you realize how much work is needed for anything that is not as simple ;-)

And now the third example… Since assembly does not have a PRINT command, I created a simple subroutine that prints a string by passing each character, one at a time, in the A register to the ROM CHROUT routine.

* lwasm jsrtest.asm -fbasic -ojsrtest.bas --map

    org $3f00

start
    * 10 PRINT "TEST START"
    ldx #teststartmsg   * X=Start of message.
    jsr print           * GOSUB print.

    * 20 GOSUB 100
    jsr first           * GOSUB first.

    * 30 PRINT "TEST END"
    ldx #testendmsg     * X=Start of message.
    jsr print           * GOSUB print.
    
    * 40 END
    rts                 * RETURN

first
    * 110 PRINT "  FIRST START"
    ldx #firststartmsg  * X=Start of message.
    jsr print           * GOSUB print.

    * 120 GOSUB 200
    jsr second

    * 130 PRINT "  FIRST END"
    ldx #firstendmsg    * X=Start of message.
    jsr print           * GOSUB print.
    
    * 140 RETURN
    rts                 * RETURN

second
    * 210 PRINT "    SECOND START"
    ldx #secondstartmsg * X=Start of message.
    jsr print           * GOSUB print.
    
    * 220 PRINT "    SECOND END"
    ldx #secondendmsg   * X=Start of message.
    jsr print           * GOSUB print.
    
    * 230 RETURN
    rts                 * RETURN

* PRINT subroutine. Prints the string pointed to by X.
print
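    lda ,x+             * A=next character, X=X+1.
    beq done            * If A=0 (end of string), GOTO done.
    jsr [$a002]         * Call ROM CHROUT routine.
    bra print           * GOTO print.
done
    lda #13             * A=13 (ENTER).
    jsr [$a002]         * Call ROM CHROUT routine.
    rts                 * RETURN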

* Data storage for the string messages.
teststartmsg
    fcc "TEST START"
    fcb 0

testendmsg
    fcc "TEST END"
    fcb 0

firststartmsg
    fcc "  FIRST START"
    fcb 0

firstendmsg
    fcc "  FIRST END"
    fcb 0

secondstartmsg
    fcc "    SECOND START"
    fcb 0

secondendmsg
    fcc "    SECOND END"
    fcb 0

Here is a BASIC loader for the above assembly routine. You can load and RUN this, then type EXEC &H3F00 to run it.

10 READ A,B
20 IF A=-1 THEN 70
30 FOR C = A TO B
40 READ D:POKE C,D
50 NEXT C
60 GOTO 10
70 END
80 DATA 16128,16267,142,63,62,189,63,45,189,63,16,142,63,73,189,63,45,57,142,63,82,189,63,45,189,63,32,142,63,96,189,63,45,57,142,63,108,189,63,45,142,63,125,189,63,45,57,166,128,39,6,173,159,160,2,32,246,134,13,173,159,160,2,57,84,69,83,84,32
90 DATA 83,84,65,82,84,0,84,69,83,84,32,69,78,68,0,32,32,70,73,82,83,84,32,83,84,65,82,84,0,32,32,70,73,82,83,84,32,69,78,68,0,32,32,32,32,83,69,67,79,78,68,32,83,84,65,82,84,0,32,32,32,32,83,69,67,79,78,68,32,69,78,68,0,-1,-1
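The DATA format is simple: a start address, an end address, the bytes that go between them, and a final -1,-1 to stop. Here is a sketch of the same loader logic in Python, with a bytearray standing in for memory and a shortened sample instead of the full DATA:

```python
def load(data, memory):
    """Mimic the BASIC loader: (start, end, bytes...) blocks ending with -1,-1."""
    values = iter(data)
    loaded = 0
    while True:
        start, end = next(values), next(values)    # 10 READ A,B
        if start == -1:                            # 20 IF A=-1 THEN 70
            return loaded
        for address in range(start, end + 1):      # 30 FOR C = A TO B
            memory[address] = next(values)         # 40 READ D:POKE C,D
            loaded += 1

memory = bytearray(65536)
# First three bytes of the real DATA (ldx #$3F3E), cut short for illustration.
sample = [16128, 16130, 142, 63, 62, -1, -1]
print(load(sample, memory))   # 3
print(16267 - 16128 + 1)      # 140 bytes in the full block above
```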

Stack Overflow in assembly

Just for fun… Here is the GOSUB crash program in assembly. 99% of this code is just a crappy routine I had to write to print out a decimal number.

    org $3f00

start
    ldx #0              * X=0
loop
    * 10 PRINT X
    jsr printx          * GOSUB printx.

    * 20 X=X+1
    leax 1,x            * X=X+1

    * 30 GOSUB 10
    bsr loop            * GOSUB loop.

    rts                 * Return to BASIC.

*
* Crappy routine I just put together to try to print out a decimal number.
*
printx
    * Init buffer to 00000.
    lda #'0
    sta numberstring
    sta numberstring+1
    sta numberstring+2
    sta numberstring+3
    sta numberstring+4
  
    * X is our counter.
    tfr x,d         * Copy X to D

tenthousands    
    cmpd #10000
    blt thousands
    subd #10000
    inc numberstring
    bra tenthousands

thousands
    cmpd #1000
    blt hundreds
    subd #1000
    inc numberstring+1
    bra thousands

hundreds
    cmpd #100
    blt tens
    subd #100
    inc numberstring+2
    bra hundreds

tens
    cmpd #10
    blt ones
    subd #10
    inc numberstring+3
    bra tens

ones
    cmpd #1
    blt print
    subd #1
    inc numberstring+4
    bra ones

print
    ldy #numberstring
printloop
    lda ,y+
    jsr [$a002]
    cmpy #bufferend
    bne printloop

    lda #13
    jsr [$a002]
    rts

numberstring rmb 5  * Holds 00000-99999
bufferend equ numberstring+5

Thank you for ignoring my poorly-coded “printx” subroutine.

When I run this, it crashes after printing 08141. I believe it is a much smaller number than the BASIC one because it has much less memory for the stack. Since this program starts in memory just below the 16K mark (&H3F00), the stack only has the space from where it starts (which the numbers suggest is just under the 32K mark, &H7FFF) down to the end of this program. Each bsr pushes a 2-byte return address, so 8141 calls use about 16K. As the stack grows, without stack checking, it eventually overwrites the running assembly code, crashing the computer.
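As a rough sanity check on that explanation, the crash count can be turned back in to an address. This assumes each “bsr loop” leaves exactly two bytes on the stack and ignores the handful of bytes used temporarily by the print routine:

```python
BYTES_PER_CALL = 2          # bsr/jsr pushes a 16-bit return address
calls_before_crash = 8141   # last number printed in the test above
program_start = 0x3F00

stack_used = calls_before_crash * BYTES_PER_CALL
print(stack_used)                        # 16282 bytes of return addresses
print(hex(program_start + stack_used))   # 0x7e9a: roughly where the stack began
```

That lands a little below &H8000, which fits a stack that started near the top of 32K and grew down until it ran in to the program.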

Let’s pretend we never did that.

What are we learning?

At the start of this article, I mentioned something I just learned from looking at other assembly code. I learned how to get out of an assembly GOSUB routine without needing to return. Just like BASIC, calling a subroutine recursively will cause a crash. Unlike BASIC, there is no stack checking when running raw 6809 code without an operating system, so it can really crash BASIC and require a reset of the computer.

There is a way to GOTO out of an assembly routine without leaving that GOSUB program counter memory on the stack. You simply move the stack pointer by 2 places.

For example, say you had assembly code that was like this BASIC:

10 GOSUB 100
100 GOSUB 200
200 ...

The stack would look like this:

      <- Next GOSUB would be stored here.
[100] <- Top of stack. RETURN would use this.
[ 10]

BASIC has no way to throw away whatever GOSUB entry is on the top of the stack, but it is simple to do in assembly just by adding 2 to the S (stack pointer) register.

start
    jsr first    * GOSUB first.
    rts          * RETURN

first
    jsr second   * GOSUB second.
    rts          * RETURN

second
    leas 2,S     * S=S+2 (discard the top return address).

    rts          * RETURN

By the time the code gets to “second”, the assembly stack should look like this:

         <- Next bsr/jsr would be stored here.
[first]  <- Top of stack. RTS would use this.
[start]

When the second routine does “leas 2,s”, the stack pointer moves down and it looks like this:

         
[xxxxx]  <- Next bsr/jsr would be stored here.
[start]  <- Top of stack. RTS would use this.

Side Note: Data on the stack is never erased, but will be overwritten the next time something is stored there. The [xxxxx] is actually still [first].

Now if the subroutine does an RTS, it will be returning to start and not first. Thus, if you add that to the assembly and run it, the output will be:

TEST START
  FIRST START
    SECOND START
    SECOND END
TEST END
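The effect of dropping a return address can be modeled with the same kind of toy interpreter idea, using a Python list as the stack. The op names here (“print”, “jsr”, “leas2”, “rts”) are invented for this sketch, and `None` stands in for “return to BASIC”:

```python
ROUTINES = {
    "start":  [("print", "TEST START"), ("jsr", "first"),
               ("print", "TEST END"), ("rts",)],
    "first":  [("print", "  FIRST START"), ("jsr", "second"),
               ("print", "  FIRST END"), ("rts",)],
    "second": [("print", "    SECOND START"), ("print", "    SECOND END"),
               ("leas2",), ("rts",)],
}

def run(routines, entry="start"):
    """Run the toy routines and return the list of printed lines."""
    stack = [None]                  # None stands in for "return to BASIC"
    output = []
    where = (entry, 0)
    while where is not None:
        name, i = where
        op = routines[name][i]
        if op[0] == "print":
            output.append(op[1])
            where = (name, i + 1)
        elif op[0] == "jsr":
            stack.append((name, i + 1))   # push the return location
            where = (op[1], 0)
        elif op[0] == "leas2":
            stack.pop()                   # throw away the top return location
            where = (name, i + 1)
        else:                             # rts
            where = stack.pop()
    return output

print("\n".join(run(ROUTINES)))
```

Because “second” discards the note that pointed back in to “first”, its rts picks up the one below it and “FIRST END” is never printed.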

I do not know of a legal way to do the same in BASIC, but I am sure there is some POKE that could be done to achieve the same thing.

The Microsoft BASIC ROMs do this trick often, when patching in new routines that override some function.

And now it’s time for a brain break.

Until next time…

Spiraling in Color BASIC and 6809 Assembly – part 2

See also: part 1 and part 2.

In the previous installment, I shared an inefficient BASIC program that could draw a spiral pattern around the screen at whatever location and size was specified. Since the program was not very efficient, I then shared an improved version that ran almost three times faster. This is what it looked like:

YouTube video of spiralbas2.bas

Using this type of spiral pattern would make a nice transition between a title or high score screen and the actual game screen. It would be useful to have a reverse spiral that started with a solid color screen and spiraled outward to reveal the screen, but that is something for the future.

For now, I wanted to explain why the original BASIC code was written so oddly: it was not originally BASIC code. I wrote the routine in assembly, then back-ported it to BASIC. Some of you may remember the time I took one of my old BASIC programs and ported it to C. Yeah, this is kinda like that. But different.

The routine in assembly language seems quite a bit faster :-)

YouTube video of spiral.asm

In 6809 assembly, the main registers that are used include two 8-bit registers (A and B) and two 16-bit registers (X and Y). There are not enough registers to serve as all the variables needed for this program, so I made use of memory – storing values then retrieving them later. Much like my BASIC version, this assembly is not as good as it should be. Ideally, it should be a routine where you load a few registers, then call the function, such as:

ldx #1024 ; start screen position
lda #32   ; width
ldb #16   ; height
bsr spiral

But I also wanted to specify the character (color) to use for the spiral, and I was out of registers. Thus, memory locations.

I used the RMB (reserve memory bytes) statement to reserve two bytes in memory after the program:

XSTEPS rmb 1
YSTEPS rmb 1

This let me load the X and Y steps (width and height) of the spiral to draw in those memory locations, so the routine only needed a register for the character/color, and another pointing to the starting position:

 ldx #1024  ; point X to starting screen position
 lda #32    ; width...
 sta XSTEPS ; stored at XSTEPS
 lda #16    ; height...
 sta YSTEPS ; stored at YSTEPS
 ldb #255   ; b is color/character to use
 bsr right  ; start of spiral routine

I think I may redo it at some point, and use just one memory location for the color/character, then use registers A and B for the width and height. Looking at this now, that seems a bit cleaner.

But I digress…

Here is the 6809 assembly code I came up with, with the BASIC version included as comments so you can compare:

* lwasm spiralasm.asm -fbasic -ospiralasm.bas --map

    org $3f00

start:
 ldx #1024      * 10 CLS
 lda #96
 ldb #96
clearloop
 std ,x++
 cmpx #1536
 bne clearloop

                * 15 ' X=START MEM LOC
 ldx #1024      * 20 X=1024

                * 25 ' XS=XSTEPS (WIDTH)
 lda #32        * 30 XS=32
 sta XSTEPS
                * 35 ' YS=YSTEPS (HEIGHT)
 lda #16        * 40 YS=16
 sta YSTEPS
                * 45 ' B=CHAR TO POKE
 ldb #255       * 50 B=255
 bsr right      * 60 GOSUB 100

 ldx #1024      * 70 X=1024
 lda #18        * 71 XS=18
 sta XSTEPS
 lda #8         * 72 YS=8
 sta YSTEPS
 ldb #175       * 73 B=175 '143+32
 bsr right      * 74 GOSUB 100

 ldx #1294      * 75 X=1294 '1024+14+32*8
 lda #18        * 76 XS=18
 sta XSTEPS
 lda #8         * 77 YS=8
 sta YSTEPS
 ldb #207       * 78 B=207 '143+64
 bsr right      * 79 GOSUB 100

 ldx #1157      * 80 X=1157 '1024+5+32*4
 lda #22        * 81 XS=22
 sta XSTEPS
 lda #8         * 82 YS=8
 sta YSTEPS
 ldb #239       * 83 B=239 '143+96
 bsr right      * 84 GOSUB 100

goto            * 99 GOTO 99
    jsr [$a000] * POLCAT ROM routine
    cmpa #3     * break key
    bne goto
    rts

right           * 100 ' RIGHT
 lda XSTEPS     * 110 A=XS
rightloop
 stb ,x         * 120 POKE X,B
 deca           * 130 A=A-1
 beq rightdone  * 140 IF A=0 THEN 170
 leax 1,x       * 150 X=X+1
 bra rightloop  * 160 GOTO 120
rightdone
 leax 32,x      * 170 X=X+32
 dec YSTEPS     * 180 YS=YS-1
 beq done       * 190 IF YS=0 THEN 600

down            * 200 ' DOWN
 lda YSTEPS     * 210 A=YS
downloop
 stb ,x         * 220 POKE X,B
 deca           * 230 A=A-1
 beq downdone   * 240 IF A=0 THEN 270
 leax 32,x      * 250 X=X+32
 bra downloop   * 260 GOTO 220
downdone
 leax -1,x      * 270 X=X-1
 dec XSTEPS     * 280 XS=XS-1
 beq done       * 290 IF XS=0 THEN 600

left            * 300 ' LEFT
 lda XSTEPS     * 310 A=XS
leftloop
 stb ,x         * 320 POKE X,B
 deca           * 330 A=A-1
 beq leftdone   * 340 IF A=0 THEN 370
 leax -1,x      * 350 X=X-1
 bra leftloop   * 360 GOTO 320
leftdone
 leax -32,x     * 370 X=X-32
 dec YSTEPS     * 380 YS=YS-1
 beq done       * 390 IF YS=0 THEN 600

up              * 400 ' UP
 lda YSTEPS     * 410 A=YS
uploop
 stb ,x         * 420 POKE X,B
 deca           * 430 A=A-1
 beq updone     * 440 IF A=0 THEN 470
 leax -32,x     * 450 X=X-32
 bra uploop     * 460 GOTO 420
updone
 leax 1,x       * 470 X=X+1
 dec XSTEPS     * 480 XS=XS-1
 beq done       * 490 IF XS=0 THEN 600

 bra right      * 500 GOTO 100
done
 rts            * 600 RETURN

XSTEPS rmb 1
YSTEPS rmb 1
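Since the four legs of the routine are nearly identical, the structure may be easier to check in a high-level model. Below is a rough Python sketch of the same logic as BASIC lines 100 to 600. The `spiral` function, the `legs` table and the `WIDTH` and `SCREEN` constants are my own names for this illustration, and the sketch assumes the width and height are both at least 1.

```python
WIDTH = 32          # bytes per row on the 32-column text screen
SCREEN = 1024       # first byte of screen memory

def spiral(mem, x, xsteps, ysteps, char):
    """Poke `char` into `mem` in an inward spiral starting at address `x`."""
    steps = {"xs": xsteps, "ys": ysteps}
    # (step along the edge, step to the next edge, counter to walk, counter to shrink)
    legs = [
        (1, WIDTH, "xs", "ys"),      # lines 100-190: right, then down a row
        (WIDTH, -1, "ys", "xs"),     # lines 200-290: down, then left a column
        (-1, -WIDTH, "xs", "ys"),    # lines 300-390: left, then up a row
        (-WIDTH, 1, "ys", "xs"),     # lines 400-490: up, then right a column
    ]
    while True:
        for step, turn, walk, shrink in legs:
            count = steps[walk]
            for i in range(count):
                mem[x + i * step] = char
            x += (count - 1) * step + turn
            steps[shrink] -= 1
            if steps[shrink] == 0:   # line 600: nothing left to draw
                return

mem = bytearray([96]) * 1536         # memory filled with 96, like the CLS loop
spiral(mem, SCREEN, 32, 16, 255)
print(mem[SCREEN:SCREEN + 512].count(255))  # 512
```

Each leg walks one edge of the remaining rectangle and then shrinks the other dimension, so a full-screen call touches each of the 512 screen bytes exactly once.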

This experiment made me think about other assembly routines I’ve used, and what they would look like in BASIC. For example, I like to type in this one, which goes through every byte of the 32-column text screen and decrements it by one. It loops forever, making a neat effect:

YouTube video of screeninc.asm

Here is that code:

    org $3f00

start ldx #1024
loop dec ,x+
 cmpx #1536
 bne loop
 bra start
 end

You can even try it yourself right in a web browser:

  1. Go to the online JS Mocha CoCo emulator.
  2. From the center list, select “EDTASM” and then click “Load Bin“. This will load the Microsoft Editor/Assembler for the CoCo.
  3. Once EDTASM 1.0 is loaded, at the “*” prompt, type “I” to go in to input mode. The prompt will change in to a line number.
  4. At line number “00100”, type:
    (right arrow for tab)ORG(right arrow)$3F00(enter)
    START(right arrow)LDX(right arrow)#1024(enter)
    LOOP(right arrow)DEC ,X+(enter)
    (right arrow)CMPX(right arrow)#1536(enter)
    (right arrow)BNE(right arrow)LOOP(enter)
    (right arrow)BRA(right arrow)START(enter)
    (right arrow)END(enter)
  5. Exit the editor by pressing ESCape (break key). This returns to the “*” prompt.
  6. Assemble the program by typing “A/IM/WE“. If there are any errors, explaining how editing works in EDTASM is beyond this article, so you could just restart EDTASM and begin again.
  7. If it built with “00000 TOTAL ERRORS”, enter the Z-Bug debugger by typing “Z“. The prompt will change to a “#” symbol.
  8. Run the program by typing “G START“. The screen should do the effect shown in the YouTube video above.
JS Mocha emulator running Microsoft EDTASM+

EDTASM NOTE: The use of tabs (right arrow) is just cosmetic and makes the source code look nice. Instead of doing all the (right arrow) stuff in step #4, you could just type spaces instead. It just wouldn’t look as nice in the listing.

With that tangent out of the way, here is what a literal translation of that short program might look like in Color BASIC:

10 X=1024
20 A=PEEK(X)
30 A=A-1:IF A<0 THEN A=255
40 POKE X,A
50 X=X+1
60 IF X<>1536 THEN 20
70 GOTO 10
80 END

And if you run that, you will see it takes over twelve seconds to go through the screen each time. Thus, assembly code is really the only way to go for this type of thing.

But, if speed is not an issue, translating 6809 assembly to BASIC can certainly be done, at least for simple things like this. But why would one want to?

This example is especially slow because BASIC has no command that replicates the assembly “DEC” operation. DECrement will decrement a register value, or a byte in memory. In this case, “DEC ,X+” says “decrement the byte at location X, then increment X by one.” Thus, replicating that in BASIC takes using the PEEK and POKE commands. Also, when you INCrement or DECrement a byte in assembly, it rolls over at the end. i.e., you can increment 0 all the way up to 255, then incrementing that again rolls over to 0. For decrement, it’s the opposite: start at 255, and decrement until it gets to zero, where a decrement would make it roll over back to 255. In BASIC, subtracting one just ends up making a negative number, so the rollover has to be achieved through the extra code in line 30.
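The rollover is just arithmetic on an 8-bit value. As a quick illustration in Python (the `dec8` and `inc8` helper names are mine, not anything from the CoCo), masking with 255 gives the same wraparound that DEC and INC give on a byte:

```python
def dec8(value):
    """8-bit decrement with wraparound, like DEC on a byte in memory."""
    return (value - 1) & 0xFF

def inc8(value):
    """8-bit increment with wraparound, like INC on a byte in memory."""
    return (value + 1) & 0xFF

print(dec8(1))    # 0
print(dec8(0))    # 255
print(inc8(255))  # 0
```

Line 30 of the BASIC version does the same job as the mask in `dec8`, just with an IF instead.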

There is more that needs to be done to this spiral routine, but I’ll save that for the future…

Until next time…

Color BASIC RAM hooks – part 2

See Also: part 1 and part 2.

Updates:

  • 2022-08-02 – Minor assembly optimization using “TST” to replace “PSHS B / LDB DEVNUM / PULS B”, contributed by L. Curtis Boyle in the comments.

Since I wrote part 1, I have learned a bit more about using the Color BASIC RAM hooks. One thing I learned is that the BREAK CHECK RAM hook cannot be used to disable BREAK. This is because other parts of the BASIC ROM jump directly to the break check and do not call the RAM hook. Ah, well. If I really need to disable the break key, at least I know how to do it thanks to the 500 POKES, PEEKS ‘N EXECS for the TRS-80 Color Computer book.

I did want to revisit using the CONSOLE OUT RAM hook and do something perhaps almost useful. The MC6847 VDG chip used in the Color Computer lacks true lowercase, and displays those characters as inverse uppercase letters. Starting with later model CoCo’s labeled as “TANDY” instead of “TRS-80”, a new version of the VDG was used that did include true lowercase, but by default, BASIC still showed them as inverse uppercase.

I remembered having a terminal program for my CoCo 1 that would show all text in uppercase. This made the screen easier to read when calling in to a B.B.S. running on a system that had real lowercase. I thought it might be fun to make a quick assembly program that would intercept all characters going to the screen and translate any lowercase letters to uppercase.

Let’s start by looking at the code:

* lwasm consout2.asm -fbasic -oconsout2.bas --map

* Convert any lowercase characters written to the
* screen (device #0) to uppercase.

DEVNUM equ $6f
RVEC3 equ $167      console out RAM hook

    org $3f00

init
    lda RVEC3       get op code
    sta savedrvec   save it
    ldx RVEC3+1     get address
    stx savedrvec+1 save it

    lda #$7e        op code for JMP
    sta RVEC3       store it in RAM hook
    ldx #newcode    address of new code
    stx RVEC3+1     store it in RAM hook

    rts             done

newcode
    * Do this only if DEVNUM is 0 (console)
    *pshs b          save b
    *ldb DEVNUM      get device number
    *puls b          restore b
    tst DEVNUM      is DEVNUM 0?          
    bne continue    not device #0 (console)
uppercase
    cmpa #'a        compare A to lowercase 'a'
    blt continue    if less than, goto continue
    cmpa #'z        compare A to lowercase 'z'
    bgt continue    if greater than, goto continue
    suba #32        a = a - 32
continue
savedrvec rmb 3     call regular RAM hook
    rts             just in case...

The first things to point out are the EQUates at the start of the code. They are just labels for two locations in BASIC memory we will be using: The CONSOLE OUT RAM hook entry, and the DEVNUM device number byte. DEVNUM is used by BASIC to know what device the output is going to.

Device Numbers

Devices include:

  • -3 – used by the DLOAD command in CoCo 1/2 Extended Color BASIC
  • -2 – printer
  • -1 – cassette
  • 0 – screen and keyboard
  • 1-15 – disk

The BASIC ROM will set DEVNUM to the device being used, and routines use that to know what to do with the data being written. For example:

Device 0?

Device #0 may seem unnecessary, since it is assumed if #0 is not present:

PRINT "THIS GOES TO THE SCREEN"
PRINT #0,"SO DOES THIS"

Or…

10 INPUT "NAME";A$
10 INPUT #0,"NAME";A$

But, it is very useful if you are writing code that you want to be able to output to the screen, a printer, a cassette file, or disk file. For example:

10 REM DEVICE0.BAS
20 DN=0
30 PRINT "OUTPUT TO:"
40 PRINT"S)CREEN, T)APE OR D)ISK:"
50 A$=INKEY$:IF A$="" THEN 50
60 LN=INSTR("STD",A$)
70 ON LN GOSUB 100,200,300
80 GOTO 30

100 REM SCREEN
110 DN=0:GOSUB 400
120 RETURN

200 REM TAPE
210 PRINT "SAVING TO TAPE"
220 OPEN"O",#-1,"FILENAME"
230 DN=-1:GOSUB 400
240 CLOSE #-1
250 RETURN

300 REM DISK
310 PRINT "SAVING TO DISK"
320 OPEN"O",#1,"FILENAME"
330 DN=1:GOSUB 400
340 CLOSE #1
350 RETURN

400 REM OUTPUT HEADER TO DEV
410 PRINT #DN,"+-----------------------------+"
420 PRINT #DN,"+   SYSTEM SECURITY REPORT:   +"
430 PRINT #DN,"+-----------------------------+"
440 RETURN

That is a pretty terrible example, but hopefully shows how useful device number 0 can be. In this case, the routine at 400 is able to output to tape, disk or screen (though in the case of tape or disk, code must open/create the file before calling 400, and then close it afterwards).

Installing the new code.

The code starts out by saving the three bytes currently in the RAM hook:

init
    lda RVEC3       get op code
    sta savedrvec   save it
    ldx RVEC3+1     get address
    stx savedrvec+1 save it

The three bytes are saved elsewhere in program memory, where they are reserved using the RMB statement in the assembly source:

savedrvec rmb 3     call regular RAM hook

More on that in a moment. Next, the RAM hook bytes are replaced with three new bytes, which will be a JMP instruction (byte $7e) and the two byte location of the “new code” routine in memory:

    lda #$7e        op code for JMP
    sta RVEC3       store it in RAM hook
    ldx #newcode    address of new code
    stx RVEC3+1     store it in RAM hook

    rts             done

There is not much to it. As soon as this code executes, the Color BASIC ROM will start calling the “newcode” routine every time a character is being output. After that RAM hook is done, the ROM continues with outputting to whatever device is selected.

Color BASIC came with support for screen, keyboard and cassette.

Extended BASIC used the RAM hook to patch in support for the DLOAD command (which uses device #-3).

Disk BASIC used the RAM hook to patch in support for disk devices.

And now our code uses the RAM hook to run our new code, and then we will call whatever was supposed to be there (which is why we save the 3 bytes that were in the RAM hook before we change it).

Now we look at “newcode” and what it does.

Most printers print lowercase.

Since a printer might print lowercase just fine, our code will not want to uppercase any output going to a printer. Likewise, we may want to write files to tape or disk using full upper or lowercase. Also, you can save binary data to a file on tape or disk. Translating lowercase characters to uppercase would be a bad thing if the characters being sent were actually supposed to be raw binary data.

Thus, DEVNUM is needed so the new code will ONLY translate if the output is going to the screen (device #0). That’s what happens here:

newcode
    * Do this only if DEVNUM is 0 (console)
    tst DEVNUM      is DEVNUM 0? 
    bne continue    not device #0 (console)

If that value at DEVNUM is not equal to zero, the code just skips the lowercase-to-uppercase code.

uppercase
    cmpa #'a        compare A to lowercase 'a'
    blt continue    if less than, goto continue
    cmpa #'z        compare A to lowercase 'z'
    bgt continue    if greater than, goto continue
    suba #32        a = a - 32

For characters going to device #0, A will be the character to be output. This code just looks at the value of A and compares it to a lowercase ‘a’… If lower, skip doing anything else. If it wasn’t lower, it then compares it to lowercase ‘z’. If higher, skip doing anything. Only if it makes it past both checks does it subtract 32, converting ‘a’ through ‘z’ to ‘A’ through ‘Z’.
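Here is the same decision logic as a rough Python model, in case it helps to see it outside of assembly. The function names `to_signed` and `console_filter` are mine. One detail the sketch makes visible: BLT and BGT are signed branches, so byte values 128 to 255 count as negative numbers and fall out at the first check.

```python
def to_signed(byte):
    """Interpret an 8-bit value the way the signed branches BLT/BGT see it."""
    return byte - 256 if byte >= 128 else byte

def console_filter(a, devnum):
    """Return the byte that newcode would hand on to the saved hook."""
    if devnum != 0:                  # tst DEVNUM / bne continue
        return a
    if to_signed(a) < ord("a"):      # cmpa #'a / blt continue
        return a
    if to_signed(a) > ord("z"):      # cmpa #'z / bgt continue
        return a
    return a - 32                    # suba #32

print(chr(console_filter(ord("q"), 0)))    # Q
print(chr(console_filter(ord("q"), -2)))   # q
```

Only lowercase letters headed for device #0 are changed. Everything else, including the graphics characters, passes through untouched.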

Lastly, when we are done (either converting to uppercase, or skipping it because it was not the screen), we have this:

continue
savedrvec rmb 3     call regular RAM hook
    rts             just in case...

The double labels (continue and savedrvec) will be at the same location in memory. I used two of them so that I could branch to “continue”, which looked better than “bra savedrvec” or than saving the vector bytes as “continue”.

By having those three reserved (RMB) bytes right there, whatever was in the original RAM hook is copied there and it will now be executed. When it’s done, we RTS back to the ROM.

When this code is built and run, it immediately starts working. Here is a BASIC loader that will place this code in memory:

10 READ A,B
20 IF A=-1 THEN 70
30 FOR C = A TO B
40 READ D:POKE C,D
50 NEXT C
60 GOTO 10
70 END
80 DATA 16128,16165,182,1,103,183,63,38,190,1,104,191,63,39,134,126,183,1,103,142,63,24,191,1,104,57,13,111,38,10,129,97,45,6,129,122,46,2,128,32,16169,16169,57,-1,-1

If you RUN that code, you can then do a CLEAR 200,&H3F00 to protect it from BASIC, and then EXEC &H3F00 to initialize it. Nothing will appear to happen, but if you try to do something like this:

PRINT "Lowercase looks weird on a CoCo"

…you will see “LOWERCASE LOOKS WEIRD ON A COCO”. To test it further, switch to lowercase (SHIFT-0 on a real CoCo, or SHIFT-ENTER on the XRoar emulator) and type a command like LIST.

If the code is working, it should still type out as “LIST” on the screen, and then give a “?SN ERROR” since it’s really lowercase, and BASIC does not accept lowercase commands.

Neat, huh?
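The DATA line in the BASIC loader above uses a simple format: a start address, an end address, then one byte for each address in that range, repeated until a start address of -1. The Python sketch below walks the same format (the `load` function name is mine; the numbers are copied from line 80) to show where everything lands.

```python
DATA = [16128, 16165, 182, 1, 103, 183, 63, 38, 190, 1, 104, 191, 63, 39,
        134, 126, 183, 1, 103, 142, 63, 24, 191, 1, 104, 57, 13, 111, 38, 10,
        129, 97, 45, 6, 129, 122, 46, 2, 128, 32, 16169, 16169, 57, -1, -1]

def load(data):
    """Follow the loader's format: start, end, then the bytes; -1 ends it."""
    memory = {}
    i = 0
    while data[i] != -1:
        start, end = data[i], data[i + 1]
        i += 2
        for address in range(start, end + 1):
            memory[address] = data[i]
            i += 1
    return memory

memory = load(DATA)
print(hex(min(memory)), hex(max(memory)), len(memory))  # 0x3f00 0x3f29 39
print(0x3F26 in memory)                                 # False
```

The second block is there because the three savedrvec bytes at $3F26 to $3F28 are reserved but never loaded, so the final RTS at $3F29 needs its own start and end pair.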

No going back.

One warning: There is no uninstall routine. Once it’s installed, do not run it again or it will replace the modified RAM hook (that points to “newcode”) with a new RAM hook (which points to “newcode”) and then at the end of the newcode routine it will then jump to the saved RAM hook that points to “newcode”. Enjoy life in the endless loop!

To make this a better patch, ideally the code should also reserve a byte that represents “is it already installed” and check that first. The first time it’s installed, that byte will get set to some special value. If it is run again, it checks for that value first, and only installs if the value is uninitialized. It’s not perfect, but it would help prevent running this twice.

An uninstall could also be written, which would simply restore the savedrvec bytes back in the original RAM hook.

But I’ll leave that as an exercise for you, if you are bored.
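For anyone who does take up that exercise, here is one way the bookkeeping could go, modeled in Python rather than assembly. This is only a sketch under my own assumptions: `FLAG` and `MAGIC` are hypothetical (a spare byte just past the routine and an arbitrary marker value), while `RVEC3`, `NEWCODE` and `SAVED` are the addresses from the listing and loader above.

```python
RVEC3 = 0x167         # console out RAM hook
JMP_OPCODE = 0x7E
NEWCODE = 0x3F18      # where newcode lands in the loader above
SAVED = 0x3F26        # savedrvec in the loader above
FLAG = 0x3F2A         # hypothetical spare byte just past the routine
MAGIC = 0xA5          # hypothetical "already installed" marker

def install(mem):
    if mem[FLAG] == MAGIC:                 # already installed: leave it alone
        return False
    mem[SAVED:SAVED + 3] = mem[RVEC3:RVEC3 + 3]
    mem[RVEC3:RVEC3 + 3] = bytes([JMP_OPCODE, NEWCODE >> 8, NEWCODE & 0xFF])
    mem[FLAG] = MAGIC
    return True

def uninstall(mem):
    if mem[FLAG] != MAGIC:                 # nothing to undo
        return False
    mem[RVEC3:RVEC3 + 3] = mem[SAVED:SAVED + 3]
    mem[FLAG] = 0
    return True

mem = bytearray(0x4000)
mem[RVEC3:RVEC3 + 3] = b"\x39\x39\x39"     # Color BASIC default: RTS RTS RTS
print(install(mem), install(mem))          # True False
print(uninstall(mem), mem[RVEC3:RVEC3 + 3] == b"\x39\x39\x39")  # True True
```

The second install call is refused, so the saved bytes can never end up pointing back at newcode. As noted above, it is not perfect: the flag byte could happen to hold the marker value before the first install.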

Until next time…

Color BASIC RAM hooks – part 1

See Also: part 1 and part 2.

7/27/2022 NOTE: William Astle left a comment pointing out that my RAM hook example should be using a JSR instead of a JMP. This has not been updated or corrected yet, but I will.

When the CoCo was released in 1980, it came with an 8K ROM containing Color BASIC. If the CoCo was expanded to at least 16K of RAM, a second 8K ROM could be added which contained Extended Color BASIC. If a disk controller were plugged in, that controller added a third 8K ROM containing Disk Extended Color BASIC.

Each ROM provided additional features and commands to what was provided in the original 8K Color BASIC.

To allow this, Color BASIC has a series of RAM hooks that initially point to routines inside Color BASIC, but can be modified by additional ROM code to point somewhere else. For example, Extended Color BASIC would modify these RAM hooks to point to new routines provided by that ROM.

According to the disassembly in Color BASIC Unravelled, there are 25 RAM hooks starting in memory at $15E (350). Each hook is 3 bytes long, which allows it to contain a JMP instruction followed by a 16-bit address.

As an example, there is a vector for “CONSOLE OUT” at $167 (359). On a Color BASIC system, that RAM hook contains $39 $39 $39 (57 57 57). That is the opcode for an RTS instruction, so effectively it looks like this:

rts
rts
rts
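Because every hook is the same size, the address of any hook is simple arithmetic from the base. A small Python check (the constant and function names are mine), counting hooks from zero:

```python
HOOK_BASE = 0x15E   # first RAM hook, per Color BASIC Unravelled
HOOK_SIZE = 3       # one opcode byte plus a 16-bit address

def hook_address(n):
    """Address of RAM hook number n, counting from zero."""
    return HOOK_BASE + HOOK_SIZE * n

print(hex(hook_address(3)), hook_address(3))   # 0x167 359
```

So the CONSOLE OUT hook at $167 is hook number 3, which matches the RVEC3 label used in the listings.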

In Color BASIC is a subroutine called PUTCHR. Any time BASIC wants to output a character to the device (such as the screen), it loads the character in register A then calls this routine. Here is an example that outputs a question mark:

LB9AF  LDA  #'?     QUESTION MARK TO CONSOLE OUT
LB9B1  JMP  PUTCHR  JUMP TO CONSOLE OUT

The first thing this PUTCHR routine does is JSR to the RAM hook location, which is named RVEC3 (RAM hook vector 3) to run any extra code that might be needed. It looks like this:

* CONSOLE OUT
PUTCHR JSR RVEC3
       ...rest of routine...
       RTS

Since Color BASIC just had “RTS” there, PUTCHR would JSR to the RAM hook bytes and immediately return back, then continue with the output.

The code in Color BASIC knows about outputting to several device numbers. Device #0 (default) is the screen. Device #-1 is the cassette. Device #-2 is the printer.

When Extended Color BASIC came along, it added device #-3 for use with the DLOAD command. (I don’t think I ever knew this, but I did use DLOAD to download my first CoCo terminal program.) Since Color BASIC knew nothing about this device number, these RAM hooks were used to add the new functionality.

Extended Color BASIC modifies the three bytes of this RAM hook to be $7e $82 $73. That represents:

jmp $8273

This jumps to a routine in the Extended Color BASIC ROM called XVEC3 (Extended Vector 3?). This is new code to check to see if we are using the DLOAD command (which outputs over the serial port, but not as a printer).

* CONSOLE OUT RAM HOOK
XVEC3 TST  DEVNUM   CHECK DEVICE NUMBER
      LBEQ L95AC    BRANCH IF SCREEN
      PSHS B        SAVE CHARACTER
      LDB  DEVNUM   *GET DEVICE NUMBER AND
      CMPB #-3      *CHECK FOR DLOAD
      PULS B        GET CHARACTER BACK
      BNE  L8285    RETURN IF NOT DLOAD
      LEAS $02,S    *TAKE RETURN OFF STACK & GO BACK TO ROUTINE
                    *THAT CALLED CONSOLE OUT
      RTS

When Disk Extended Color BASIC is added, the vector is modified to point to a new routine called DVEC3 (Disk Vector 3?) located at $cc1c. (For Disk BASIC 1.0, it is at $cb4a.) That code will test to see if we are outputting to a disk device and, if not, it will long branch to XVEC3 in the Extended ROM. I find this curious since it would seem to imply that this location could never change, else Disk BASIC would break.

DVEC3 TST  DEVNUM   CHECK DEVICE NUMBER
      LBLE XVEC3    BRANCH TO EX BASIC IF NOT A DISK FILE
      ...rest of routine...
      RTS

Thus, with Color, Extended and Disk BASIC ROMs installed, a program wanting to output a character, such as “?”, is doing something like this:

LDA #'?
JMP PUTCHR (Color BASIC ROM)
 \
   PUTCHR JSR RVEC3 (RAM hook)
    \
      RVEC3 JMP DVEC3 (in Disk BASIC)
       \
         DVEC3 TST DEVNUM
         LBLE XVEC3 (in Extended BASIC)
           \
             XVEC3 ...handle ECB
           /
         ...handle rest of DECB...
       /
     /
   ...handle rest of CB...

…or something like that. There’s a lot of branching and jumping going on, to be sure.

So what?

This means we should also be able to make use of these RAM hooks and patch our own code in to the process. Since the only thing we can alter is the hook itself, our code has to save the original RAM hook, then point the hook to our new code. At the end of our new code, we jump to where the original RAM hook went, allowing normal operations to continue. (Or, we could have made it jump to the ROM hook first, and after the normal ROM things are done, then run our new code…)

To test this, I made a simple assembly routine to hijack the CONSOLE OUT RAM hook located at $167.

RVEC3 equ $167      console out RAM hook

    org $3f00

init:
    lda RVEC3       get op code
    sta saved       save it
    ldx RVEC3+1     get address
    stx saved+1     save it
    
    lda #$7e        op code for JMP
    sta RVEC3       store it in RAM hook
    ldx #new        address of new code
    stx RVEC3+1     store it in RAM hook

    rts             done

new:
    inc $400
saved rmb 3

In the init routine, the first thing we do is load the first byte (which would be an RTS or a JMP) from the RAM hook and save it in a 3-byte buffer.

We then load the next two bytes (which would be a 16-bit address, or RTS RTS for Color BASIC).

Next we load A with the value of a JMP op code ($7e). We store it in the first byte of the RAM hook vector.

We then do the same thing for the two byte address.

The RTS is the end of the code which hijacked the RAM hook. We have now pointed the RAM hook to our “new” routine.

At new, all we do is increment whatever is in memory location $400 (the top left character of the 32-column screen). Right after that INC is our 3-byte “saved” buffer where the three bytes that used to be in the RAM hook are saved. This makes our code do the INC and then the same three bytes that would have been done by the original RAM hook before we hijacked it.

On Color BASIC, the RAM hook starts out with 57 57 57 (RTS RTS RTS), so the new routine would appear as:

new:
   inc $400
   rts
   rts
   rts

For Extended Color BASIC, where the RAM hook is turned in to JMP $8273, it would become:

new:
   inc $400
   jmp $8273

When we EXEC the init code, our INC routine is patched in. From that point on, any output causes the top left character of the screen to increment. This will happen for ANY output, even to a printer or cassette file, since this code does not bother checking for the device type.

Here is a BASIC loader program, generated by LWTOOLS:

10 READ A,B
20 IF A=-1 THEN 70
30 FOR C = A TO B
40 READ D:POKE C,D
50 NEXT C
60 GOTO 10
70 END
80 DATA 16128,16154,182,1,103,183,63,27,190,1,104,191,63,28,134,126,183,1,103,142,63,24,191,1,104,57,124,4,0,-1,-1

Load that in to a CoCo (or emulator), and RUN it to get the code in memory starting at $3f00.

After RUN, you should be able to EXEC &H3f00 and the hook will be installed.

Now, for something real, you’d want to find a safe place to store the new hook code. At the very least, we should have a CLEAR 200,&H3f00 or similar in this program to ensure BASIC doesn’t overwrite the assembly code. This should be enough for a simple proof-of-concept.

The RAM hooks we have available include:

  • OPEN COMMAND
  • DEVICE NUMBER VALIDITY CHECK
  • SET PRINT PARAMETERS
  • CONSOLE OUT
  • CONSOLE IN
  • INPUT DEVICE NUMBER CHECK
  • PRINT DEVICE NUMBER CHECK
  • CLOSE ALL FILES
  • CLOSE ONE FILE
  • PRINT
  • INPUT
  • BREAK CHECK
  • INPUTTING A BASIC LINE
  • TERMINATING BASIC LINE INPUT
  • EOF COMMAND
  • EVALUATE AN EXPRESSION
  • RESERVED FOR ON ERROR GOTO CMD
  • ERROR DRIVER
  • RUN
  • ASCII TO FLOATING POINT CONV.
  • BASIC’S COMMAND INTERP. LOOP
  • RESET/SET/POINT COMMANDS
  • CLS
  • SECONDARY TOKEN HANDLER
  • RENUM TOKEN CHECK
  • EXBAS’ GET/PUT
  • CRUNCH BASIC LINE
  • UNCRUNCH BASIC LINE

There are many possibilities here, including adding new commands.

Until next time…

Color BASIC Attract Screen – part 6

See also: part 1, part 2, part 3, part 4, unrelated, and part 5.

War by James Garon

Just as I thought I had reached the conclusion to my epic masterpiece about classic CoCo game startup screens, Robert Gault made this post to the Color Computer mailing list:

Robert Gault robert.gault at att.net
Sat Jul 2 09:56:21 EDT 2022

An author of games for Tandy, James Garon, put some lines in the Basic game WAR which is on colorcomputerarchive.com . The code is in lines 60000 and up which can’t be LISTed with a CoCo3 but can be read with a CoCo2.  The lines in question contain PRINT commands which when listed with a CoCo2, look like the data inside the quotes have been converted into Basic commands. You can also see this if you use imgtool or wimgtool to extract the program from the .dsk image.  These lines generate PMODE graphics for the title screen.

Do anyone have a clue as to how these Basic lines work?

– Robert Gault, via CoColist

The game in question is available in tape or disk format from the Color Computer Archive:

https://colorcomputerarchive.com/search?q=War+%28Tandy%29&ww=1

And, you can even click the “Play Now” button and see it run right in your web browser. Click below and take a look at the title screen:

https://colorcomputerarchive.com/test/xroar-online/?machine=cocous&basic=RUN%22WAR%22%5cr&cart=rsdos&disk0=/unzip%3Ffile%3DDisks/Games/War%20(Tandy).zip/WAR.DSK

War by James Garon (title screen)

This title screen uses the CoCo’s lesser-known screen color, and has those iconic rotating color blocks.

And, it’s in BASIC! Well, almost. It contains an assembly language routine that rotates those colors, and it does it the same way I figured out in this article series. I am just thirty years too late with my solution.

Line 60000

60000 CLS:A$=STRING$(28,32):PRINT"RESTORERESTOREMOTORMOTOR^^SCREENSCREENDRIVEDRIVEDSKI$DSKI$!!!RESTORERESTOREMOTORMOTOR^^SCREENSCREENDRIVEDRIVEDSKI$DSKI$!!!";

As Robert mentioned, there’s some weirdness in the program starting at line 60000. When you LIST it, you get a rather odd output full of BASIC keywords and such:

WAR.BAS line 60000

The first bit looks okay… CLS to clear the screen, then A$ is created to be 28 spaces — CHR$(32). But that PRINT looks a bit … weird.

I’ve seen this trick before, and even mentioned it recently when talking about typing in an un-typable BASIC program. The idea is you can create a program that contains something in a string or PRINT statement, like:

PRINT "1234567890"

…and then you alter the bytes between those quotes somehow, such as locating them and using POKE to change them. Let’s figure out an easy way to do this.

First, I’ll start with a print statement that has characters I can look for later, such as the asterisk. I like that because it is ASCII 42. If you don’t know why I like forty-two, you must be new here. This wikipedia page will give you the details…

100 PRINT "**********"

Now we need to alter those bytes and change them to something we couldn’t normally type, such as graphics blocks (characters 128-255). To do this, we can use some code that scans from the start of the BASIC program to the end, and tries to replace the 42s with something else.

This is dangerous, since 42 could easily appear in a program as part of a keyword token or line number or other data. But, I know that a quote is CHR$(34), so I could specifically look for a series of 42s that is between two 34s.

Numbers.

So many numbers.

In Color BASIC, memory locations 25 and 26 contain the start of the BASIC program in memory. Locations 27 and 28 contain the end of the program. Code to scan that range of memory looking for a quote byte (34) that has an asterisk byte (42) after it might look like this:

0 ' CODEHACK.bas
10 S=PEEK(25)*256+PEEK(26)
20 E=PEEK(27)*256+PEEK(28)
30 L=S
40 V=PEEK(L)
50 IF V=34 THEN IF PEEK(L+1)=42 THEN 80
60 L=L+1:IF L<E THEN 40
70 END
...

We could then add code to change all the 42s encountered up until the next quote (34).

80 L=L+1:IF PEEK(L)=34 THEN END
90  POKE L,128:GOTO 80
100 PRINT "**********"

Line 80 moves to the next byte and will end if that byte is a quote.

In line 90, it uses POKE to change the character to a 128 — a black block. It continues to do this until it finds the closing quote.

If you load this program and LIST it, it looks like the code shown. But after you RUN, listing it reveals a garbled PRINT statement similar to the WAR line 60000. But, if you run that garbled PRINT statement, you get the output of black blocks:

CoCo code hack!

As you can see, line 100 changes ten asterisks to the keyword FOR ten times. I am guessing that the numeric token for “FOR” is 128.

Color BASIC Unravelled

…and my guess was correct! According to Color BASIC Unravelled, the token for FOR is hex 80, which is 128. Perfect. So the BASIC “LIST” routine is dumb, and tries to detokenize things even if they are surrounded by quotes. Interesting.

At this point, if you were to EDIT that line, it would detokenize it to be…

100 PRINT "FORFORFORFORFORFORFORFORFORFOR"

…and if you saved your edit, you’d now have a line that would print exactly that: no longer ten black blocks, but the word FOR ten times, as if you’d typed them all in.

This makes these changes uneditable. BUT, once the modification code has been run, you can delete it, then SAVE/CSAVE the modified program. When it loads back up, it will have those changes.
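The scan-and-replace idea from CODEHACK can be tried out on a plain byte string as well. The Python sketch below follows the same steps as lines 40 to 90 (find a quote followed by the marker byte, then overwrite everything up to the closing quote). The `patch_first_string` name and its parameters are mine, and the sample bytes are made up rather than a real tokenized program.

```python
QUOTE = 34

def patch_first_string(program, marker=42, replacement=128):
    """Overwrite the first quoted run that starts with `marker`."""
    data = bytearray(program)
    for pos in range(len(data) - 1):
        if data[pos] == QUOTE and data[pos + 1] == marker:
            i = pos + 1
            while i < len(data) and data[i] != QUOTE:   # stop at closing quote
                data[i] = replacement
                i += 1
            break
    return bytes(data)

print(patch_first_string(b'X"**"Y'))   # b'X"\x80\x80"Y'
```

Like the BASIC version, it only trusts a marker byte that comes right after a quote, and it stops after patching one string.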

In the case of WAR line 60000, it’s a PRINT that shows a series of color blocks used for the top of the attract screen.

Here is the garbled output of lines 60000 on in the WAR.BAS program:

60000 CLS:A$=STRING$(28,32):PRINT"RESTORERESTOREMOTORMOTOR^^SCREENSCREENDRIVEDRIVEDSKI$DSKI$!!!RESTORERESTOREMOTORMOTOR^^SCREENSCREENDRIVEDRIVEDSKI$DSKI$!!!";
60005 FORI=1TO8:GOSUB60028:NEXT:FORI=1TO6:GOSUB60028:NEXT
60010                             PRINT"MOTORMOTORRESTORERESTORE!!!DSKI$DSKI$DRIVEDRIVESCREENSCREEN^^MOTORMOTORRESTORERESTORE!!!DSKI$DSKI$DRIVEDRIVESCREENSCREEN^";
60020 POKE1535,175:T$="WAR!":PRINT@99,"A YOUNG PERSON'S CARD GAME";:PRINT@80-LEN(T$)/2,T$;:PRINT@175,"BY";:PRINT@202,"JAMES  GARON";:PRINT@263,"COPYRIGHT (C) 1982";:PRINT@298,"DATASOFT INC.";:PRINT@389,"LICENSED TO TANDY CORP.";:SCREEN0,1
60025 GOSUB60030:I$=INKEY$:FORI=1TO300:FORJ=1TO30:NEXT:IFINKEY$=""THENEXECV:NEXT:RETURNELSERETURN
60028 PRINTSTRING$(2,127+16*(9-I))TAB(30)STRING$(2,127+16*I);:RETURN
60030 A$="RUN!SUBELSE,NEXTENDFORTHENFORDIM/!9"
60060 V=VARPTR(A$):V=PEEK(V+2)*256+PEEK(V+3):RETURN

Line 60005 does two FOR/NEXT loops and both GOSUB 60028, so we’ll look at that next.

Line 60028

60028 PRINTSTRING$(2,127+16*(9-I))TAB(30)STRING$(2,127+16*I);:RETURN

This routine prints solid colored blocks on the left and right side of the screen, based on the value of I.

WAR.BAS line 60028

The two FOR loops are used to fill the entire screen. There are only 8 colors, so you can’t just do one loop from 1 to 14.
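As a quick check of the arithmetic in line 60028, this small Python sketch (the function name is my own) computes the two character codes that get printed for each value of I:

```python
def border_codes(i: int) -> tuple:
    """Character codes that line 60028 prints for loop value I:
    (left side, right side)."""
    return 127 + 16 * (9 - i), 127 + 16 * i

for i in range(1, 9):
    left, right = border_codes(i)
    print(i, left, right)
```

For I from 1 to 8, both values stay in the solid-block codes 143 through 255 in steps of 16, with the left side counting down while the right side counts up.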

Line 60010

60010                             PRINT"MOTORMOTORRESTORERESTORE!!!DSKI$DSKI$DRIVEDRIVESCREENSCREEN^^MOTORMOTORRESTORERESTORE!!!DSKI$DSKI$DRIVEDRIVESCREENSCREEN^";

Look familiar? It’s just like the PRINT in line 60000, though the pattern of the blocks is different, and it is only printing 31 characters. Since this one prints on the bottom line of the screen, printing all the way to the bottom right position would make the screen scroll up one line.

WAR.BAS line 60010

Line 60020

60020 POKE1535,175:T$="WAR!":PRINT@99,"A YOUNG PERSON'S CARD GAME";:PRINT@80-LEN(T$)/2,T$;:PRINT@175,"BY";:PRINT@202,"JAMES  GARON";:PRINT@263,"COPYRIGHT (C) 1982";:PRINT@298,"DATASOFT INC.";:PRINT@389,"LICENSED TO TANDY CORP.";:SCREEN0,1

More normal code… The POKE 1535,175 is what fills in the bottom right block of the screen, where PRINT did not go to. 175 is a blue block.

After this are just normal PRINT@ statements to put the text on the screen.

At the end, SCREEN 0,1 puts the CoCo in to its alternate screen color of pink/red/orange/whatever color that is.

Line 60025

60025 GOSUB60030:I$=INKEY$:FORI=1TO300:FORJ=1TO30:NEXT:IFINKEY$=""THENEXECV:NEXT:RETURNELSERETURN

This line is normal code, but the use of EXEC tells us some assembly language is being used.

The first thing it does is GOSUB 60030, then it gets any waiting keypress in I$. I don’t see I$ used so I’m unsure of this purpose.

Next it does two FOR/NEXT loops. The first “FOR I” appears to be the number of times to do this routine. The second “FOR J/NEXT” is just a timing delay.

After this is a direct check for any waiting key by using INKEY$ directly. If no key is pressed, “EXEC V” is done… This would execute whatever machine language routine is loaded in to memory at wherever V is set to. But what is V? That must be the GOSUB 60030, which we will discuss after this.

After the NEXT (FOR I) is “RETURN ELSE RETURN”. That way it does a return whether or not the IF INKEY$ is true. Since either way will RETURN, with nothing executed after this line, this might have also worked (extra spaces added for readability):

60025 GOSUB 60030:I$=INKEY$:FOR I=1 TO 300:FOR J=1 TO 30:NEXT:IF INKEY$=""THEN EXEC V:NEXT
60026 RETURN

But James Garon seems to know his stuff. Each line number takes up 5 bytes of space. The three keywords “RETURN ELSE RETURN” (without spaces) only take up four. (I’m guessing RETURN is a one byte token and ELSE is a two byte token.)

So indeed, the odd “RETURN ELSE RETURN” is less memory than putting one RETURN on the next line. (In my Benchmarking BASIC articles, I’ve focused on speed versus space, so perhaps I’ll have to do a series on “Making BASIC Smaller” some time…)

This leads us to the GOSUB 60030.

Lines 60030-60060

60030 A$="RUN!SUBELSE,NEXTENDFORTHENFORDIM/!9"
60060 V=VARPTR(A$):V=PEEK(V+2)*256+PEEK(V+3):RETURN

Line 60030 creates a string, but the string seems reminiscent of those PRINT lines seen earlier. Since we don’t see anything using this string, it’s probably not for displaying block graphics characters.

We do see the use of VARPTR(A$) on the next line. VARPTR returns the “variable pointer” for a specified variable. I’ve discussed VARPTR in earlier articles, but the Getting Started with Color BASIC manual describes it as:

VARPTR (var) Returns addresses of pointer to the specified variable.

– Getting Started with Color BASIC

When using VARPTR on a numeric (floating point) variable, it returns the location of the five bytes that make up the number, somewhere in variable space.

But strings are different. Strings live in separate string memory (reserved by using the CLEAR command, with a default of 200 bytes), or they could be contained in the program code itself. See my String Theory series for more on that. With a string, the VARPTR points to five bytes that point to where the string data is contained.

In part 3 of my DEFUSR series, I describe it as:

The first byte where the string is stored will be the size of that string:

A$="THIS IS A STRING IN MEMORY"
X = VARPTR(A$)
PRINT "A$ IS LOCATED AT";X
PRINT "A$ IS";PEEK(X);"LONG"

I forget what the second byte is used for, but bytes three and four are the actual address of the string character data:

PRINT "STRING DATA IS AT";PEEK(X+2)*256+PEEK(X+3)

– Interfacing assembly with BASIC via DEFUSR, part 3

This looks familiar, since the VARPTR(A$) in this code then gets the address of the string by PEEKing bytes three and four:

60060 V=VARPTR(A$):V=PEEK(V+2)*256+PEEK(V+3):RETURN

This tells me A$ contains a machine language routine. The GOSUB to this routine returns the address of the bytes between the quotes of the A$=”…” then that location is EXECuted to run whatever the routine does.
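To make the descriptor lookup concrete, here is a minimal Python sketch of what line 60060 is doing, run against a made-up memory image. The addresses (100 and 0x1E2D), the sample string and the function name are assumptions for illustration only; on a real CoCo the values come from wherever BASIC stored A$.

```python
def string_address(memory, descriptor: int):
    """Return (length, data address) from a 5-byte string descriptor,
    mirroring PEEK(V) and PEEK(V+2)*256+PEEK(V+3)."""
    length = memory[descriptor]
    address = memory[descriptor + 2] * 256 + memory[descriptor + 3]
    return length, address

# Made-up memory image: descriptor at 100, string data at 0x1E2D.
memory = bytearray(0x2000)
text = b"HELLO"
memory[0x1E2D:0x1E2D + len(text)] = text
memory[100:105] = bytes([len(text), 0, 0x1E, 0x2D, 0])

length, v = string_address(memory, 100)
print(length, hex(v), bytes(memory[v:v + length]))
```

In WAR.BAS the address that comes back is what gets handed to EXEC.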

To figure this one out is much simpler, since no PEEKing of BASIC code is needed. It’s in a string, so we can just print the ASC() value of each byte in the string:

60030 A$="RUN!SUBELSE,NEXTENDFORTHENFORDIM/!9"
60031 FOR I=1 TO LEN(A$):PRINT ASC(MID$(A$,I,1));:NEXT:END

By doing a RUN 60030 I then get a list of bytes that are inside of that string:

142  3  255  48  1  166  132  44  4  139  16  138  128  167  128  140  6  1  47  241  57

I recognize 57 as the op code for an RTS instruction, so this does look like it’s machine code. And while I could use some 6809 data sheet and look up each of those bytes to figure out what they are, I’d rather have something else do the work for me.

Online 6809 Simulator to the rescue!

At www.6809.uk is an incredible 6809 simulator that I recently wrote about. It lets you paste in assembly code and run it in a debugger, showing the registers, op codes, etc. To get these bytes in to the emulator, I just turned them in to a stream of DATA using the “fcb” command in the simulator’s assembler:

routine fcb 142,3,255,48,1,166,132,44
        fcb 4,139,16,138,128,167,128,140,6
        fcb 1,47,241,57

By pasting that in to the simulator’s “Assembly language input” box and then clicking “Assemble source code“, the code is built and then displays on the left side in the debugger, complete with op codes:

6809.uk

Now I can see the code is:

4000:	8E 03 FF      	LDX #$03FF
4003:	30 01         	LEAX $01,X
4005:	A6 84         	LDA ,X
4007:	2C 04         	BGE $400D
4009:	8B 10         	ADDA #$10
400B:	8A 80         	ORA #$80
400D:	A7 80         	STA ,X+
400F:	8C 06 01      	CMPX #$0601
4012:	2F F1         	BLE $4005
4014:	39            	RTS

Since I did not give any origin address for where this code should go, the simulator used hex 4000. There are two branch instructions that refer to memory locations, so I’ll change those to labels and convert this just to the source code. I’ll even include some comments:

    LDX #$03FF  * X points to 1023, one byte before screen memory
    LEAX $01,X  * Increment X by 1 so it is now 1024
L4005
    LDA ,X      * Load A with the byte pointed to by X
    BGE L400D   * If that byte is not a graphics char, branch to L400D
    ADDA #$10   * Add hex 10 (16) to the character, next color up
    ORA #$80    * Set the high bit (in case value rolled over)
L400D
    STA ,X+     * Store (possibly changed) value back at X and increment X.
    CMPX #$0601 * Compare X to two bytes past end of screen
    BLE L4005   * If X is less than or equal to that, branch to L4005
    RTS         * Return

This code scans each byte of the 32 column screen. If the character there has the high bit set, it is a graphics character (128-255 range). It adds 16 to the value, which moves it to the next color (16 possible combinations of a 2×2 character block for 8 colors). It stores the value back to the screen (either the original, or one that has been shifted) and then increments X to the next screen position. If X is still at or below $0601, two bytes past the last screen byte, it goes back and does it again.

Hmm, it seems this routine keeps going past the end of screen memory: because BLE also branches when X equals $0601, it processes $0600 and $0601 too, and would change either byte if it contained a value in the 128-255 range. Hopefully nothing important is stored there. (Am I right?)
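To double-check where the loop stops, here is a small Python walk-through of the disassembled routine against a made-up memory array. It is only a sketch: it models just the instructions in this listing, and the function names and test values are my own.

```python
def to_signed16(value: int) -> int:
    """Interpret a 16-bit value as signed, the way BLE sees it."""
    return value - 0x10000 if value & 0x8000 else value

def run_war_routine(memory: bytearray) -> list:
    """Step through the disassembled loop; return the addresses it stores to."""
    touched = []
    x = 0x03FF                      # LDX #$03FF
    x += 1                          # LEAX $01,X
    while True:
        a = memory[x]               # LDA ,X
        if a & 0x80:                # BGE falls through only if the high bit is set
            a = (a + 0x10) & 0xFF   # ADDA #$10 (8-bit wraparound)
            a |= 0x80               # ORA #$80
        memory[x] = a               # STA ,X+
        touched.append(x)
        x += 1
        # CMPX #$0601 / BLE: keep looping while X <= $0601 (signed compare)
        if not to_signed16(x) <= to_signed16(0x0601):
            break
    return touched

memory = bytearray(0x0800)
memory[0x0400] = 143                # green solid block, first screen byte
memory[0x05FF] = 255                # last color, last screen byte
touched = run_war_routine(memory)
print(hex(touched[0]), hex(touched[-1]), len(touched))
print(memory[0x0400], memory[0x05FF])
```

In this sketch the first address stored to is $0400 and the last is $0601, so the loop handles 514 bytes rather than the 512 that make up the screen.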

And, for folks more used to BASIC, here is the same code with BASIC-style comments:

    LDX #$03FF  * X=&H3FF (1023, byte before screen)
    LEAX $01,X  * X=X+1
L4005
    LDA ,X      * A=PEEK(X)
    BGE L400D   * IF A<128 GOTO L400D
    ADDA #$10   * A=A+&H10:IF A>&HFF THEN A=A-&H100
    ORA #$80    * A=A OR &H80
L400D
    STA ,X+     * POKE X,A:X=X+1
    CMPX #$0601 * Compare X to &H601 (two bytes past end)
    BLE L4005   * IF X<=&H601 GOTO L4005
    RTS         * RETURN

My BASIC-style comments don’t exactly match what happens in assembly. For example, “BGE” means “Branch if Greater than or Equal”, but there was no compare instruction before it. In this case, it’s branching based on what was just loaded in to the A register, as if that value had been compared to 0. That looks odd, but BGE treats the value as signed: instead of the byte representing 0 to 255, it represents -128 to 127, with the high bit set to indicate the value is negative. So, if the high bit is set, it’s a negative value, and the BGE would NOT branch. Fun.
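The signed-versus-unsigned view of a byte is easy to poke at in Python. This is just a sketch with function names of my own choosing; it models the one case used here, a BGE that directly follows a load.

```python
def to_signed8(value: int) -> int:
    """Interpret an unsigned byte (0-255) as a signed two's-complement value."""
    return value - 256 if value & 0x80 else value

def bge_after_load(value: int) -> bool:
    """After LDA, BGE branches when the loaded byte is >= 0 as a signed value."""
    return to_signed8(value) >= 0

for byte in (0, 96, 127, 128, 143, 255):
    print(byte, to_signed8(byte), bge_after_load(byte))
```

Every value from 128 up comes out negative, so the branch is not taken for graphics characters and the color change runs.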

If you run this BASIC program, it will BASICally do the same thing … just much slower:

100 X=&H3FF
110 X=X+1
120 A=PEEK(X)
130 IF A<128 GOTO 160
140 A=A+&H10:IF A>&HFF THEN A=A-&H100
150 A=A OR &H80
160 POKE X,A:X=X+1
170 'Compare X to &H601
180 IF X<=&H601 GOTO 120
190 RETURN

So that is the magic to how this attract screen works so fast. It uses POKEd PRINT statements to quickly print the top and bottom of the screen, FOR/NEXT loops to print the sides, then this assembly routine to shift the colors and make them rotate around the screen.

And, this code is very similar to the routine I came up with earlier in this series:

start
    ldx #1024   X points to top left of 32-col screen
loop
    lda ,x+     load A with what X points to and inc X
    bpl skip    if <128 (high bit clear), skip
    adda #16    add 16, changing to next color
    ora #$80    make sure high gfx bit is set
    sta -1,x    save at X-1
skip
    cmpx #1536  compare X with first byte past end of screen
    bne loop    if not there, repeat
    sync        wait for screen sync
    rts         done

My routine starts with X at 1024, then uses BPL instead of BGE. I also increment X after I load the character, and I only update it if it gets modified by storing it 1 before where X is then.

At first, I thought mine was more clever. But I decided to see what LWASM said.

Counting cycles for fun and profit. Or just more speed.

LWASM can be made to display a listing of the assembled program and, optionally, list how many cycles each line will take. And, optionally optionally, keep a running total of cycles between places in the code you mark. (I have another article about how to use this.)

Here is my version, with (cycles) in parens, and running totals in the column next to it.

        start
(3)         ldx #1024
                loop
(5)     5           lda ,x+
(3)     8           bpl skip
(2)     10          adda #16
(2)     12          ora #$80
(5)     17          sta -1,x
                skip
(3)     20          cmpx #1536
(3)     23          bne loop
(4)         rts

My routine uses 23 cycles from “loop” to “bne loop”.

Here is the routine from WAR.BAS:

(3)         LDX #$03FF
(5)         LEAX $01,X
                L4005
(4)     4           LDA ,X 
(3)     7           BGE L400D
(2)     9           ADDA #$10
(2)     11          ORA #$80
                L400D
(5)     16          STA ,X+
(3)     19          CMPX #$0601
(3)     22          BLE L4005
(4)         RTS

This one appears to use one cycle less (22) in its loop. Nice! Even though I thought it would be worse, once again, James Garon appears to know his stuff.
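The running totals in those listings follow the path where the byte is a graphics character. As a rough cross-check, this Python sketch adds up the per-instruction counts copied from the two listings above, once for that path and once for the path where the branch skips ahead. It assumes a branch costs the same listed count whether or not it is taken, which the listing itself does not spell out.

```python
# Per-instruction counts copied from the two listings above.
MINE = [("lda ,x+", 5), ("bpl skip", 3), ("adda #16", 2), ("ora #$80", 2),
        ("sta -1,x", 5), ("cmpx #1536", 3), ("bne loop", 3)]
WAR = [("LDA ,X", 4), ("BGE L400D", 3), ("ADDA #$10", 2), ("ORA #$80", 2),
       ("STA ,X+", 5), ("CMPX #$0601", 3), ("BLE L4005", 3)]

def total(listing, skipped=()):
    """Sum the cycle counts, leaving out any instructions that were skipped."""
    return sum(cycles for name, cycles in listing if name not in skipped)

mine_graphics = total(MINE)
war_graphics = total(WAR)
# When the branch is taken, the instructions before its target label are skipped.
mine_text = total(MINE, skipped=("adda #16", "ora #$80", "sta -1,x"))
war_text = total(WAR, skipped=("ADDA #$10", "ORA #$80"))
print(mine_graphics, war_graphics, mine_text, war_text)
```

Under that assumption the graphics path matches the totals above (23 versus 22), while the skip path comes out at 14 for my routine and 18 for the WAR routine, since mine also skips the store. Which one wins overall would depend on how much of the screen is graphics blocks.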

I think his routine may be a bit larger, and I wondered why he started X one byte before the screen memory and then incremented it. That seems wasteful.

However, William “Lost Wizard” Astle saw exactly why in a reply on the CoCo mailing list:

He uses that sequence instead of the more obvious one to avoid having a NUL byte in the code. A NUL would cause the interpreter to think it’s the end of the line and break things.

– William Astle via CoCo mailing list

It took me a moment, but I think I understand now. If he had done “LDX #$0400”, the byte sequence would have been whatever byte is LDX, followed by $04 and $00. You can’t put a 0 in a string, or BASIC will think that is the end of the line. Any assembly encoded this way must avoid using a zero. This is also the reason he doesn’t compare to the byte past the end of the screen, which is $0600. Though, I expect comparing to 1535 and using “Branch if less than OR equal to” would have worked and avoided the zero.

But James Garon knows his stuff, so I had to see if my way was larger or slower:

(3)         cmpx #1535
(3)         ble loop
        
(3)         CMPX #$0601
(3)         BLE L4005

Well, they look the same speed, and neither generates a zero byte in the machine code. I don’t know why he does it that way.
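Checking for zero bytes by eye is error-prone, so here is a small Python sketch that does it. The first list is the byte dump of the WAR string from earlier, the second is the DATA from the BASIC loader for my routine, and the two CMPX encodings use opcode $8C (the 140 seen in both dumps) followed by the 16-bit value, high byte first. The variable and function names are my own.

```python
WAR_ROUTINE = bytes([142, 3, 255, 48, 1, 166, 132, 44, 4, 139, 16,
                     138, 128, 167, 128, 140, 6, 1, 47, 241, 57])
MY_ROUTINE = bytes([142, 4, 0, 166, 128, 42, 6, 139, 16, 138, 128,
                    167, 31, 140, 6, 0, 38, 241, 19, 57])

def zero_offsets(code: bytes) -> list:
    """Return the offsets of any zero bytes in a block of machine code."""
    return [i for i, b in enumerate(code) if b == 0]

# CMPX immediate: opcode $8C, then the operand high byte and low byte.
cmpx_1535 = bytes([0x8C, 1535 >> 8, 1535 & 0xFF])
cmpx_0601 = bytes([0x8C, 0x0601 >> 8, 0x0601 & 0xFF])
print(zero_offsets(WAR_ROUTINE), zero_offsets(MY_ROUTINE))
print(zero_offsets(cmpx_1535), zero_offsets(cmpx_0601))
```

The WAR routine has no zero bytes, while my version has two (from the 1024 and the 1536), so mine could not be embedded in a string as-is. Neither CMPX encoding contains a zero.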

Any thoughts?

BONUS!

If one were to patch the Color BASIC “UNCRUNCH” routine to show things between quotes, here is what those lines would look like… (And if one did such a patch, I expect they’d be writing a future article about it…)

Conclusion

For some reason, James Garon chose to embed assembly code and graphics characters like this, rather than using DATA statements and building strings or POKEing assembly bytes in to memory somewhere.

It’s cool to see. But unless someone knows James Garon, I guess we don’t know why this method was done.

Other than “because it’s cool,” which is always a good reason to do something when programming.

Until next time…

Color BASIC Attract Screen – part 5

See also: part 1, part 2, part 3, part 4, unrelated, and part 5.

In part 4 of this series, Jason Pittman provided several variations of creating the attract screen:

Jason Pittman variation #1

If those four corners bother you, then my attempt will really kick in that OCD when you notice how wonky the colors are moving…

Jason Pittman
10 CLS0:C=143:PRINT@268,"ATTRACT!";
20 FOR ZZ=0TO1STEP0:FORX=0TO15:POKEX+1024,C:POKEX+1040,C:POKE1535-X,C:POKE1519-X,C:POKE1055+(32*X),C:POKE1472-(32*X),C:GOSUB50:NEXT:GOSUB50:NEXT
50 C=C+16:IF C>255 THEN C=143
60 RETURN

Jason Pittman variation #2

Also, another option using the substrings might be to fill the sides by printing two-character strings on the 32nd column so that a character spills over to the first column of the next line:

Jason Pittman
10 CLS 0:C=143:OF=1:CH$=""
20 FOR X=0TO40:CH$=CH$+CHR$(C):GOSUB 90:NEXT
30 FOR ST=0TO1STEP0
40 PRINT@0,MID$(CH$,OF,31):GOSUB 120
50 FORX=31TO480STEP32:PRINT@X,MID$(CH$,OF,2);:GOSUB 120:NEXT
60 PRINT@481,MID$(CH$,OF,30);:GOSUB120
70 NEXT
80 REM ADVANCE COLOR
90 C=C+16:IF C>255 THEN C=143
100 RETURN
110 REM ADVANCE OFFSET
120 OF=OF+2:IF OF>7 THEN OF=OF-8
130 RETURN

Jason Pittman variation #3

One more try at O.C.D-compliant “fast”:

Jason Pittman
10 DIM CL(24):FORX=0TO7:CL(X)=143+(X*16):CL(X+8)=CL(X):CL(X+16)=CL(X):NEXT
20 CLS0:FORXX=0TO1STEP0:FORYY=0TO7:FORZZ=1TO12:READPO,CT,ST,SR:FOR X=SRTOSR+CT-1:PO=PO+ST:POKE PO,CL(X+YY):NEXT:NEXT:RESTORE:NEXT:NEXT
180 REM POSITION,COUNT,STEP,START
190 DATA 1024,8,1,0,1032,8,1,0,1040,8,1,0,1048,6,1,0,1055,8,32,6,1311,6,32,6,1535,8,-1,4,1527,8,-1,4,1519,8,-1,4,1511,6,-1,4,1504,8,-32,2,1248,6,-32,2

The #3 variation using DATA statements is my favorite due to its speed. Great work!

The need for speed: Some assembly required.

It seems clear that even the fastest BASIC tricks presented so far are still not as fast as an attract screen really needs to be. When this happens, assembly code is the solution. There are also at least two C compilers for Color BASIC that I need to explore, since writing stuff in C would be much easier for me than 6809 assembly.

Shortly after part 4, I put out a plea for help with some assembly code that would rotate graphical color blocks on the 32 column screen. William “Lost Wizard” Astle answered that plea, so I’ll present the updated routine in his LWASM 6809 assembler format instead of as an EDTASM+ screen shot as in the original article.

* lwasm attract32.asm -fbasic -oattract32.bas --map

    org $3f00

start
    ldx #1024   X points to top left of 32-col screen
loop
    lda ,x+     load A with what X points to and inc X
    bpl skip    if <128 (high bit clear), skip
    adda #16    add 16, changing to next color
    ora #$80    make sure high gfx bit is set
    sta -1,x    save at X-1
skip
    cmpx #1536  compare X with first byte past end of screen
    bne loop    if not there, repeat
    sync        wait for screen sync
    rts         done

    END

The code will scan all 512 bytes of the 32-column screen, and any byte that has the high bit set (indicating it is a graphics character) will be incremented to the next color. This would allow us to draw our attract screen border one time, then let assembly cycle through the colors.

How it works:

  • The X register is loaded with the address of the top left 32-column screen.
  • The A register is loaded with the byte that X points to, then X is incremented.
  • BPL is used to skip any bytes that do not have the high bit set. This optimization was suggested by William. An 8-bit value can be treated as an unsigned value from 0-255, or as a signed value of -128 to 127. A signed byte uses the high bit to indicate a negative. Thus, a positive number would not have the high bit set (and is therefore not in the 128-255 graphics character range).
  • If the high bit was set, then 16 is added to A.
  • ORA is used to set the high bit, in case it was in the final color range (240-255) and had 16 added to it, turning it in to a non-graphics character. Setting the high bit changes it from 0-15 to 128-143.
  • The modified value is stored back at one byte before where X now points. (This was another William optimization, since originally I was not incrementing X until after the store, using an additional instruction to do that.)
  • Finally, we compare X to see if it has passed the end of screen memory.
  • If it hasn’t, we do it all again.
  • Finally, we have a SYNC instruction, that waits for the next screen interrupt. This is not really necessary, but it prevents flickering of the screen if the routine is being called too fast. (I’m not 100% sure if this should be here, or at the start of the code.)
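The color step described in the list above can be sketched in a few lines of Python. This only models what happens to one screen byte (the function name is my own), not the loop or the SYNC.

```python
def next_color(byte: int) -> int:
    """Apply the loop body to one screen byte."""
    if byte < 128:                 # BPL: high bit clear, not a graphics block
        return byte
    byte = (byte + 16) & 0xFF      # ADDA #16 with 8-bit wraparound
    return byte | 0x80             # ORA #$80 keeps it a graphics character

value = 143                        # solid block in the first color
cycle = []
for _ in range(8):
    cycle.append(value)
    value = next_color(value)
print(cycle, value)
```

Starting from 143, eight calls walk through all eight colors and land back on 143, and a text character such as 96 is returned untouched.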

The LWASM assembler has an option to generate a BASIC program full of DATA statements containing the machine code. You can then type that program in and RUN it to get this routine in memory. The command line to do this is in the first comment of the source code above.

10 READ A,B
20 IF A=-1 THEN 70
30 FOR C = A TO B
40 READ D:POKE C,D
50 NEXT C
60 GOTO 10
70 END
80 DATA 16128,16147,142,4,0,166,128,42,6,139,16,138,128,167,31,140,6,0,38,241,19,57,-1,-1

The program loads at $3f00 (16128), meaning it would only work on a 16K+ system. There is no requirement for that much memory, and it could be loaded anywhere else (even on a 4K system). The machine code itself is only 20 bytes. Since the code was written to be position independent (using relative branch instructions instead of hard-coded jump instructions), you could change where it loads just by altering the first two numbers in the DATA statement (start address, end address).
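Here is a rough Python sketch of how that DATA layout can be read and relocated. It assumes the format implied by the loader: a start address, an end address, the bytes for that range, and a final -1,-1 marker. The function names are made up for illustration.

```python
DATA = [16128, 16147, 142, 4, 0, 166, 128, 42, 6, 139, 16, 138, 128,
        167, 31, 140, 6, 0, 38, 241, 19, 57, -1, -1]

def parse_blocks(data):
    """Split loader DATA into (start, end, bytes) blocks, stopping at -1,-1."""
    blocks, pos = [], 0
    while data[pos] != -1:
        start, end = data[pos], data[pos + 1]
        size = end - start + 1
        blocks.append((start, end, data[pos + 2:pos + 2 + size]))
        pos += 2 + size
    return blocks

def relocate(data, new_start):
    """Return new DATA with every block shifted so the first starts at new_start."""
    out = []
    blocks = parse_blocks(data)
    shift = new_start - blocks[0][0]
    for start, end, code in blocks:
        out += [start + shift, end + shift] + list(code)
    return out + [-1, -1]

start, end, code = parse_blocks(DATA)[0]
moved = relocate(DATA, 4076)
print(len(code), moved[:2])
```

Only the two address values change. The 20 code bytes stay exactly the same, which is the payoff of writing the routine to be position independent.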

For instance, on a 4K CoCo, memory is from 0 to 4095. Since the assembly code only uses 20 bytes, one could load it at 4076, and use CLEAR 200,4076 to make sure BASIC doesn’t try to overwrite it. However, I found that the SYNC instruction hangs the 4K CoCo, at least in the emulator I am using, so to run on a 4K system you would have to remove that.

Here is the BASIC program modified for 4K. I added a CLEAR to protect the code from being overwritten by BASIC, changed the start and end addresses in the data statements, and altered the SYNC code to be an RTS (changing SYNC code of 19 to a 57, which I knew was an RTS because it was the last byte of the program in the DATA statements). This means it is wasting a byte, but here it is:

5 CLEAR 200,4076
10 READ A,B
20 IF A=-1 THEN 70
30 FOR C = A TO B
40 READ D:POKE C,D
50 NEXT C
60 GOTO 10
70 END
80 DATA 4076,4095,142,4,0,166,128,42,6,139,16,138,128,167,31,140,6,0,38,241,57,57,-1,-1

Using the code

Lastly, here is an example that uses this routine. I’ll use the BASIC loader for the 16K version, then add Jason’s variation #1 to it, modified by renumbering it to start at line 100, and removing the outer infinite FOR ZZ loop so it only draws once. I’ll then add a GOTO loop that just executes this assembly routine over and over.

5 CLEAR 200,16128
10 READ A,B
20 IF A=-1 THEN 70
30 FOR C = A TO B
40 READ D:POKE C,D
50 NEXT C
60 GOTO 10
70 GOTO 100
80 DATA 16128,16147,142,4,0,166,128,42,6,139,16,138,128,167,31,140,6,0,38,241,19,57,-1,-1
100 CLS0:C=143:PRINT@268,"ATTRACT!";
120 FORX=0TO15:POKEX+1024,C:POKEX+1040,C:POKE1535-X,C:POKE1519-X,C:POKE1055+(32*X),C:POKE1472-(32*X),C:GOSUB150:NEXT:GOSUB150
130 EXEC 16128:GOTO 130
150 C=C+16:IF C>255 THEN C=143
160 RETURN

And there you have it! An attract screen for BASIC that uses assembly so it’s really not a BASIC attract screen at all except for the code that draws it initially using BASIC.

I think that about covers it. And, this routine also looks cool on normal 32-column VDG graphics screens, too, causing the colors to flash as if there is palette switching in use. (You can actually palette switch the 32-column screen colors on a CoCo 3.)

Addendum: WAR by James Garon

On 7/2/2022, Robert Gault posted to the CoCo list a message titled “Special coding in WAR“. He mentioned some embedded data inside this BASIC program. You can download it as a cassette or disk image here:

https://colorcomputerarchive.com/search?q=War+%28Tandy%29&ww=1

You can even go to that link and click “Play Now” to see the game in action.

I found this particularly interesting because this BASIC program starts with one of the classic CoCo attract screens this article series is about. In the program, the author did two tricks: One was to embed graphics characters in a PRINT statement, and the other was to embed a short assembly language routine in a string that would cycle through the screen colors, just like my approach! I feel my idea has been validated, since it was already used by this game in 1982. See it in action:

https://colorcomputerarchive.com/test/xroar-online/?machine=cocous&basic=RUN%22WAR%22%5cr&cart=rsdos&disk0=/unzip%3Ffile%3DDisks/Games/War%20(Tandy).zip/WAR.DSK

And if you are curious, the code in question starts at line 60000. I did a reply about this on the CoCo mailing list as I dug in to what it is doing. That sounds like it might make a part 6 of this series…

Until next time…

Robert Gault’s EDTASM+ for native CoCo 6809/6309 assembling.

Radio Shack introduced the Color Computer in 1980. It came with 4K of RAM, and Microsoft Color BASIC in the ROM. It could be expanded to 16K RAM, which allowed adding a second ROM for Extended BASIC. A plug-in disk interface cartridge came later, with its own ROM containing Disk BASIC.

I’ve often wondered what Microsoft used to write the CoCo BASIC ROMs in.

EDTASM+

Around 1982, Radio Shack released the EDTASM+ ROM-Pak for the Color Computer. It was a 6809 assembler for machine language, as well as a debugger. It could load and save files (source code and final binaries) to cassette tape.

There was also Disk EDTASM+, which added some extra features — though the most important one was probably that it could load and save to a disk drive, making that process far faster.

Someone put up a nice EDTASM+ information page on my CoCoPedia.com website.

Since Microsoft created EDTASM, I suspect it may have been (or was at least based on) the tool they used for writing the Color Computer ROMs.

If you want to see it in action, head over to the JS Mocha CoCo emulator where you will find it available from a menu:

http://www.haplessgenius.com/mocha/

The EDTASM+ ROM-Pak and Bill Barden’s Color Computer Assembly Language Programming book were how I learned 6809 assembly. I later used Disk EDTASM+.

EDTASM++?

While the CoCo 1 and 2 were basically the same machine, just with redesigned circuit boards and enclosures, the 1986 CoCo 3 was quite different. It could operate in a double speed mode, and provided up to 80 columns of text versus the original CoCo’s 32 columns. It also came with 128K (double what the CoCo 1/2 could handle) and could be expanded to 512K (though third party folks figured out how to do 1 and 2 megabyte upgrades).

Unfortunately, Radio Shack never released an update to the EDTASM+ ROM-Pak or disk software. It was still limited to the small memory and screen size of the original 1980 CoCo hardware.

Folks came up with various patches. I had one that patched my Disk EDTASM+ to run in 80 columns on the CoCo 3, in double speed mode (faster assembling!) while setting the disk drive step rate to 6ms. It was a much nicer experience coding with long lines!

After this I moved on to OS-9, and used the Microware assemblers (asm and rma) from OS-9 Level 1 and Level 2. I am not sure I touched EDTASM+ again until I played with it on JS Mocha, decades later.

Hitachi 6309

Hitachi made a clone of the 6809. This replacement chip had some undocumented features such as more registers and more op codes. EDTASM+ couldn’t help with that, but there were some OS-9 compilers that were updated to support it.

That’s when folks like Robert Gault came to our rescue with enhancements for the original EDTASM+. Robert added support for the 6309, and many new features — including CoCo 3 compatibility.

His EDTASM+ looks like this on a CoCo 3 in 80 column mode:

Robert Gault’s EDTASM+ update.

If you notice the copyright date, you’ll see he has continued to update and support it. Today he offers it in a variety of versions that run on the original CoCo 1/2, a CoCo 3, certain emulators, RGB-DOS support (for hard drive users), CoCoSDC (the modern SD card floppy replacement) as well as supporting things like DriveWire.

You can pick up your own copy for $35 as follows:

EDTASM6309 $35
Robert Gault
832 N.Renaud
Grosse Pointe Woods, MI 48236
USA
e-mail:	robert.gault@att.net

There are a number of new features added. Here is the list provided in the README.txt file:

CHANGES TO EDTASM (Tandy Version)
1) Tape is no longer supported; code has been removed.
2) Buffer size increased to over 42K bytes.
3) Directory obtainable from both Editor and ZBUG; V command.
4) Multiple FCB and FDB data per line.
5) FCS supported.
6) SET command now works properly.
7) Screen colors remain as set from Basic before starting EDTASM.
8) Symbol table printed in five names per line on Coco3.
9) On assembly with /NL option, actual errors are printed.
10) Warning error on long branch where short is possible.
11) ZBUG now defaults to numeric instead of symbolic mode.
12) RGB DOS users now have support for drive numbers higher than 3.
13) Hitachi 6309 opcodes are now supported for both assembly and disassembly
including latest discoveries.
14) HD6309 detection is included and if present incorporates a ZBUG error trap
for illegal opcodes and enables monitoring and changing the E,F,V registers
from ZBUG.
15) Coco 3 users can now safely exit to Basic and use their RESET button from
EDTASM.
16) Keyboard now has auto repeat keys when keys are held down.
17) Lower case is now supported for commands, opcodes, options, and symbols.
Take care when loading or saving files or using symbols, ex. NAME does not
equal name, \.A not= \.a, etc.
18) Local names are now supported. Format is A@-Z@ and a@-z@ for 52 local
symbols. New sets of locals are
started after each blank line in the source code. Local
symbols do not appear in or clutter symbol table.
19) Local symbols can only be accessed from ZBUG in expanded form:
ex. A@00023  not A@.
20) Now reads source code files that don't have line numbers. Writes normal
source files with line numbers ( W filename ) or without line numbers
( W# filename ).
21) Macro parameters now function correctly from INCLUDEd files.
22) While in the Editor, the U key will backup one screen in your source file.
23) DOS.BAS can be used to program the F1 and F2 keys on a Coco3. See below.
24) Coco3 WIDTH80 now uses 28 lines of text.

Coco 1&2 versions do require 64K RAM, the Coco 3 version will work with 128K
of RAM. You can assemble 6309 code even if your Coco has a 6809 cpu.

It also adds some new commands:

V - obtains a directory from either Editor or ZBUG modes.
U - scrolls backwards through source code.
FCS - is used exactly like FCC but automatically add $80 to the last character
in the string.
FCB, FDB - for multiple entries per line entries should be seperated by a
comma. Make sure that the comment field for that line DOES NOT CONTAIN ANY
COMMAS or an error will result.
New ‘V’ directory command in Robert Gault’s EDTASM+ update.

If you want to do some CoCo assembly language programming, I highly recommend sending $35 to Robert and picking up a copy of his version. EDTASM+ is tricky to learn, and his updates make it a bit less tricky.

And tell him Allen sent ya.

Until next time…

Online 6809 emulator with semi-MC6847 support

A while back, the Internet led me to a wondrous thing: an online 6809 emulator, complete with assembler, debugger, and text/graphical output!

http://6809.uk

This website, “designed and coded” by Gwilym Thomas, is amazing. It has a spot where you can enter 6809 assembly source code, then you can assemble and run it!


It even has a few sample programs you can select and try out.

While it runs, you see the registers update, as well as a source-level debugger showing what op codes are currently executing. You can set break points, and memory watch points, too.

It also provides text output in the form of the MC6847 VDG chip (used by the CoCo, and a few other systems). The graphics mode is different from the VDG. While it supports some similar resolutions, it also adds a 16-color display.

The screen memory is mapped to $400 (1024) just like the CoCo, so you can run stuff like this:

start ldx #1024
loop inc ,x+
 cmpx #1536
 bne loop
 bra start

If you paste that in to the Assembly language input window and then click Assemble source code, you will see the text characters in the Text screen preview window cycling through values. Neat!

The graphics screen starts just past the text screen at $600 (1536). I think that might be where it started on a non-Disk Extended Color BASIC system. (See my article about memory on the CoCo for more details.)

The documentation notes this about the modes:

The graphics screen is a memory-mapped display of 6144 bytes of RAM beginning at address $0600. There are 3 graphics colour modes, in which either 1, 2, or 4 bits represent a single pixel in 2, 4, or 16 colours respectively. Addresses increase left to right, top to bottom as for the text screen.

Columns and rows are zero-base with (0, 0) at the (left, top). Sequences of bits (1, 2, or 4) from high to low represent pixels from left to right. The 2 colour mode has 256 pixels by 192, the 4 colour 128 by 192, each line being 32 bytes. The 16 colour mode has 128 pixels by 96, each line being 64 bytes.

Example: in 4 colour (2 bit) mode pixel (93, 38) would be in byte $0600+(38*32)+trunc(93/4), because there are 4 pixels in a byte. The colour value (0..3) would be stored in bits 5 & 4, ie. shifted left ((4-1)*2)-((93 mod 4)*2) times.

http://6809.uk/doc/doku.php?id=interactive_6809_emulator
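The pixel example from the documentation can be worked through in Python. This is only a sketch of the arithmetic quoted above. The function and parameter names are my own, with defaults set for the 4 colour mode (2 bits per pixel, 32 bytes per line, screen at $0600).

```python
def pixel_location(x: int, y: int, bits_per_pixel: int = 2,
                   base: int = 0x0600, bytes_per_line: int = 32):
    """Return (byte address, left-shift count) for pixel (x, y)."""
    pixels_per_byte = 8 // bits_per_pixel
    address = base + y * bytes_per_line + x // pixels_per_byte
    # The leftmost pixel lives in the highest bits of the byte.
    shift = (pixels_per_byte - 1 - x % pixels_per_byte) * bits_per_pixel
    return address, shift

address, shift = pixel_location(93, 38)
print(hex(address), shift)
```

For pixel (93, 38) this gives a shift of 4, which puts the 2-bit colour value in bits 5 and 4, matching the documentation’s example.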

Changing screen modes is NOT done via simulated VDG registers. Instead, it has code that looks like this:

    ldd #$0204       ; select 4 colour graphics mode
    swi3

I have not been able to find details on what values represent what mode. Also, the documentation says there is keyboard input:

Click the text screen panel then start typing for the emulator to receive keyboard input. Remember that (due to limitations of the emulated hardware) when lower case characters are printed to the screen they will appear in inverse video.

http://6809.uk/doc/doku.php?id=interactive_6809_emulator

I have not figured out how this works, yet.

As far as the 6809 assembler goes, it does not parse all of the extensions that the LWTOOLS’ lwasm assembler supports, so I have been modifying my projects to be compatible with the emulator’s assembler. This has let me, with minor changes for things like ROM calls, test and debug my code in a way that is impossible on actual hardware.

Here is the documentation:

http://6809.uk/doc/doku.php?id=interactive_6809_emulator

If you create anything interesting in it, please let everyone know in the comments.

In an Internet full of so much garbage, it’s wonderful to find such a gem.

Until next time…