See also: challenge, responses, and more responses.
Today Sebastian Tepper submitted a solution to the “draw black” challenge. He wrote:
I think this is much faster and avoids unnecessary SETs. Instruction 100 will do the POKE only once per character block.
– Sebastian Tepper, 7/5/2022
The routine he presented (seen in lines 100 and 101) looked like this:
10 CLS
20 FOR A=0 TO 31
30 X=A:Y=A:GOSUB 100
40 NEXT
50 GOTO 50
100 IF POINT(X,Y)<0 THEN POKE 1024+Y*16+X/2,143
101 RESET(X,Y):RETURN
It did meet the criteria of the challenge, correctly drawing a diagonal line from (0,0) down to (31,31) on the screen. And it was fast.
POINT() will return -1 if the location is not a graphics character. On the standard CLS screen, the screen is filled with character 96, a space. (That’s the value you use to POKE to the screen, but when printing, it would be CHR$(32) instead.) His code simply figures out which screen character contains the target pixel and POKEs it to 143 (a solid green graphics block) before resetting the pixel.
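As a side illustration (not part of the original post), the Python sketch below decodes a text-screen byte. It assumes the standard SG4 semigraphics layout:

- Bit 7 set marks a graphics block.
- Bits 4-6 hold the color.
- Bits 3-0 are the top-left, top-right, bottom-left and bottom-right pixels.

Under that assumption, 96 is not a graphics character (so POINT() returns -1), and 143 is a fully lit green block.

```python
def decode_sg4(byte):
    """Decode one CoCo text-screen byte, assuming the SG4 semigraphics layout."""
    if (byte & 0x80) == 0:
        return None  # bit 7 clear: a text character, so POINT() returns -1
    color = (byte >> 4) & 0x07  # bits 4-6: color index (0 is green)
    pixels = {
        "top_left": bool(byte & 0x08),
        "top_right": bool(byte & 0x04),
        "bottom_left": bool(byte & 0x02),
        "bottom_right": bool(byte & 0x01),
    }
    return color, pixels


print(decode_sg4(96))   # None: the space left by CLS
print(decode_sg4(143))  # color 0 with all four pixels True: a solid green block
```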
So I immediately tried to break it. I wondered what would happen if it was setting two pixels next to each other in the same block. What would RESET do?
I added a few lines to the original test program so it drew the diagonal line in both directions PLUS a box (with no overlapping corners). My intent was to make it draw a horizontal line on an even pixel line and on an odd pixel line, and the same for verticals. It looks like this (and the original article has been updated):
10 CLS
20 FOR A=0 TO 15
30 X=A:Y=A:GOSUB 100
31 X=15-A:Y=16+A:GOSUB 100
32 X=40+A:Y=7:GOSUB 100
33 X=40+A:Y=24:GOSUB 100
34 X=39:Y=8+A:GOSUB 100
35 X=56:Y=8+A:GOSUB 100
40 NEXT
50 GOTO 50
And this did break Sebastian’s routine… and he immediately fixed it:
100 IF POINT(X,Y)<0 THEN POKE 1024+INT(Y/2)*32+INT(X/2),143
101 RESET(X,Y):RETURN
I haven’t looked at what changed, but I see it calculates the character memory location by dividing Y by two. It uses INT to drop the fraction, so 15 becomes 7 rather than 7.5. It then multiplies that by 32 bytes per screen row and adds half of X. (There are half as many screen blocks as SET/RESET pixels in each direction.)
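Here is a minimal Python sketch comparing the two address formulas (the function names are hypothetical). For even Y they agree. For odd Y, the first version lands 16 bytes (half a row) too far.

A plausible reason the diagonal-only test never exposed this: by the time an odd Y was drawn, the even row above had already turned that block into a graphics character. POINT() then no longer returned -1, so no POKE happened. The box's Y=7 line is the only odd row that hits a fresh block.

```python
SCREEN = 0x400  # start of the 32x16 text screen (1024)


def addr_first_try(x, y):
    """Line 100 as first submitted: 1024+Y*16+X/2 (POKE drops the .5)."""
    return int(SCREEN + y * 16 + x / 2)


def addr_fixed(x, y):
    """Line 100 after the fix: 1024+INT(Y/2)*32+INT(X/2)."""
    return SCREEN + (y // 2) * 32 + (x // 2)


for x, y in [(40, 24), (40, 7)]:
    print(x, y, addr_first_try(x, y), addr_fixed(x, y))
# 40 24 1428 1428
# 40 7 1156 1140
```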
And it works. And it works well — all cases are satisfied.

And if that wasn’t enough, some optimizations came next:
And for maximum speed you could change line 100 from:
100 IF POINT(X,Y)<0 THEN POKE 1024+INT(Y/2)*32+INT(X/2),143
to:
100 IFPOINT(X,Y)<.THEN POKE&H400+INT(Y/2)*&H20+INT(X/2),&H8F
To time the difference, I added these extra lines:
15 TIMER=0
and:
45 PRINT TIMER
This lowers execution time from 188 to 163 timer units, i.e., down to 87% of the original time.
– Sebastian Tepper, 7/5/2022
Any time I see TIMER in the mix, I get giddy.
Spaces had been removed, 0 was changed to . (which BASIC sees as a zero that is much faster to parse), and integer values were changed to base-16 hex values.
Also, in doing speed tests about the number format I verified that using hexadecimal numbers was more convenient only when the numbers in question have two or more digits.
– Sebastian Tepper, 7/5/2022
Awesome!
Perhaps a final improvement could be to replace the screen memory location (1024/&H400), the multiplier (32/&H20), and the character value (143/&H8F) with variables. Looking up a variable can be even faster than parsing a number, as long as there are not too many variables in the list ahead of the ones you’re looking up.
Using the timer value of 163 as the speed to beat, first I removed that extra space just to see if it mattered. No change.
Then I declared three new variables, and used DIM to put them in the order I wanted them (the A in the FOR/NEXT loop initially being the last):
11 DIM S,C,W,A
12 S=1024:W=32:C=143
...
100 IFPOINT(X,Y)<.THENPOKES+INT(Y/2)*W+INT(X/2),C
101 RESET(X,Y):RETURN
No change. I still got 163. So I moved A to the start. A is used more than any other variable, so maybe that will help:
11 DIM A,S,C,W
No change — still 163.
Are there any other optimizations we could try? Let us know in the comments.
Thank you for this contribution, Sebastian. I admire your attention to speed.
Until next time…

I found two additional optimizations:
POKE truncates decimals anyway, so we can use X/2 instead of INT(X/2). This gives an execution time of 159.
100 IF POINT(X,Y)<. THEN POKE &H400+INT(Y/2)*&H20+X/2,&H8F
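Here is a quick Python check of that claim, assuming POKE drops the fractional part of its address. The row base is always a whole number, and X/2 only ever has a .5 fraction. So base+X/2 truncates to the same byte as base+INT(X/2) for every pixel on the 64x32 grid.

```python
def poke_address(value):
    # Assumption: POKE truncates a fractional address (all addresses here are positive)
    return int(value)


def x_half_is_safe():
    """Check that POKE base+X/2 hits the same byte as base+INT(X/2) for every pixel."""
    for y in range(32):
        for x in range(64):
            base = 0x400 + (y // 2) * 0x20
            if poke_address(base + x / 2) != base + x // 2:
                return False
    return True


print(x_half_is_safe())  # True
```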
It is possible to replace INT(Y/2)*&H20 with (Y AND &H1E)*&H10. This further lowers the execution time to 151.
100 IF POINT(X,Y)<. THEN POKE &H400+(Y AND &H1E)*&H10+X/2,&H8F
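Why the AND trick works, as a small Python sketch. For Y from 0 to 31, Y AND &H1E just clears bit 0, which gives 2*INT(Y/2). Multiplying that by &H10 equals INT(Y/2)*&H20.

The mask also clears bit 5, so the two forms diverge from Y=32 upward. That is outside the 0-31 range used here.

```python
def row_offset_int(y):
    return (y // 2) * 0x20      # INT(Y/2)*&H20


def row_offset_and(y):
    return (y & 0x1E) * 0x10    # (Y AND &H1E)*&H10


# Clearing bit 0 of Y gives 2*INT(Y/2); times 16 equals INT(Y/2)*32.
print(all(row_offset_int(y) == row_offset_and(y) for y in range(32)))  # True
print(row_offset_int(32), row_offset_and(32))  # 512 0 (the mask also drops bit 5)
```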
So, 151 is the number to beat (for now).
I’ll post another follow-up soon, with these changes. I also did some extra benchmarking, running through it 100 times and using the average. I got a slightly different number — what are you running this on? CoCo 3 BASIC is slower (more RAM hooks it jumps through), and I think having DISK BASIC is slower than just having EXTENDED and such. I wasn’t aware of that when I started writing benchmark programs.
POINT is slow. SET is slow. I need to look at the code and see why. I guess it’s all the parsing of parameters and such.
I am using a COCO 1 in XROAR with the following configuration:
xroar.exe -machine cocous -bas bas11.rom -extbas extbas10.rom -ram 32 -kbd-translate -lp-file printer.txt
with the RS disk cartridge enabled.
May be a difference in extbas10. I’ll test.
I think X and Y are also frequently referenced. What about changing line 11 to:
11 DIM X,Y,A,S,C,W
to put them at the top of the variable table?
Exactly. ;)
But if only a few variables exist, the gain in search time might not be that great at all.
Previous benchmark was 151, but I realized that you can also replace the numbers in lines 30-35 with their hexadecimal equivalents. After all, they are inside a loop that runs 16 times!
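To put rough numbers on this, here is a simplified Python model. It is an assumption rather than a reading of the ROM code: each reference to a simple variable scans the table from the front until it finds a match.

The model counts the table entries inspected by one call of lines 100-101 from the variable version above. In that version, X and Y are referenced three times each, and S, W and C once each. It compares the three orders discussed.

The model says nothing about absolute time, which also depends on parsing. It also ignores the A references in lines 30-35.

```python
# Hypothetical model: a simple variable is found by scanning the table from the front.
# References per call of lines 100-101 in the variable version of the routine.
REFS_PER_CALL = {"X": 3, "Y": 3, "S": 1, "W": 1, "C": 1}


def comparisons(table_order, refs=REFS_PER_CALL):
    """Total entries inspected when each reference scans table_order from the start."""
    total = 0
    for name, count in refs.items():
        position = table_order.index(name) + 1  # first entry costs one comparison
        total += position * count
    return total


for order in (["S", "C", "W", "A", "X", "Y"],   # DIM S,C,W,A (X, Y created at line 30)
              ["A", "S", "C", "W", "X", "Y"],   # DIM A,S,C,W
              ["X", "Y", "A", "S", "C", "W"]):  # DIM X,Y,A,S,C,W
    print("".join(order), comparisons(order))
# SCWAXY 39
# ASCWXY 42
# XYASCW 24
```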
Now my benchmark is 145 timer units. In general I get small variations, like +1 or -1 unit when running the code several times. I also got 146 and 144, so I am giving the value in the middle.
10 CLS
15 TIMER=0
20 FOR A=0 TO 15
30 X=A:Y=A:GOSUB 100
31 X=&HF-A:Y=&H10+A:GOSUB 100
32 X=&H28+A:Y=7:GOSUB 100
33 X=&H28+A:Y=&H18:GOSUB 100
34 X=&H27:Y=8+A:GOSUB 100
35 X=&H38:Y=8+A:GOSUB 100
40 NEXT
45 PRINT TIMER
50 GOTO 50
100 IF POINT(X,Y)<. THEN POKE &H400+(Y AND &H1E)*&H10+X/2,&H8F
101 RESET(X,Y):RETURN
I wondered how far we can push this with an assembly helper routine, but without leaving the BASIC environment.
I benchmarked 82 timer units with the following code:
10 CLS:GOSUB 1000
15 TIMER=0
20 FOR A=0 TO &HF
30 X=A:Y=A:GOSUB 100
31 X=&HF-A:Y=&H10+A:GOSUB 100
32 X=&H28+A:Y=7:GOSUB 100
33 X=&H28+A:Y=&H18:GOSUB 100
34 X=&H27:Y=8+A:GOSUB 100
35 X=&H38:Y=8+A:GOSUB 100
40 NEXT
45 PRINT TIMER
50 GOTO 50
100 Z=USR0(Y*&H100+X):RETURN
999 'ENHANCED RESET(X,Y)
1000 Z$="BDB3ED340644C6203D8E0400308BE661543A350684015649C610544A2AFC536D842B04C48F2002E484E78439"
1010 M=474:DEFUSR0=M 'USE CASSETTE BUFFER AREA (44 OUT OF 256 BYTES)
1020 FOR Z=1 TO LEN(Z$) STEP 2
1030 POKE M,VAL("&H"+MID$(Z$,Z,2)):M=M+1
1040 NEXT Z
1050 RETURN
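For anyone following along outside BASIC, here is a Python sketch of what lines 1000-1050 do. Each pair of hex digits in Z$ becomes one byte, POKEd into the cassette buffer starting at 474 (&H1DA). The 88 hex digits yield 44 bytes ending at &H205, matching the addresses in the commented listing further down.

```python
Z_HEX = ("BDB3ED340644C6203D8E0400308BE661543A350684015649"
         "C610544A2AFC536D842B04C48F2002E484E78439")


def load_hex(z, start):
    """Mimic lines 1020-1040: POKE each pair of hex digits into successive addresses."""
    memory = {}
    address = start
    for i in range(0, len(z), 2):
        memory[address] = int(z[i:i + 2], 16)  # VAL("&H"+MID$(Z$,Z,2))
        address += 1                           # M=M+1
    return memory


mem = load_hex(Z_HEX, 474)  # M=474 (&H1DA), the cassette buffer
print(len(mem), hex(min(mem)), hex(max(mem)))  # 44 0x1da 0x205
```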
Some comments:
1) For those trying to break the code :-) you can try changing CLS to CLS 4 (or any other color) in line 10, and it will still work as intended. On the other hand, I decided NOT to check the sanity of the X and Y coordinates. That’s a possible way to get unexpected results and eventually corrupt memory outside the video area.
2) I based the code on the disassembly in Color Basic Unravelled, though with some changes in register usage and avoiding external memory addresses for intermediate values. Someone more proficient than me might find additional ways to optimize the code.
3) I tested the assembly code in the 6809.uk online assembler, a great tool! I think I found a bug in its assembler: the instructions BMI (branch on minus) and BPL (branch on plus) were assembled with their opcodes swapped, which of course broke the code. It took me a long time debugging in EDTASM before I found that these branch instructions had been assembled wrong (0x2A swapped with 0x2B). After I patched this manually, the code worked. Something to keep in mind.
When I first learned 6809 assembly, I don’t think it dawned on me that one could pass in two 8-bit values as a single 16-bit value to USR0. It’s just like using the A and B registers to make D. I used to POKE them into memory locations and then call a routine. “Wish I knew then what I know now!”
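Here is a small Python sketch of that packing, as used by line 100's Z=USR0(Y*&H100+X). It assumes, as the commented listing states, that the ROM call at $B3ED leaves the integer in D. ACCA then receives Y and ACCB receives X.

```python
def pack_usr_arg(x, y):
    """Line 100's argument: Y*&H100+X puts Y in the high byte, X in the low byte."""
    return y * 0x100 + x


def split_d(d):
    """Split a 16-bit D value into (ACCA, ACCB), i.e. (high byte, low byte)."""
    return (d >> 8) & 0xFF, d & 0xFF


d = pack_usr_arg(40, 7)
print(hex(d), split_d(d))  # 0x728 (7, 40): ACCA = Y, ACCB = X
```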
In fact you could even pass three values and implement a sort of SET(X,Y,C) function that accepts C between 0 and 8.
This is possible if you pack all three values using 15 bits like this:
X: 0-63 (64 values) -> 6 bits -> lower 6 bits of ACCB
Y: 0-31 (32 values) -> 5 bits -> lower 3 bits of ACCA : upper 2 bits of ACCB
C: 0-8 (9 values) -> 4 bits -> bits 3 to 6 of ACCA
TOTAL: 15 bits.
Packing function: 2048C+64Y+X
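Here is a Python sketch of the 15-bit packing and the matching unpacking (the function names are hypothetical). With 2048C+64Y+X, C occupies bits 11-14 of D, which are bits 3-6 of ACCA.

The largest packed value, 2048*8+64*31+63 = 18431, stays below 32768. That matters if, as assumed here, USR's integer conversion only accepts signed 16-bit values.

```python
def pack3(x, y, c):
    """Packing function 2048C+64Y+X from the comment above."""
    if not (0 <= x <= 63 and 0 <= y <= 31 and 0 <= c <= 8):
        raise ValueError("X must be 0-63, Y 0-31, C 0-8")
    return 2048 * c + 64 * y + x


def unpack3(d):
    """Recover X, Y, C from ACCA:ACCB, as a helper routine would."""
    a, b = (d >> 8) & 0xFF, d & 0xFF
    x = b & 0x3F                      # lower 6 bits of ACCB
    y = ((a & 0x07) << 2) | (b >> 6)  # lower 3 bits of ACCA : upper 2 bits of ACCB
    c = (a >> 3) & 0x0F               # bits 3-6 of ACCA (bits 11-14 of D)
    return x, y, c


print(pack3(63, 31, 8))  # 18431
```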
Clever. Now I need to revisit my BASE64 encoding articles. I don’t even recall if I finished the series, but it was doing something using 6-bits, then figuring out ways to pack them together in three bytes or something. I think I had another part of it in progress but I got stumped trying to figure something out.
https://subethasoftware.com/2020/11/11/compressing-basic-data-with-base-64-part-1/
Did you try to contact the maintainer of 6809.uk to file a bug report for the BMI/BPL irregularity?
I tried to contact him to thank him for such a great site, when I first discovered it, but I was not able to find any contact details.
This is the listing I got initially after assembling the code in 6809.uk:
START:
4000: BD B4 F4 JSR $B3ED
4003: 34 06 PSHS A,B
4005: 44 LSRA
4006: C6 20 LDB #$20
4008: 3D MUL
4009: 8E 04 00 LDX #$0400
400C: 30 8B LEAX D,X
400E: E6 61 LDB $01,S
4010: 54 LSRB
4011: 3A ABX
4012: 35 06 PULS A,B
4014: 84 01 ANDA #$01
4016: 56 RORB
4017: 49 ROLA
4018: C6 10 LDB #$10
LOOP:
401A: 54 LSRB
401B: 4A DECA
401C: [2B] FC BPL LOOP
401E: 53 COMB
401F: 6D 84 TST ,X
4021: [2A] 04 BMI FINAL1
4023: C4 8F ANDB #$8F
4025: 20 02 BRA FINAL2
FINAL1:
4027: E4 84 ANDB ,X
FINAL2:
4029: E7 84 STB ,X
402B: 39 RTS
Notice that the [2B] and [2A] machine codes should be swapped, which I corrected manually in the BASIC implementation.
I am really going to enjoy digging into this code.
This commented version of the assembly listing might help in understanding what it does:
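The fix in question is a one-byte swap per branch. Below is a hypothetical Python helper that applies it. It uses the correct 6809 opcodes (BPL is $2A, BMI is $2B) and a byte sequence taken from the listing above.

```python
# Opcodes for the two short branches in question (6809 instruction set).
OPCODE = {"BPL": 0x2A, "BMI": 0x2B}


def patch_branches(code, offsets):
    """Swap 0x2A <-> 0x2B at each offset, the manual fix described above."""
    patched = bytearray(code)
    for off in offsets:
        if patched[off] == OPCODE["BPL"]:
            patched[off] = OPCODE["BMI"]
        elif patched[off] == OPCODE["BMI"]:
            patched[off] = OPCODE["BPL"]
        else:
            raise ValueError(f"offset {off} is not a BPL/BMI opcode")
    return bytes(patched)


# Bytes at $401A-$401E in the listing: LSRB, DECA, then "BPL LOOP" emitted as 2B FC
loop_bytes = bytes.fromhex("544A2BFC53")
print(patch_branches(loop_bytes, [2]).hex().upper())  # 544A2AFC53
```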
01DA BD B3ED 00110 START JSR $B3ED ;GET VER:HOR COORDS IN ACCA:ACCB
01DD 34 06 00120 PSHS A,B ;SAVE VER:HOR FOR LATER USE
01DF 44 00130 LSRA ;TWO VER PIXELS/CHAR
01E0 C6 20 00140 LDB #32 ;32 BYTES/ROW
01E2 3D 00150 MUL ;GET ROW OFFSET OF CHAR POSITION
01E3 8E 0400 00160 LDX #$400 ;SCREEN BUFFER ADDRESS
01E6 30 8B 00170 LEAX D,X ;ADD ROW OFFSET TO SCREEN BUFFER ADDRESS
01E8 E6 61 00180 LDB 1,S ;GET HOR COORD
01EA 54 00190 LSRB ;TWO HOR PIXELS/CHAR
01EB 3A 00200 ABX ;ADD HOR OFFSET TO FORM CHAR ADDRESS
01EC 35 06 00210 PULS A,B ;GET VER:HOR COORDS IN ACCA:ACCB
01EE 84 01 00220 ANDA #1 ;KEEP ONLY LSB OF VER COORD
01F0 56 00230 RORB ;LSB OF HOR COORD TO CARRY FLAG
01F1 49 00240 ROLA ;LSB OF HOR COORD TO BIT 0 OF ACCA
01F2 C6 10 00250 LDB #$10 ;MAKE A BIT MASK-TURN ON BIT 4
01F4 54 00260 LOOP LSRB ;SHIFT IT RIGHT ONCE
01F5 4A 00270 DECA ;SHIFTED IT ENOUGH?
01F6 2A FC 00280 BPL LOOP ;NOT YET
01F8 53 00290 COMB ;COMPLEMENT TO FORM BIT MASK
01F9 6D 84 00300 TST ,X ;CHECK IF CHAR ON SCREEN IS GRAPHIC
01FB 2B 04 00310 BMI FINAL1 ;IT’S ALREADY A GRAPHIC CHAR
01FD C4 8F 00320 ANDB #$8F ;FORCE ALL-GREEN GRAPHIC CHAR
01FF 20 02 00330 BRA FINAL2 ;READY TO APPLY BIT MASK
0201 E4 84 00340 FINAL1 ANDB ,X ;APPLY BIT MASK
0203 E7 84 00350 FINAL2 STB ,X ;STORE MODIFIED CHARACTER
0205 39 00360 RTS ;BACK TO BASIC
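To check a reading of the listing, here is a Python model of the routine. It is a sketch rather than a cycle-accurate emulation: the screen is a 512-byte array standing for $400-$5FF, and each step is annotated with the instruction it mirrors.

Resetting the four pixels of one fresh block walks it from $87 down to $80 (a black block). A block that is already colored keeps its color bits, which is consistent with the CLS 4 remark above.

```python
def enhanced_reset(screen, x, y):
    """Model of the commented 6809 routine; screen[0] stands for $400. No range checks."""
    offset = (y >> 1) * 32 + (x >> 1)   # LSRA/MUL (row), then LSRB/ABX (column)
    a = ((y & 1) << 1) | (x & 1)        # ANDA #1, RORB, ROLA: pixel number 0-3
    b = 0x10                            # LDB #$10
    while True:                         # LOOP: LSRB / DECA / BPL LOOP
        b >>= 1
        a -= 1
        if a < 0:
            break
    b = ~b & 0xFF                       # COMB: mask with only the target bit clear
    char = screen[offset]
    if char & 0x80:                     # TST ,X / BMI: already a graphics block
        b &= char                       # FINAL1: ANDB ,X keeps its color and pixels
    else:
        b &= 0x8F                       # ANDB #$8F: start from an all-green block
    screen[offset] = b                  # FINAL2: STB ,X


screen = bytearray([96] * 512)          # a CLS screen full of spaces
for x, y in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    enhanced_reset(screen, x, y)
    print(hex(screen[0]))               # 0x87, then 0x83, 0x81, 0x80
```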
And you built this using the online 6809 emulator?
The process was as follows:
1) Copied the SET, RESET and POINT code from Color Basic Unravelled to a text editor
2) Deleted unneeded parts of the code, rearranged, and tried to optimize register usage
3) Copied this assembly code into the 6809 emulator window and compiled, then tested a few cases by manually changing A and B registers (I skipped the first opcode that calls INTCNV).
4) Copied the assembled code from the 6809 emulator to the text editor again and manually assembled all the hexadecimal opcodes into the Z$ string.
5) Saved the BASIC program with the assembly loader and tried it in XROAR.
6) The code did not work, so I loaded the EDTASM cartridge in XROAR and used ZBUG to inspect the assembly code in memory, following its execution step by step.
7) At this point I realized the disassembly showed BMI where BPL was needed and vice versa.
8) Went back to the text editor and manually patched the hex values to get the proper branch instructions and reloaded in XROAR.
Bottom line: it seems there is a bug in the 6809 emulator, but it is a VERY powerful tool which is much easier to use than ZBUG, so I’ll keep using it to develop and test code snippets.