Color BASIC optimization challenge – more attempts

See also: part 1 and part 2

The prolific Jim Gerrie has ported the original scaling demo over to the MC-10 and optimized it. On his system, it reports 9.5 seconds!

Jim Gerrie’s optimized port of the CoCo scaling demo.

He shared his code on his GitHub page, but I’ll include it here for commentary:

0 REM scale.bas
1 GOSUB100:TM=TIMER:FORZ=1TO M:CLS:A$=STRING$(W,C):FORA=(8-INT(H*B))*L+E-INT(W*B)TOH*L STEPL:PRINT@A,A$:NEXT:IFH<1ORH>=&H10 THENQ=-Q:R=-R
2 W=W+Q:H=H+R:NEXT
3 REM 60=NTSC 50=PAL
4 T=TIMER:PRINT:PRINT (T-TM)/60;"SECONDS"
5 END
100 DIMW,H,A$,A,B,L,Q,R,C,E,M,Z
110 I=32/4:REM SCALE WIDTH
120 J=16/3:REM SCALE HEIGHT
130 D=.1:REM SCALE INC/DEC
140 S=.5:REM SCALE FACTOR
150 W=I*S:H=J*S
160 Q=I*D:R=J*D
170 L=32:M=&H64
180 B=1/2:C=&HAF:E=15
190 RETURN

Not prepared to let an MC-10 beat a CoCo, I wanted to try it myself on my test system, the Xroar emulator.

I did. And I got 10.75 seconds! The CoCo is slower than the MC-10?

But that’s still faster than our previous best attempt of 13.3 seconds by Xroar author Ciaran Anscomb. Maybe I can speed the CoCo up a bit. Someone commented that DISK BASIC was slightly slower due to hooking into an interrupt (I think it uses that for a time delay when turning off the drive motor after disk access). Since the MC-10 doesn’t have DISK anyway, I thought I’d disable RS-DOS and try it again.

11.78 seconds without Disk BASIC. It got even slower? That’s odd. I tried this last night on the Mac Xroar emulator and thought it was slightly faster.

We have seen variances between emulators and systems when it comes to timing, so at some point we need to find a better way to do this. I mean, the MC-10 can’t be faster, can it?

Speaking of the MC-10, first, you should be aware that Jim Gerrie is one of the most prolific programmers around, porting and writing software on what seems to be a daily basis. Just check out his YouTube channel sometime. He has an incredible version of the Rally-X arcade game, entirely in BASIC.

But I digress.

I want to point out that Jim normally wouldn’t have been able to run my scaling demo on an MC-10 since it does not include a TIMER function, nor does it support HEX numbers (as far as I know?). He is using MCX BASIC by Darren Atkinson. Darren is the designer behind the CoCoSDC floppy disk replacement project. MCX BASIC adds things like TIMER and HEX to the MC-10, making it closer to Extended Color BASIC on the CoCo.
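For anyone not used to reading the &H notation in Jim's listing, here is a small Python sketch (not CoCo code) that decodes the three hex constants he uses. The dictionary keys are just labels I made up for illustration; the hex digits are the ones from his listing.

```python
# Hex constants as they appear in Jim's listing (&H64, &HAF, &H10).
# The key names are illustrative labels, not names from his program.
hex_constants = {
    "loop count (M)": "64",
    "block character (C)": "AF",
    "height limit": "10",
}

# int(text, 16) parses a string of hex digits into an integer.
decoded = {label: int(digits, 16) for label, digits in hex_constants.items()}

for label, value in decoded.items():
    print(label, "=", value)
```

The decoded values are 100, 175 and 16, which line up with the 100 passes, the CHR$ value 175 and the height test against 15 in the original listing.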

But I digress again.

Adam

In the previous article, Adam shared the results of his version, and has now posted his code:

1 DIM SW,SH,SM,S,TM,Z,W,H,P
2 DIM A,B,C,D,E,F,L$
3 SW=8:SH=5.33333334:SM=.1:S=.5
4 B=32:C=175:D=15:E=2:F=7
5 TM=TIMER
6 FORZ=1TO100
7 W=INT(SW*S):H=INT(SH*S)
8 P=D-INT(W/E)+(F-INT(H/E))*B
9 L$=STRING$(W,C)
10 CLS
11 FORA=1TO H:PRINT@P+A*B,L$:NEXT
12 S=S+SM
13 IF H<1 OR H>D THEN SM=-SM:S=S+(SM*E)
14 NEXT
15 PRINT:PRINT (TIMER-TM)/60;"SECONDS"

On my Xroar CoCo 2 test platform I get 17.93 seconds. Adam also sent in an interesting note which may explain some of the timing differences I am seeing reported:

This exercise also highlights the speed differences between a Coco2 and Coco3. I think the GIME chip is slower than the VDG in printing to the low-res screen. A Coco3 runs this code roughly 2 seconds slower!
Adam

Now that is an interesting observation. When I got my CoCo 3, my old machine went back in the box and I never had them both hooked up at the same time to do any comparisons.

I knew that the CoCo 3 40/80 column screens seemed slower. There are patches floating around that speed them up dramatically. Apparently it does some kind of MMU memory bank switch in and out for each character displayed. I did not realize there would be any difference in the 32 column VDG style screen. I’ll have to look into this and see if I can find out why.

Walter Zambotti

On the CoCo mailing list (if you use e-mail, and like the CoCo, you should sign up), Australian Walter Zambotti saw the original example and provided a tip:

Try changing the inner loop to remove all calculations like this
115 P2=P+32:H2=P+H*32:BK$=STRING$(W,175)
120 FOR A=P2 TO H2 STEP 32
130 PRINT @A,BK$
140 NEXT A
I believe I chopped 7 seconds off the time.
Walter Zambotti via CoColist on March 13, 2020

It seems others picked up on this as well, as I have seen some speedy attempts that pre-calculate values (so the FOR/NEXT loop only has to increment by 32 to get to the next line for PRINT) and pre-render the string of blue blocks. (I was aware of strings being quite slow after my String Theory experiments, but some of the pre-calculated values I would not have thought of.)
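To convince myself that pre-calculating the loop bounds prints to the same screen positions, here is a rough Python model (not CoCo code). It assumes the intended bounds are a first position of P+32 and a last position of P+H*32; the function names are made up for illustration.

```python
COLS = 32  # characters per row on the 32-column text screen

def top_left(w, h):
    """P as the original listing computes it: 15-INT(W/2)+(7-INT(H/2))*32."""
    return 15 - w // 2 + (7 - h // 2) * COLS

def offsets_in_loop(p, h):
    """PRINT@ positions when P+A*32 is recalculated for A = 1 to H."""
    return [p + a * COLS for a in range(1, h + 1)]

def offsets_precomputed(p, h):
    """Same positions with both bounds computed once, then STEP 32."""
    first = p + COLS       # assumed start of the FOR loop
    last = p + h * COLS    # assumed end of the FOR loop (inclusive)
    return list(range(first, last + 1, COLS))

# Compare both styles for every width and height the demo could produce.
mismatches = [
    (w, h)
    for w in range(1, 25)
    for h in range(1, 17)
    if offsets_in_loop(top_left(w, h), h) != offsets_precomputed(top_left(w, h), h)
]
print(mismatches)  # prints [] when the two styles agree
```

The model only covers heights of 1 or more. A BASIC FOR loop always runs its body once, so a height of 0 behaves differently there than in Python.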

Nice job, Walter!

Mission: Beat the MC-10

This leaves us with a problem. Jim Gerrie’s MC-10 version is still the fastest. Perhaps the 6803 in the MC-10 and its BASIC are just faster. Perhaps Jim’s just better at BASIC than we are. I’m willing to accept the second part, but my pride doesn’t want the first part to be true.

Can you make this faster than what Jim did? With the various attempts shared so far, perhaps bits and pieces of each of them could be combined to create something even faster?

Here is the original un-optimized code again for reference:

0 REM scale.bas
10 SW=32/4 ' SCALE WIDTH
20 SH=16/3 ' SCALE HEIGHT
30 SM=.1 ' SCALE INC/DEC
40 S=.5 ' SCALE FACTOR
70 TM=TIMER:FOR Z=1 TO 100
80 W=INT(SW*S)
90 H=INT(SH*S)
100 P=15-INT(W/2)+(7-INT(H/2))*32
110 CLS
120 FOR A=1 TO H
130 PRINT@P+A*32,STRING$(W,175)
140 NEXT A
150 S=S+SM
160 IF H<1 OR H>15 THEN SM=-SM:S=S+(SM*2)
170 NEXT Z
180 ' 60=NTSC 50=PAL
190 PRINT:PRINT (TIMER-TM)/60;"SECONDS"

If you don’t have access to a real CoCo or emulator, you could use one of these from a web browser:

Although there is a way to load code into them, I am not sure if there is a way to get the code back out. However, I have been typing my BASIC up in a text editor on my Mac. Xroar allows mounting a text file (with the extension of .bas or .asc) as a cassette tape, then doing a “CLOAD” to load it in as if it were a program saved to tape in ASCII format. This allows me to edit and make changes on my Mac, then load the results into Xroar for testing.

If you try Xroar Online, set the “Machine:” type to “Tandy CoCo (NTSC)” to match the timing of the emulated American CoCo I am using (where TIMER is 60 ticks per second, versus the PAL version that is 50 per second). Then, save out the code as a text file and mount it using the “Tape:” insert option. You can then type CLOAD in the emulator to load and RUN it.
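The machine type matters because the listings divide the TIMER difference by 60. Here is a quick Python sketch (not CoCo code) of that conversion, using the 60 and 50 ticks per second figures above. The function name and the sample tick counts are made up for illustration.

```python
NTSC_TICKS_PER_SECOND = 60
PAL_TICKS_PER_SECOND = 50

def elapsed_seconds(start_ticks, end_ticks, ticks_per_second=NTSC_TICKS_PER_SECOND):
    """Convert a TIMER difference into seconds, like (TIMER-TM)/60 in the listings."""
    return (end_ticks - start_ticks) / ticks_per_second

# 645 ticks on an NTSC machine is 10.75 seconds.
ntsc_time = elapsed_seconds(0, 645)

# 500 ticks on a PAL machine is really 10 seconds, but dividing by 60
# (the NTSC value) would under-report it as about 8.33 seconds.
pal_time_wrong = elapsed_seconds(0, 500)
pal_time_right = elapsed_seconds(0, 500, PAL_TICKS_PER_SECOND)

print(ntsc_time, round(pal_time_wrong, 2), pal_time_right)
```

So a time reported from a PAL machine without changing the divisor to 50 would look faster than it really was.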

Load ASCII BASIC as if it was a tape via Xroar Online.

Any takers?

Until next time…

5 thoughts on “Color BASIC optimization challenge – more attempts”

William Astle April 4, 2020 at 12:33 pm

Adam’s suggestion that the GIME is slower than the VDG for the 32 column screen is actually wrong. It has nothing to do with the GIME or VDG since neither chip affects the speed of access to RAM. It’s all down to the code being run.

The difference is that the Coco3 additions add another bit of code to the “print a character” code path. You’ll find that plain Color Basic will be faster than ECB for similar reasons. This is because the “print a character” code path does a complete video mode reset for every character printed. While this isn’t nearly as bad as the horribly slow screen mapping code for the 40/80 column screens, it is still extra code nevertheless. The ECB additions force the VDG to text mode and update the SAM registers to make sure the screen is at the correct address. The Coco3 additions also program the GIME video mode registers for the 32 column screen (using code that’s not as efficient as it could be, but which isn’t nearly as bad as the character output routine for the 40/80 column screen).

The slowness in the 40/80 column display routines is that the code sets all 16 MMU registers once to map the screen to CPU memory and then sets all 16 again to unmap it. And it does this using an inefficient loop with at least one more subroutine call than it needs. By replacing both the mapping and unmapping routines with short routines that set a single MMU register (which is all that is needed), the output speed of the 40/80 column screen becomes comparable to the 32 column screen.

Also, in case you’re wondering why Disk Basic might make things faster than Extended Basic on its own: Disk Basic 1.1 actually replaces the entire command interpretation loop with a consolidated version. It also, if you’re not on a Coco3, does a check to see if a key is pressed before doing the BREAK check. (Disk Basic 1.0 doesn’t, but you really don’t want to be running Disk Basic 1.0.) Depending on your underlying version of Color Basic, POLCAT may not do that. You can tell if that’s the case by running a program that does output in a loop (and maybe does some other steps inside the loop to increase the number of statements) and then holding a key down. If the output slows down while the key is down and speeds up when you release it, you have the “is a key down” version of the BREAK check. If it stays the same speed all the time, you don’t. Unfortunately, the Coco3 stuff patches that “is a key down” check out so you won’t get that benefit on the Coco3.

1. Allen Huffman Post author April 7, 2020 at 8:53 pm
  
  This happens for the 32 column display TOO? I did not know that (or if I did, I forgot.) Thanks for the info! I was thinking of you recently when I had some puzzling questions about how BASIC was doing things. I’ll have to drop you a note when I get to them.
  
Jason Pittman April 8, 2020 at 1:30 pm

I’m seeing a tiny (<0.1) increase by changing the OR statement to an IF…ELSE. Could BASIC be evaluating both sides of an OR even if the first is true? For example, in Jim Gerrie's example, change:

:IFH<1ORH>=&H10 THENQ=-Q:R=-R

to:

:IFH=&H10 THENGOSUB200

(And add a subroutine for reversing the sign… “200 Q=-Q:R=-R:RETURN”)

1. Jason Pittman April 8, 2020 at 1:35 pm
  
  Somehow in posting that, I butchered the ELSE statement, but this example should show what I mean:
  https://gist.github.com/jsonpittman/14e972f99d1869d90ab3c08710dc2f44
  
  1. Allen Huffman Post author April 9, 2020 at 9:25 am
    
    This is *exactly* what I just learned about in an 8-Bit Show And Tell YouTube video. I have an article about it scheduled for next week, I think. I will NEVER use IF AND/OR again after learning this. I had no idea! How did you learn this?
    

Sub-Etha Software

"In Support of the CoCo and OS-9 since 1990!"
