Well, here we go again with a few more tidbits from comments to a previous installment…
Lowercase to Uppercase
John Klassek once again provides some interesting tips. He looked at my Building a Better BASIC Input article example:
30 LN=INSTR("YyNn",A$):IF LN=0 THEN PRINT"YES OR NO ONLY!":GOTO 20
40 ON LN GOTO 100,100,200,200
…and suggested this:
40 ON (LN+1)/2 GOTO 100,200
…and noted:
…and you don’t have to keep the line numbers twice in ON…GOTO.
Well, that makes sense. The extra bytes taken to do the math would be fewer than all the doubled line numbers. But does doing the math take more time than parsing through the extra line numbers? Let’s find out:
5 DIM TE,TM,B,A,TT
10 FORA=0TO4:TIMER=0:TM=TIMER
20 FORB=0TO1000
30 LN=INSTR("YyNn",A$):IF LN=0 THEN PRINT"YES OR NO ONLY!":GOTO 20
40 ON LN GOTO 100,100,200,200
70 NEXT:TE=TIMER-TM
80 TT=TT+TE:PRINTA,TE
90 NEXT:PRINTTT/A:END
100 GOTO 70
200 GOTO 70
This produces 866.
John’s suggestion:
5 DIM TE,TM,B,A,TT
10 FORA=0TO4:TIMER=0:TM=TIMER
20 FORB=0TO1000
30 LN=INSTR("YyNn",A$):IF LN=0 THEN PRINT"YES OR NO ONLY!":GOTO 20
40 ON (LN+1)/2 GOTO 100,200
70 NEXT:TE=TIMER-TM
80 TT=TT+TE:PRINTA,TE
90 NEXT:PRINTTT/A:END
100 GOTO 70
200 GOTO 70
…jumps up to 1214! So, smaller code, but at a speed penalty. (See: Size versus Speed.) Brute force is often faster.
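As an aside, the way John’s (LN+1)/2 expression folds the four INSTR positions onto two targets can be modeled outside of BASIC. This Python sketch is my own illustration, not the BASIC itself; floor division stands in for the truncating integer conversion ON…GOTO performs:

```python
# INSTR("YyNn", A$) returns a 1-based match position: Y=1, y=2, N=3, n=4.
# (pos + 1) / 2 folds each upper/lowercase pair onto one branch index;
# ON...GOTO truncates the expression, which floor division models here.
targets = {1: 100, 2: 200}  # stand-in for GOTO 100,200

def branch(pos):
    return targets[(pos + 1) // 2]

for pos in (1, 2):
    assert branch(pos) == 100  # Y and y both land on line 100
for pos in (3, 4):
    assert branch(pos) == 200  # N and n both land on line 200
```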
However, I also discussed converting characters to UPPERCASE so you only had to compare uppercase. That came with quite the speed penalty:
0 DIM A$,C:A$="*"
5 DIM TE,TM,B,A,TT
10 FORA=0TO4:TIMER=0:TM=TIMER
20 FORB=0TO1000
40 C=ASC(A$):IF C>=97 AND C<=122 THEN A$=CHR$(C-32)
70 NEXT:TE=TIMER-TM
80 TT=TT+TE:PRINTA,TE
90 NEXT:PRINTTT/A:END
That benchmark was 960.
John then presented a much faster approach by using strings and the INSTR command:
0 DIM A$,C,Z:A$="*"
1 AL$="aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"
5 DIM TE,TM,B,A,TT
10 FORA=0TO4:TIMER=0:TM=TIMER
20 FORB=0TO1000
40 Z=INSTR(AL$,A$): IF Z>1 THEN C=64+Z/2 ELSE C=ASC(A$)
70 NEXT:TE=TIMER-TM
80 TT=TT+TE:PRINTA,TE
90 NEXT:PRINTTT/A:END
That drops it a bit lower, to 920. For a custom input routine, every bit helps it keep up with faster typing. Nicely done! (He also notes that this assumes the character string is non-empty, which we’d ensure is the case if using INKEY$.) And remember that we could speed it up even more by changing the decimal numbers to HEX.
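The idea behind the paired string can be modeled in Python. This is a sketch of the lookup technique only, not a literal translation of the BASIC line, which computes the character code arithmetically as 64+Z/2:

```python
# Each lowercase letter sits immediately before its uppercase partner, so an
# odd 1-based match position means "lowercase; the next character is upper".
PAIRS = "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ"

def to_upper(ch):
    z = PAIRS.find(ch) + 1      # 1-based position, like INSTR; 0 = not found
    if z >= 1 and z % 2 == 1:   # odd position: a lowercase entry
        return PAIRS[z]         # 0-based index z is its uppercase partner
    return ch                   # uppercase or non-letter: unchanged

assert to_upper("a") == "A" and to_upper("m") == "M"
assert to_upper("Q") == "Q" and to_upper("*") == "*"
```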
He also suggests using AND instead of my “C-32” subtraction:
In case of a conversion with A$=CHR$(ASC(A$)-32), this could be slightly faster: A$=CHR$(ASC(A$)AND M), but only with a predefinition of M=223 (with the constant 223 it would be slower!), which masks off bit 5 (value 32) in the character code representation.
Challenge accepted:
0 DIM A$,C,M:A$="*"
1 M=&HDF '11011111
5 DIM TE,TM,B,A,TT
10 FORA=0TO4:TIMER=0:TM=TIMER
20 FORB=0TO1000
40 C=ASC(A$):IF C>=97 AND C<=122 THEN A$=CHR$(C AND M)
70 NEXT:TE=TIMER-TM
80 TT=TT+TE:PRINTA,TE
90 NEXT:PRINTTT/A:END
This returns 963, while my original using subtraction returned 960. It looks like AND is slower. HOWEVER, if you know the input is already ONLY “a-z”, you can just do the AND without the range check. So there is a use-case: if you were scanning for “yYnN”, you could just AND the input and compare only against the uppercase characters…
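A quick model of the masking trick, and of the check-free “yYnN” use-case, with Python standing in for the BASIC (MASK is John’s M=223):

```python
MASK = 0xDF  # %11011111 -- clears bit 5, the only bit differing a-z vs A-Z

def fold_upper(ch):
    """Range-checked version, like 40 C=ASC(A$):IF C>=97 AND C<=122 THEN ..."""
    c = ord(ch)
    return chr(c & MASK) if 97 <= c <= 122 else ch

# Check-free use-case: when scanning only for "yYnN", AND first, then compare
# against uppercase only -- 'y'/'Y' both become 'Y', 'n'/'N' both become 'N'.
assert all(chr(ord(ch) & MASK) == "Y" for ch in "yY")
assert all(chr(ord(ch) & MASK) == "N" for ch in "nN")
assert fold_upper("b") == "B" and fold_upper("*") == "*"
```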
It’s fun to think outside the box.
Any other suggestions?
Until then…
It’s not surprising that the divide-by-two method was substantially slower than just duplicating numbers. It adds an additional expression to parse and then does the slowest of the four basic arithmetic operations. (Multiplication is faster but still quite slow even with the boost provided by MUL, while addition and subtraction are fairly quick on average.) It might still end up being faster to use the division trick if you have a large number of input options.
AND being slower than subtraction actually makes sense because AND has to do a “convert to 16 bit integer” dance for both operands which subtraction doesn’t have to do. The difference isn’t huge, but there are more steps in the AND operation. Subtraction just has to match the exponents, do a quick complement/negation of the mantissa, and then do a 32 bit add. Both operations have to renormalize afterward. Converting to a 16 bit integer requires somewhere between 16 and 31 right shifts of a 32 bit mantissa.
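The claimed shift range can be illustrated with a small sketch. This is my own model, assuming the Microsoft float layout where a normalized 32-bit mantissa represents mantissa × 2^(exp−32):

```python
def shifts_to_int(value):
    """Right shifts needed to move a normalized 32-bit mantissa all the way
    down so the integer lands in the low bits, for a positive 16-bit value."""
    exp = value.bit_length()   # the float's (unbiased) exponent
    return 32 - exp

assert shifts_to_int(1) == 31       # worst case: 31 single-bit shifts
assert shifts_to_int(32768) == 16   # magnitude of -32768: 16 shifts
```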
I looked at my first shot,
40 ON (LN+1)/2 GOTO 100,200
again.
Avoiding the parentheses and shortening the expression by using a shifted-by-one string, LN=INSTR("_YyNn",A$), one can write
40 ONLN/2GOTO 100,200
but it still costs too much time. The first character can be caught by the line following line 40.
A variation isn’t the breakthrough either (as William already stated regarding logical operators), but it looks interesting (if somewhat obfuscated):
40 ONLN OR1GOTO 41,,100,,200,,
The OR 1 operation maps each even number to the following odd number (2,3 -> 3; 4,5 -> 5; …). But it is also slower than LN/2. The even ON…GOTO line numbers may be omitted (the interpreter simply skips over the commas).
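The pairing behind OR 1 is easy to verify outside BASIC; a quick Python check:

```python
# OR 1 sets the low bit: every even n maps to the following odd number,
# odd n is unchanged -- so the pairs (2,3), (4,5) collapse onto 3 and 5.
mapped = [n | 1 for n in range(1, 7)]
assert mapped == [1, 3, 3, 5, 5, 7]
```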
However, the motivation for all these variants is not speed, but to avoid the doubled line numbers in ON…GOTO, which might be error-prone. ;)
I like it! :)
> Converting to a 16 bit integer requires somewhere between 16 and 31 right shifts of a 32 bit mantissa.
As far as I understood the routines from the ROM listing, this is not that bad. A float in the proper range (which can be easily checked by looking at the exponent) is already normalized. Bringing the leading bit of the mantissa to the right place takes, in the worst case (for the value 1), 15 shifts to the right. The floating point accumulator calculates with a 40-bit mantissa (with an additional high-precision byte), and shifts by 8 are optimized by simply copying the mantissa bytes. So, in the worst case, you have one 5-byte copy and 7 single shifts of the 40-bit mantissa. The result can be taken from the most significant (top) 2 bytes of the mantissa.
But whatever this takes, a subtraction is faster anyway, as clearly shown by William.
Another thing to look at:
The measurement for the AND variant seems incorrect (aside from the garbled syntax introduced by the posting?), because A$=”*” has character code 42 and never reaches the THEN branch (to actually exercise the AND operator). It has to be measured with an initialization of something like A$=CHR$(98).
With this value, the time of 920 for the INSTR() variant probably won’t hold, either. :(
In addition:
> 40 Z=INSTR(AL$,A$): IF Z>1 THEN C=64+Z/2 ELSE C=ASC(A$)
This is not exactly comparable because the result is only the character code, not the character itself. I think the following variation has to be measured:
40 Z=INSTR(AL$,A$): IF Z>1 THEN A$=CHR$(64+Z/2)
If I’m not wrong, this makes a further round of measurement necessary to get comparable duration values. :\
Great! More work! :) I see what you mean by the “*”.
Actually, the integer conversion routines shift all the way to the right of the 32 bit mantissa before grabbing the integer value. As you note, that will be somewhat optimized as two extra byte copy steps, but that still takes time. This operation does not use the extra calculation bits. That means *three* byte copies and 7 shifts in the worst case (1) and two byte copies in the best case (-32768).
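The byte-copy optimization described here amounts to splitting the shift count into byte moves plus leftover single-bit shifts; a quick model (my sketch, assuming shifts by 8 are done as byte copies):

```python
def shift_cost(total_shifts):
    """Split a right-shift count into whole-byte copies (shifts of 8) plus
    leftover single-bit shifts, as the optimized shift loop effectively does."""
    return total_shifts // 8, total_shifts % 8

assert shift_cost(31) == (3, 7)   # worst case (value 1): 3 byte copies + 7 shifts
assert shift_cost(16) == (2, 0)   # best case (-32768): 2 byte copies, no shifts
```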
Obviously, your method would be better. But that’s not what the ROM does.
The integer to floating point routines do start with the integer value in the two most significant bytes, though, and are consequently faster for normalization.
Are the 6502 and 6809 Microsoft BASICs really this similar in how they do floating point and such?
I know the C64 BASIC ROM well, and the cross-checks I casually do (mostly in the Dragon 32 ROM listing, which is similar to the CoCo) revealed no great differences.
I doubt the three-byte theory. As far as I can see (I checked it right before I posted), the right-shift subroutine is a standard routine ($BA9A, Color Basic Unravelled) which is also used for other things, not only integer conversion. And this routine copies 5 bytes (including a most significant carry byte and 4 mantissa bytes, involving the sub-byte which is meant by the “extra calculation bits” above). Have I overlooked something?
As a rule of thumb, small absolute values cost more, and a negative value adds an extra penalty for a full-mantissa 2’s complement. Since the number of shifts depends only on the float exponent, and the mantissa for signed numbers is not in 2’s complement representation, the value -1 is the worst case (it has the same exponent as 1) because it needs an additional mantissa 2’s complement. -32768 has a mantissa value of 32768, with an exponent indicating no (or only one) shift is needed, but the additional 2’s complement for the sign is still required. Perhaps in an implementation for a 6809 (as opposed to a 6502, where this is not the case), -32768 could have an advantage if the 2’s complement costs less than one shift. Otherwise, all values >= 16384 can be regarded as a best case.
My point was that it doesn’t just shift so the integer is in the upper two bytes of the mantissa. It shifts so that it’s in the *lower* two bytes of the mantissa. That means two extra byte-level shifts. I was using “byte shift” as shorthand for “move 4 bytes to the right”.
Obviously, if it just shifted so the integer was lined up in the top two bytes of the mantissa, it would be better. But that isn’t what it does.
Oh, now I see it; you are completely right (and I’m wrong). A 1 has to go through 31 bit positions, fully to the right, to be placed on bit 0 (the least significant bit) of the mantissa. I overlooked how the result is extracted (the 2 lowest bytes! OMG). Holy sh…, I never expected that they did it the “long way”… Now I am no longer sure whether the way I sketched above is generally applicable, or whether there might be a strong reason to go the long way… However, as you said, it’s not what’s in the ROM and therefore irrelevant.
Thank you for the clarification, William.
Actually, there’s no particular reason to need to go all the way to the right. It would work just as well to align the binary point with the middle of the mantissa.
It looks like a code-space-saving technique. They already had the “shift the binary point to the right of the mantissa” code to implement INT() (it short-circuits if the exponent is high enough that there’s no fraction). Basically, the logic is to shift the binary point to the right, then renormalize. That effectively gets rid of the fraction without having to worry about masking things off starting at an arbitrary bit. Since they already had the code that does that, they could simply repurpose it instead of going only half way (which would have taken the same amount of code again).
For maximum efficiency, INT() should work out which byte the binary point falls in, work out the mask for the relevant bit with a lookup table, and then just mask off the unneeded bits. No need to denormalize or renormalize. It could even use a lookup table for byte offsets and bit masks if it wanted. That would be really fast, even if it would need a fairly sizeable lookup table.
Integer conversion should align the binary point to the right of the highest mantissa byte where the desired integer size fits to the left, grab the result, and *then* negate it if it was negative as opposed to negating first. You could even skip the floating point “is it -32768” check there by doing an unsigned conversion and checking the range issues on the integer side. It would be faster than doing a floating point comparison.
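As a sketch only (a hypothetical helper, not ROM code), that “negate last” scheme might look like this, with mantissa the normalized 32-bit mantissa and exp the unbiased exponent, so the value is mantissa × 2^(exp−32):

```python
def to_int16(mantissa, exp, negative):
    """Align the integer with the top two mantissa bytes (at most 15 shifts),
    read it out unsigned, then negate and range-check on the integer side."""
    assert 1 <= exp <= 16 and mantissa & 0x80000000
    aligned = mantissa >> (16 - exp)   # binary point now below the top 2 bytes
    magnitude = aligned >> 16          # unsigned result from the top two bytes
    if negative:
        if magnitude > 32768:
            raise OverflowError("below -32768")
        return -magnitude
    if magnitude > 32767:
        raise OverflowError("above 32767")
    return magnitude

assert to_int16(0x80000000, 1, False) == 1       # 1.0
assert to_int16(0x80000000, 16, True) == -32768  # edge case, no float compare
```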
Fully agree. As you said, masking off the fraction for INT() would be better done with a table. Taking the lower 3 bits of the exponent, an 8-byte table could suffice. I’m thinking of something like $00,$80,$C0,$E0,$F0,$F8,$FC,$FE. Alas, no such table seems to exist in the ROM for reuse. This, plus the extra code to handle the masking, would be too much (maybe possible for the 6809, but not for the 6502, which already has a larger code size).
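For illustration, the suggested 8-byte table and the masking step might be modeled like this (my sketch; int_bits would come from the exponent):

```python
# Per-byte masks keeping 0..7 leading bits: $00,$80,$C0,$E0,$F0,$F8,$FC,$FE
MASKS = [(0xFF << (8 - i)) & 0xFF for i in range(8)]
assert MASKS == [0x00, 0x80, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE]

def int_floor(mantissa, int_bits):
    """Clear the fraction of a 32-bit mantissa whose top int_bits bits are
    the integer part -- masking in place instead of shifting back and forth."""
    byte_index, bit_index = divmod(int_bits, 8)
    mask = 0
    for b in range(4):
        if b < byte_index:
            mask |= 0xFF << (8 * (3 - b))               # whole bytes kept
        elif b == byte_index:
            mask |= MASKS[bit_index] << (8 * (3 - b))   # partial byte via table
    return mantissa & mask

# 1.5 with one integer bit: the fraction half is cleared
assert int_floor(0xC0000000, 1) == 0x80000000
```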
My theory is that MS didn’t want to nail it down to a specific mantissa size and structure, and left the integer size open (even though, AFAIK, no interpreter uses more than 16-bit integers, especially where floating point exists). The negation operation on the full mantissa before the denormalization could be an indication of this, in my eyes.
I suspect a lot of the stuff in Color Basic that seems illogical is down to limitations of less capable CPUs carried forward. I know of at least two cases that can be substantially improved over the code that’s actually there by taking proper advantage of the 6809’s capabilities. For instance, the stack search routine for FOR/GOSUB records goes out of its way to use abx when it could just use leax 18,x and avoid burning B; it could also use D for comparing variable descriptor addresses instead of a funky dance shunting X all over the place.
Yeah, thanks for the pointer – really nice this awful ROM part … maybe a Z80 programmer translated the 6502 code to 6809 … :>
What CPU was the first Microsoft BASIC for? Altair?
You are right: MS Basic derived from Altair Basic on an 8080, and was after that ported to the 6502 and named Microsoft Basic. Nice reading is http://www.6502.org/users/andre/petindex/roms.html#hist
Some time ago I took a closer look at the garbage collector to benchmark it (compared to 6502 variants). I looked over it again, and your assertion above holds even there. Perhaps because my 6809 skills have evolved over the last 2 years, the lack of a real 6809 coding style suddenly became obvious to me. Instantly I saw some possibilities to optimize the code and could save 7 bytes (in a 33-byte subroutine). How does this come about? Thinking about the ROM style, the X-register-centric operation and going most of the time through direct page locations lead me to the presumption that at least the core part of the interpreter was ported from a 6800 source (which came before the 6502 version, I think). Compare that to the 6502 code, which is so tight that not even one spare byte can be found, except in intentionally unused places or via a large-scale re-implementation of functional parts.
However, the 6809 porting guys were happy to get nearly half the code size compared to the 6502, even in 6800 style; they weren’t forced to take a closer look at the code for further optimizations regarding time and space…