Color BASIC String Theory, part 1

Many aspects of Microsoft Color BASIC have always been magic to me. I never gave too much thought to how the BASIC interpreter does what it does.

String variables are one thing we all had to know a bit about. By default, Color BASIC sets aside 200 bytes for strings. If you needed more, or less, you learned to use the CLEAR command:

CLEAR 1000 'RESERVE 1000 BYTES FOR STRINGS

CLEAR erases all variables, so it has to be done before you define any:

10 A=1
20 PRINT A
30 CLEAR
40 PRINT A
RUN
1
0

We can test CLEAR from BASIC.

If we reserve only ten bytes for strings:

CLEAR 10

…then we should be able to create a string of ten characters or fewer:

A$="1234567890"

But, if we try to create anything longer than ten characters, we get an Out of String Space error (?OS ERROR):

CLEAR 10 means 11-character strings need not apply.

Any string variables you create from the command line will be stored in string memory. But strings can also be embedded in a BASIC program and not use any string memory:

10 CLEAR 10
20 A$="THIS STRING TAKES UP NO STRING MEMORY"

In the above example, A$ points to the text embedded inside the BASIC program. It can, therefore, be as long as a string can be since it takes up no string memory.

Consider this example… If you have 30 bytes reserved for strings, you should be able to make three 10-byte strings in memory:

10 CLEAR 30
20 A1$="1234567890" 'WORKS
30 A2$="1234567890" 'WORKS
40 A3$="1234567890" 'WORKS
50 A4$="X" '?OS ERROR!

Oops!

If you run that, it works just fine because those strings are being stored in the program, and NOT in string memory. I caught myself not paying attention to my own article!

To force a string to live in string memory, we have to do something to force BASIC to move it there. If the string is NOT declared as a contiguous string of text, BASIC has no choice but to make it in string memory:

20 A1$="12345"+"67890"

That is enough to do it. A1$ will be represented as “1234567890” somewhere in string memory.

And BASIC is not too smart. In this example, it looks like BASIC still COULD keep it in program memory (the string is contiguous), but it does not:

20 A1$="1234567890"+""

That will force the string into string memory, and now the program will fail with an ?OS ERROR on line 40:

NO STRING MEMORY FOR YOU!

With that said, let’s see what we must do to keep a program from crashing due to running out of string space…

First, we need to understand how much string memory the program needs and make sure we CLEAR that much. Suppose we wanted the user to enter a password of up to 10 characters. We could do this:

10 CLEAR 10
20 LINE INPUT "PASSWORD?";A$

If we run that, we can type up to 10 characters and press ENTER and it should work fine. BUT, if the user types more than 10 characters, it will crash with an ?OS ERROR.

The INPUT and LINE INPUT commands actually let you type up to 249 characters before they stop you, so if the programmer didn’t CLEAR extra string space, typing in a long line might be enough to crash the program.

As a programmer, if we plan to use INPUT, we need to have 249 bytes more than whatever string space we plan to use.

If we are going to be doing any string manipulation, we need to leave extra room for that, too.

Thus, if I was expecting the user to type in a 10 character password, and a 20 character username, and I know INPUT allows the user to type up to 249 characters, I guess I need this:

10 CLEAR 10+20+249

BUT, it will be important that the username and password strings are limited to just 20 and 10 characters, respectively. We can use LEFT$ to trim them down in case someone “accidentally” types too much:

10 CLEAR 10+20+249
20 LINE INPUT "USERNAME:";NM$:NM$=LEFT$(NM$,20)
30 LINE INPUT "PASSWORD:";PW$:PW$=LEFT$(PW$,10)

That code should now be bulletproof against a user typing in long strings at both those input prompts.

Side Note: The 249 value may be specific to the Microsoft Color BASIC on the Radio Shack Color Computer. Other versions of BASIC may have different input limits. I’d be curious to know what is similar or different on Atari BASIC, Commodore BASIC, etc.

Basically, we must make sure our CLEAR command covers the maximum characters the variable is allowed to contain, plus any extra required by something like INPUT.
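
The budgeting rule above is just addition, so it can be sanity-checked away from the CoCo. Below is a minimal sketch in Python (not Color BASIC). The names `clear_budget`, `left` and `INPUT_BUFFER` are made up for this illustration, and the 249 figure is simply the input limit quoted earlier.

```python
# Hypothetical helpers: they model the CLEAR arithmetic from the text, nothing more.
INPUT_BUFFER = 249  # longest line INPUT/LINE INPUT accepts, per the text


def clear_budget(max_lengths, input_buffer=INPUT_BUFFER):
    """Bytes to pass to CLEAR: every string at its maximum length, plus
    room for one over-long line typed at an input prompt."""
    return sum(max_lengths) + input_buffer


def left(text, count):
    """Rough stand-in for LEFT$(text, count)."""
    return text[:count]


budget = clear_budget([20, 10])   # 20-character username, 10-character password
print(budget)                     # 279, the same as CLEAR 10+20+249
print(len(left("X" * 249, 20)))   # 20: an over-long entry trimmed to its limit
```

This only counts the strings the program keeps plus one worst-case input line. It does not yet count temporary strings created while building new ones.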

However, there are other things that use temporary string space we must also account for. Suppose we were going to combine the username and password along with a numeric access level (say, 0-255) into a string, using some character to separate them:

USERNAME\PASSWORD\255

This is basically what I did for the userlog in my 1983 *ALLRAM* BBS. Rather than using three different arrays – an array of usernames, an array of passwords, and an array of levels – I wanted to use just one string array. This was because every variable (or array element) uses 5-7 bytes of memory (more details on this in a future article). The fewer variables you have, the less memory you use, so combining multiple elements into one string saved me quite a bit.

My maximum size for a userlog “record” becomes:

  • 20 bytes for the username
  • 1 byte for the delimiter character
  • 10 bytes for the password
  • 1 byte for the delimiter character
  • 3 bytes for the level number

A fully loaded userlog string would be:

12345678901234567890\1234567890\123 = 35 bytes
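
The same layout can be written out and measured. This is a small Python sketch rather than CoCo code. `build_record` and the constant names are invented here, and the function simply trims each field the way LEFT$ would before joining the fields with the backslash delimiter.

```python
NAME_MAX = 20      # username bytes
PASSWORD_MAX = 10  # password bytes
DELIMITER = "\\"   # a single backslash character


def build_record(name, password, level):
    """Join the three fields into one userlog string, trimming over-long
    fields first (as LEFT$ would)."""
    if not 0 <= level <= 255:
        raise ValueError("level must be 0-255")
    return DELIMITER.join([name[:NAME_MAX], password[:PASSWORD_MAX], str(level)])


record = build_record("12345678901234567890", "1234567890", 123)
print(record)       # 12345678901234567890\1234567890\123
print(len(record))  # 35
```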

Therefore, if I was going to make such a record, I’d need 35 bytes for each record I allowed. For example, if I wanted five users, I’d need to reserve enough memory for five of those userlog strings, plus room to input a username and password, and the extra overhead for the INPUT buffer just in case the user tries to type too much:

10 CLEAR 5*35+20+10+249
20 DIM UL$(4) 'BASE 0, SO 0-4
30 FOR A=0 TO 4
40 PRINT "USERNAME";A;":";
50 LINE INPUT NM$:NM$=LEFT$(NM$,20)
60 PRINT "PASSWORD";A;":";
70 LINE INPUT PW$:PW$=LEFT$(PW$,10)
80 UL$(A)=NM$+"\"+PW$+"\0"
90 NEXT

If you run that, you can enter a username and password for each of the five users. There is enough memory reserved that even if every name uses the full 20 characters, and every password the full 10, it still shouldn’t crash from an Out of String Space error.

BUT … we got lucky. That extra 249 bytes we reserved for the INPUT buffer is actually keeping this from crashing on line 80. This is because line 80 has to create a new temporary string, so there has to be room to hold the original NM$ and PW$ plus room to create the new copies of those strings with the delimiter characters and the level at the end. Even though this string only exists for a moment, it still has to exist somewhere.

Let’s remove the 249, and ONLY TYPE the max characters (10 or 20):

10 CLEAR 5*35+20+10

Now if we run, we get a different result:

Sorry my program crashed, sir. I thought I covered all the string memory usage…

Even though we had enough memory to hold our temporary 20 character username (used by INPUT) and 10 character password (also used by INPUT), plus enough memory to hold the combined userlog strings, line 80 needed even more to create the new string.
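
Here is a rough tally of that worst case. It assumes that line 80 needs the finished record plus one temporary holding everything except the last piece, which is a guess at the interpreter's behavior that the measurements later in this article support. Python is used only as a calculator, and the names are hypothetical.

```python
RECORD = 35          # full userlog string
NAME, PASSWORD = 20, 10
USERS = 5

# State when line 80 runs for the fifth user:
stored_records = (USERS - 1) * RECORD   # UL$(0)..UL$(3) already saved
inputs = NAME + PASSWORD                # NM$ and PW$ still in use
temporary = NAME + 1 + PASSWORD         # name + delimiter + password, built first
new_record = RECORD                     # the string being assigned to UL$(4)

peak = stored_records + inputs + temporary + new_record
print(peak)                                             # 236

print(peak <= USERS * RECORD + NAME + PASSWORD)         # False: 205 bytes is too small
print(peak <= USERS * RECORD + NAME + PASSWORD + 249)   # True: the 249 spare bytes cover it
```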

Usually we just give CLEAR more memory and are done with it, which is fine for casual BASIC programming when you have plenty of memory. But if you need every last byte for a huge program, or one with tons of variables and strings, you can’t afford to waste any.

I wondered how much this needed, so I wrote this test:

10 CLEAR 20+10+35
20 NM$="12345678901234567890"+""
30 PW$="1234567890"+""
40 UL$=NM$+"\"+PW$+"\255"

I used the +"" to force the strings to be placed in string memory.

It looks like this should work since the new UL$ will only contain the 20 character username plus 1 character delimiter plus 10 character password plus 1 character delimiter plus 3 character level…

But, during the process of building the UL$, it seems multiple temporary strings are created.

From trial and error, I found that adding 31 bytes was enough to make it work:

10 CLEAR 20+10+35+31

31 bytes doesn’t really fit anything we have. That’s the 20 character name plus delimiter plus password … but what about the rest? There’s still the delimiter plus three character level at the end: “\255”

Look at line 40 … see the “\255”? That’s where the extra four bytes are. It looks like BASIC doesn’t need to count that string since it’s already in the code. It still needed one byte for the first delimiter, though, so perhaps being at the end matters?

If I try this:

40 UL$=NM$+"\255\"+PW$

…then it works with 25 extra bytes. Er… It looks like it’s not that the last part was a four character string. It looks like it’s just the “last” element. In this case, the password is 10 characters, so that would get us back to our 35 character string size we reserved.

Let’s test this further:

40 UL$=PW$+"\255\"+NM$

Now it works with 15 extra bytes. That’s the password (10 characters) plus the bit in the middle (5 characters, “\255\”).

This tells me that whatever string manipulation is happening, BASIC only needs extra string memory for everything but the last piece being appended. Thus, something like this should need 11 bytes for the finished string plus 10 for the temporary:

10 CLEAR 11+10
20 A$="12345"+"67890"+"X"

As I expected, it does. I need enough reserved string space for BASIC to build the first part of the string in a temporary location before doing the final assignment of temporary + final bit.

Thus:

10 CLEAR 10+9
20 A$="123"+"456"+"789"+"0"

Now that I work through it, it makes perfect sense. Above, BASIC is creating a temporary string and doing this:

  • store “123” in temporary string (usage: 3 bytes)
  • store “456” in temporary string (usage: 6 bytes)
  • store “789” in temporary string (usage: 9 bytes)
  • copy temporary string to A$ and append “0” (usage: 10 bytes)

This seems to be a good educated guess, but could be confirmed by consulting the source code.
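
That guess can be turned into a small model and checked against the numbers measured above. The sketch below is Python, not Color BASIC, and `concat_peak` and `extra_needed` are made-up names. It assumes each `+` allocates a fresh string for its result while the previous intermediate result is still alive, and that older intermediates can be reclaimed. That assumption happens to reproduce the results in this article; it is not taken from the interpreter source.

```python
def concat_peak(parts):
    """Model parts[0] + parts[1] + ... evaluated left to right.

    Returns (final_length, peak_bytes). The operands themselves are not
    counted: literals sit in program text, and variables are budgeted
    separately.
    """
    live = 0                # size of the intermediate result still alive
    peak = 0
    total = len(parts[0])
    for part in parts[1:]:
        total += len(part)
        peak = max(peak, live + total)   # new result + previous intermediate
        live = total
    return total, peak


def extra_needed(parts):
    """Bytes required beyond the finished string itself."""
    final, peak = concat_peak(parts)
    return max(peak - final, 0)


NM = "12345678901234567890"   # 20-character username
PW = "1234567890"             # 10-character password

print(concat_peak(["123", "456", "789", "0"]))   # (10, 19) -> CLEAR 10+9
print(concat_peak(["12345", "67890", "X"]))      # (11, 21) -> CLEAR 11+10
print(extra_needed([NM, "\\", PW, "\\255"]))     # 31
print(extra_needed([NM, "\\255\\", PW]))         # 25
print(extra_needed([PW, "\\255\\", NM]))         # 15
```

In this model the extra space always works out to the combined length of every piece except the last, which is why moving the longest piece to the end of the expression gives the smallest overhead.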

I guess it looks like the amount of temporary string usage can be predicted just by looking at all the places in the code where string manipulation is done, and calculating which one is the largest and adding that much extra string space.

BUT, there are other rules for things like LEFT$, MID$, RIGHT$, HEX$, etc.

10 CLEAR 9+4
20 A$="12345"+HEX$(65535)

Above, that needs 4 extra bytes to create the HEX$ string (“FFFF”) and then append it.

10 CLEAR 10+5
20 A$="12345"+LEFT$("67890ABCDEF",5)

Above, we need 5 extra bytes for the temporary 5 byte string that LEFT$ is going to create and append.

10 CLEAR 10+5
20 A$="12345"+MID$("ABCDEFG",1,5)

Above, 5 extra bytes are needed because MID$ creates a 5 byte string to append.
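
The three measurements follow one pattern: the finished string plus the temporary string the function returns. A Python sketch of that arithmetic follows. The helper names are invented, and the Python expressions only stand in for HEX$, LEFT$ and MID$.

```python
def hex_str(value):
    """Stand-in for HEX$: uppercase hex digits, no prefix."""
    return format(value, "X")


def left(text, count):
    """Stand-in for LEFT$."""
    return text[:count]


def mid(text, start, count):
    """Stand-in for MID$; start is 1-based, as in BASIC."""
    return text[start - 1:start - 1 + count]


def clear_for(literal, temporary):
    """Finished string plus the function's temporary result."""
    return len(literal + temporary) + len(temporary)


print(clear_for("12345", hex_str(65535)))           # 13, i.e. 9+4
print(clear_for("12345", left("67890ABCDEF", 5)))   # 15, i.e. 10+5
print(clear_for("12345", mid("ABCDEFG", 1, 5)))     # 15, i.e. 10+5
```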

Now it all makes sense.

Wow. That was one heck of a tangent.

I am going to say it’s probably safe to just add an extra 255 to whatever string space you think you are using, since no string can ever be larger than 255. I guess I’ll have to test that sometime, too.

Next time, I’ll get back to my point…

6 thoughts on “Color BASIC String Theory, part 1”

  1. William Astle

    String manipulation is, in fact, entirely predictable, and logical. I think I’ll do an article on the internals of string evaluation and manipulation in Color Basic. It’s actually fairly straightforward, but the whys and hows of it get a bit complex.

  2. MiaM

    Commodore Basic just grows string space downwards from memtop, and other variables grow upwards from the end of the basic source code. Thus there is no need for a clear statement like this. I think all Microsoft basics for 6502 behave in this way.

    Is there any reason for the 6809 basics to not do it this way? Perhaps something with pointer handling inside 6809 making it more efficient to do it this way?

    Regarding how long a string you can input, it differs between Commodore basics. VIC-20 lets you enter 88 chars (4 rows of its tiny 22 columns wide screen). C64 lets you enter 80 chars (2 rows of 40 columns) and iirc C128 lets you enter 160 chars (4 rows of 40 cols or 2 rows of 80 cols). My memory might not be perfect, perhaps the last character space is needed for being able to press return while still being inside the limit or something like that, so it might be 87, 79 and 159 chars instead of 88, 80 and 160.

    1. Allen Huffman Post author

      From William Astle’s reply, I gather that the reserved string space was used to control when the garbage collection would be triggered. On the 6502 BASIC, you didn’t need to specify? You could use 20K of string space if you wanted?

      The different line lengths is interesting as well. I think it was William Astle that also pointed out the reason the CoCo length was what it was was that it used some buffer area used for other things as well, and that was as much memory as was available there. (Thus, why it’s not 255, the length of the line.)

      I plan to use a Commodore emulator (VICE, or something web-based) to try the same benchmark tests there. I am curious if the interpreters are similar enough that the same tricks give similar speed differences on 6502.

    2. William Astle

      String space in Color Basic grows downward, too. And the variable table grows upward starting at the end of the program.

      On the 6809, the stack pointer is 16 bits and can be anywhere so there’s no artificial limit on how big the stack can grow. On the 6502, this is not the case. The stack is in a fixed location and has a maximum size (8 bit stack pointer). That means you don’t have that third item that has to be in “free” memory.

      Because the 6809 stack can be anywhere and can grow arbitrarily large, it is useful to put it somewhere that can take advantage of that fact. And there’s no reason not to use it for GOSUB/RETURN and FOR/NEXT records as well as holding the recursion records during expression evaluation.

      That means Color Basic has three things in memory. The variable table growing upward. The stack growing downward. And string space which grows down (but could grow upward if desired). The ideal situation would be to have all three grow arbitrarily large. However, a bit of thought will show that can’t be the case.

      Of the three conflicting items, the one that is easiest for a typical programmer to identify requirements for is string space, since that requires less understanding of the internals of the interpreter.

      On the 6502, the hardware limits the stack for you so there’s no conflict and as a result you don’t need to bother with any sort of limits on the other two items.

