Compressing BASIC DATA with Base-64 – part 1

See also: part 1, part 2, part 3 and part 4.

NOTE: This article was started a year or two ago, so some references may no longer be pertinent.

NOTE 2: It was then updated in April 2020 before finally being published in November, so some references in the updates may no longer be relevant.

There is some kind of “10 line BASIC” contest, and one of the categories allows for assembly language as long as it can be embedded in a typeable BASIC program. I previously discussed embedded assembly in BASIC in my Interfacing BASIC with Assembly series.

One of the examples I gave was a Pac-Man maze demo that used some assembly code to scroll the screen up and down. The loader looked like this:

2000 REM
2001 REM LOAD ASSEMBLY ROUTINE
2002
2010 READ A,B
2020 IF A=-1 THEN 2070
2030 FOR C = A TO B
2040 READ D:POKE C,D
2050 NEXT C
2060 GOTO 2010
2070 RETURN 'END
2080 DATA 16128,16217,189,179,237,90,39,14,90,39,28,90,39,42,90,39,55,204,255,255,32,67,142,4,32,166,132,167,136,224,48,1,140,5,255,47,244,32,47,142,5,223,166,132,167,136,32,48,31,140,4,0,44,244,32,30,142,4,1,166,132,167,31,48,1,140,5,255,47,245,32,14
2090 DATA 142,5,254,166,132,167,1,48,31,140,4,0,44,245,204,0,0,126,180,244,-1,-1

That particular bit of BASIC code was created by the LWASM assembler. It reads assembly language values from DATA statements and POKEs them into memory where they can be executed.

In another series, I discussed various ways to use DATA statements either for the fastest reading/loading or the smallest size.

Today, I’d like to revisit this subject and offer some more ways to compress data into DATA to store even more than before.

Or something like that.

Knowing our limitations

We know the BASIC input buffer is 249 characters long, so we should be able to type in a single digit line number, the keyword “DATA” and 244 more characters.

BASIC allows typing in up to 249 characters.

Above, I have a one digit line number (“0”), then four characters for the keyword (“DATA”) then seven full lines of 32 characters (32*7=224) plus 20 characters on the final line – so 1+4+224+20 is 249. Hey, it works!

Since loading a BASIC program saved in ASCII is the same as typing it in, that limit should also apply for loading a non-tokenized ASCII program. I will be using that method here to load test programs into the XRoar emulator, so I won’t be able to cheat and load a tokenized line that would have exceeded our typing limit.

If we encode our assembly code as 2-digit hex values, we should have room to type a single digit line number, the keyword “DATA”, and 122 2-digit hex values. We could enter all the hex digits from &H00 to &H79 on a line. This appears to work:

The BASIC input buffer is 249 characters, so that’s the most you can type before pressing ENTER.

As soon as we press enter, BASIC tokenizes the “DATA” keyword. It no longer takes up four bytes. This means even though we typed the full 249 characters, we could EDIT the line and perhaps type a few more characters. Typing “EDIT 0″…

…then pressing “X” to extend (go to the end) of the line allows us to add two more digits:

In EDIT mode, there are now two more characters available since DATA has been tokenized.

I am guessing DATA is tokenized into a 2-byte token. However, if you add these two characters, then re-list the line:

…we see BASIC does not show the final character. However, if you use the READ command to read that line in to a string, and the PRINT it, you see it’s actually there:

BASIC is storing all the characters, but LIST cannot show them.

The BASIC LIST command has a limit to how much it will display, it seems, and it is one character less than it should be. Bug?

BASIC line packing

It is possible to create BASIC lines that contain much more data than you can type in. They will run fine, but cannot be fully listed. There were several BASIC “crunch” programs available, including one by Carl England that I used often, that did this trick.

However, if we want to stick to how much someone could type in, we need to limit ourselves to that 249 buffer, and not rely on doing the EDIT trick to add more to it.

If that is the case, the most amount of HEX encoded assembly language bytes you can fit on a BASIC line is 122 per single-digit line. The code to read in those lines and POKE them in to memory can easily fit into one line as well, so we could easily fit a 1098 byte assembly language program into a ten line BASIC program.

And, we could even stick a few more bytes on the end of the loader line. Using a simple test program, I figured I could get the loader plus 1167 bytes of assembly code POKEd into memory. It looked like this:

0DATA000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778
1DATA797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1
2DATAF2F3F4F5F6F7F8F9FAFBFCFDFEFF000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A
3DATA6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3
4DATAE4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C
5DATA5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5
6DATAD6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E
7DATA4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7
8DATAC8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F40
9 L=16128:FORA=0TO9:READA$:FORB=1TOLEN(A$)/2:POKEL,VAL(MID$(A$,B*2,2)):L=L+1:NEXT:NEXT:DATA4142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E
100 BT=0:FORA=0TO9:READA$:FORB=1TOLEN(A$)/2:PRINTMID$(A$,B*2,2);:BT=BT+1:NEXT:NEXT:PRINTBT"BYTES":END

Notice LINE 100. It is not part of the program. I can load this and type “RUN 100” to get it to dump all the data and show me what it would be POKEing into memory, as well as the final byte count. For a contest entry, it would only include lines 0-9. (Using the EXIT/eXtend trick, you could probably get another 9 or 10 bytes total, but they might consider that cheating, and maybe someone could optimize the loader code to make even more room.)

We can do better…

Over on the Facebook group, someone (who was it?) suggested using some alternative encoding to get even more data in to the DATA statement. There are existing forms of doing this with 7-bit characters, using the printable range of characters.

The wikipedia has a page showing many different implementations of binary to text encodings.

Since we need to encode things that can be typed in, this puts some limits on what we can do. Characters 0-32 are control characters, so normal printable text characters start with ASCII 32 (space) and go to ASCII 127.

See https://en.wikipedia.org/wiki/ASCII for the list.

There are some other characters in this range that we cannot type on a CoCo keyboard. Here is the printable CoCo character set:

Printable CoCo characters from ASCII 32 to ASCII 127.

We can type the inverted (lower case) letters “a” through “z”, but we have no way to type the inverted @ symbol, or the inverted left bracket, backslash, right bracket, up arrow or left arrow.

But, just because they are typeable does not mean we could use them all in a DATA statement. Quote, for instance, is okay if it’s in the middle of a string that is read…

10 READ A$:PRINT A$
20 DATA ABC"DEF

…but if the quote started up at the beginning of a string of characters, BASIC would skip it and join everything after it together until it sees another quote or an end of line:

10 READ A$:PRINT A$
20 DATA “ABCDEF

Also, comma is a problem. READ separates data using the comma UNLESS the data is quoted. This would only read the first ABC:

10 READ A$:PRINT A$
20 DATA ABC,DEF

…but this would read “ABC,DEF” as one string:

10 READ A$:PRINT A$
20 DATA “ABC,DEF”

So even though we can type about 91 characters, we can’t use all of them in a DATA statement.

Reduce the bass! (Er, base…)

It seems we could achieve a base-90 (?) format by having each line start with a quote (thus we could include a comma in the DATA), but the decoder would have to be much larger.

So instead of that, we’ll turn the base down a bit and do something a bit easier in the next installment.

Until next time…

MiaM November 12, 2020 at 2:02 am

Re “invalid”/special characters: Does the basic code that decodes the data even need to use READ? It could just as easy just traverse the program stored in memory. That would allow for example ” and ,

Also if it saves basic code space, that program could incorrectly traverse the actual basic code instead of only the trailing data statement on the “decoder line” to save code space. The resulting data written to memory would then have to be treated as having a hole filled with junk (basexx-decoded tokenisized basic code). It might also be worth checking if it’s even worth dealing with the separate lines when decoding, or just peek through about 2,5k of basic code memory and decode it as if it all is correctly encoded data even though it would be line numbers, the “data” token itself, the decoder and whatnot. By doing that and ending the decoder with a SYS, you might even be able to have the encoded data without a leading statement as that part of the program will never be executed, just read with peek. A problem with not having any basic statement is how to enter data that would be interpreted as tokens. Probably best to use a DATA or REM statement after all, or maybe use one of those where it is needed?

It would require a bit of cleverness in the coding of the assembler program to make sure that the “junk data” parts are unused or used for uninitialized data. Not super hard though as any decent assembler should make it possible to assign blocks of code to end up at different spaces, assign those “junk” block as containg uninitialized data and have the assembler spit out an error message if something incorrectly overlaps.

Bonus: There is a chance that a 1-2k basic program not uses every byte value 0-255. By wirting at least some kind of preliminary assembler program you could evaluate what data is actually needed to be possible to store, and perhaps ignore that some values at certain places might nog be writeable.

Extra bonus: I don’t know 6809 that well, but is there perhaps even a chance of writing code that mostly/only corresponds to writeable characters? (Compare with that this was the only way to be able to load/save machine code programs on a ZX81, stored as “random chars” in a REM statement, afaik also making some binary value impossible (whatever was used to indicade end of line)). I assume not, but might be worth investigating. Perhaps parts of the assembler program can be stored just as “binary text”, and some parts could “almost” be stored that way, i.e. “text” with some instructions for the basic program (or initial assembler program) to patch the binary to make it correct.

All this is making it harder and harder as the actual wanted program and the encoding kind of needs to be written in parallell, having dependencies on each other.

Reply ↓

13 thoughts on “Compressing BASIC DATA with Base-64 – part 1”

William Astle November 11, 2020 at 1:52 pm

Just a detail about the extra characters allowed by EDIT: it has nothing to do with tokenization. It’s because EDIT uses the entire line input buffer for line text. However, when typing in a line (or reading it from an ASCII file), the line number takes up space, as does the space after it if present. I think there is also be another oddness that allows EDIT to use one more byte than the line input routine.

Thus, in your example, one extra byte comes from the line number and one comes from some internals related to how EDIT works compared to line input.

Keep in mind that EDIT has to detokenize the line before it can allow you to edit it in the first place, and that has to go somewhere, and that somewhere is the line input buffer.

Loading...

Reply ↓
1. Allen Huffman Post authorNovember 11, 2020 at 1:53 pm
  
  Are you certain about this? Let me do some quick tests.
  
  Loading...
  
  Reply ↓
2. Allen Huffman Post authorNovember 11, 2020 at 2:30 pm
  
  Looks like tokens to me. Do a PRINT”xxxx” to the end, so quote is the last type able. ENTER. Then EDIT and you get four characters extra you can type.
  
  Loading...
  
  Reply ↓
  1. William Astle November 30, 2020 at 12:08 pm
    
    It’s definitely not tokens. You’ll never get more than a half dozen extra characters in there with edit. The “?” shortcut for PRINT can let you get a crazy amount of extra stuff in a line if you use multiple PRINT statements, though. And then you won’t be able to edit or list the line properly since it will overflow the buffer when detokenizing for listing or editing.
    
    Loading...
    
    Reply ↓
    1. Allen Huffman Post authorNovember 30, 2020 at 12:11 pm
      
      It seems to give more bytes with a longer keyword than with just typing non tokens, though?
      
      Loading...
      
      Reply ↓
      1. William Astle November 30, 2020 at 12:31 pm
        
        Here’s the evidence that it isn’t tokens:
        
        In the ROM code for EDIT, at address $8543 there is a JSR $B7C2. $B7C2 is the “detokenize line to input buffer” routine, also used by LIST (and ASCII saves and LLIST, which work by redirecting LIST to a device other than the screen).
        
        Elsewhere in the EDIT code, it displays the line number separately (with a call to $BDCC) which LIST also does.
        
        Editing of the line itself works solely on the text contents of the line input buffer after that. Which does not contain the line number.
        
        When EDIT is done, it jumps into the middle of the immediate mode loop where the line number has already been parsed but the line contents hasn’t yet been tokenized. Then the usual “insert a line into the program” sequence occurs (delete old line, make a hole for the new line, insert new line).
        
        Loading...
      2. Allen Huffman Post authorNovember 30, 2020 at 12:32 pm
        
        What causes the extra bytes to appear, then?
        
        Loading...
MiaM November 12, 2020 at 2:02 am

Re “invalid”/special characters: Does the basic code that decodes the data even need to use READ? It could just as easy just traverse the program stored in memory. That would allow for example ” and ,

Also if it saves basic code space, that program could incorrectly traverse the actual basic code instead of only the trailing data statement on the “decoder line” to save code space. The resulting data written to memory would then have to be treated as having a hole filled with junk (basexx-decoded tokenisized basic code). It might also be worth checking if it’s even worth dealing with the separate lines when decoding, or just peek through about 2,5k of basic code memory and decode it as if it all is correctly encoded data even though it would be line numbers, the “data” token itself, the decoder and whatnot. By doing that and ending the decoder with a SYS, you might even be able to have the encoded data without a leading statement as that part of the program will never be executed, just read with peek. A problem with not having any basic statement is how to enter data that would be interpreted as tokens. Probably best to use a DATA or REM statement after all, or maybe use one of those where it is needed?

It would require a bit of cleverness in the coding of the assembler program to make sure that the “junk data” parts are unused or used for uninitialized data. Not super hard though as any decent assembler should make it possible to assign blocks of code to end up at different spaces, assign those “junk” block as containg uninitialized data and have the assembler spit out an error message if something incorrectly overlaps.

Bonus: There is a chance that a 1-2k basic program not uses every byte value 0-255. By wirting at least some kind of preliminary assembler program you could evaluate what data is actually needed to be possible to store, and perhaps ignore that some values at certain places might nog be writeable.

Extra bonus: I don’t know 6809 that well, but is there perhaps even a chance of writing code that mostly/only corresponds to writeable characters? (Compare with that this was the only way to be able to load/save machine code programs on a ZX81, stored as “random chars” in a REM statement, afaik also making some binary value impossible (whatever was used to indicade end of line)). I assume not, but might be worth investigating. Perhaps parts of the assembler program can be stored just as “binary text”, and some parts could “almost” be stored that way, i.e. “text” with some instructions for the basic program (or initial assembler program) to patch the binary to make it correct.

All this is making it harder and harder as the actual wanted program and the encoding kind of needs to be written in parallell, having dependencies on each other.

Loading...

Reply ↓
Erico Monteiro November 17, 2020 at 7:53 am

Very interesting, I will be keeping an eye on this.
I have a vague ghostly memory that one could type the inverted @ using a combinations of keys, or was it the inverted up arrow? I would not bet on this though.

Loading...

Reply ↓
1. Allen Huffman Post authorNovember 17, 2020 at 7:58 pm
  
  I don’t recall. I need to see what all types in lowercase. The CC1/2 keyboard had the arrow keys and one or two produced something on the screen, but I don’t know about anything else but shift.
  
  Loading...
  
  Reply ↓
Pingback: Compressing BASIC DATA with Base-64 – part 2 | Sub-Etha Software
Pingback: Compressing BASIC DATA with Base-64 – part 3 | Sub-Etha Software
Pingback: Exploring Atari VCS/2600 Adventure – part 4 | Sub-Etha Software

Sub-Etha Software

"In Support of the CoCo and OS-9 since 1990!"

Compressing BASIC DATA with Base-64 – part 1

Knowing our limitations

BASIC line packing

We can do better…

Reduce the bass! (Er, base…)

Like this:

Related

13 thoughts on “Compressing BASIC DATA with Base-64 – part 1”

Leave a Reply to William AstleCancel reply