Today we will explore writing a standard base-64 converter in BASIC, and then see if we can make a smaller and faster (and nonstandard) Color-BASIC-specific one.
When we last left off, we were looking at ways to get as much encoded data on to a DATA statement as possible. Instead of using integer numbers (base-10) or hex values (base-16), we began exploring if we could increase the base and use more typeable character to encode the data.
Although it seems we could create a weird base-90 format using every typeable character except for quote (which we’d need to start a DATA line else we couldn’t use comma), the decoder would be much larger and have to do much more work, and we actually wouldn’t benefit since we really need numbers that round to specific numbers of bits:
- Base-8 (octal) values can be represented by 3-bits (111). (Extended BASIC supports octal when you use &Oxx or just &xx.)
- Base-16 (hexadecimal) values can be represented as 4-bits (1111). (Extended BASIC supports hexadecimal when you use &Hxx.)
- Base-32 values would be represented as 5-bits (11111).
- Base-64 values would be represented as 6-bits (111111).
- Base-128 values would be represented as 7-bits (111111).
As you can see, a base-90 value isn’t a large enough range to give us an extra bit over base-64. We need to use bases that are nice multiples of the power of 2. Because of this, we’ll ignore a made-up base-90 and look at something a bit more standard, such as base-64 encoding.
Pump up the base
As previously discussed, natively, you can represent a number in a DATA statement as a base-10 value, or a hexadecimal value. Both of these are the value 32:
100 DATA 32,&H20
BASIC will READ them the same way, though hex values are much faster for BASIC to read and parse. Using native hex values like “&H20” is the fastest way to load DATA, but it is also the largest since every value has two extra characters (“&H”) in front.
A recent tip was given by Shaun Bebbington about how you can represent zero just by leaving it out between commas. It saves space, and the parse gets zero from this faster than if you put a zero there:
100 DATA 8,6,7,5,3,,9
But since we are trying to get as much DATA in there as possible, we don’t want to separate numbers by commas. We can pack all the 2-digit hex values together in a string then read that entire string and parse out the individual 2-digit hex values. That is more work, and slower, but gets more data per DATA line. Here are the values 0 to 15 in hex (00 to 0f):
100 DATA 000102030405060708090A0B0C0D0E0F
As previously demonstrated, this is the most efficient way to store HEX values. Even when we pad a low 0-15 value to make it two digits (1 represented by 01), it stills saves space over comma delimited values since no commas are used.
But each hex value is wasting 50% of the bits it takes to represent it. HEX values of 0-15 could be represented by four bits (0000 to 1111). We are storing them as one 8-bit character and thus achieving 50% storage efficiency.
We can do better by using a higher base-x value that can use those wasted bits. We want the highest value we can represent with typeable characters, which is 64 (since the next higher would be 128 and we don’t have a way to type 128 different characters on the CoCo).
The standard Base-64 encoding uses the following 64 characters to represent values of 0 to 63:
Each base-64 character needs 6-bits to be represented (000000-111111).
Representing values that way only wastes 2 bits per character, rather than 4-bits like hex base-16 does:
ASCII HEX Chars.: ASCII Base-64 Chars.: 0 15 0 63 "0" "F" "A" "/" / \ / \ xxxx0000 xxxx1111 xx000000 xx111111
But, converting to and from base-64 is much trickier. Hex base-16 is as simple as this:
- Hex “F0” -> F is 15 which is 1111 in binary. 0 is 0000 in binary. Thus the first character becomes the left four bits, and the second character becomes the right four bits. Super easy. Barely an inconvenience. Two ASCII bytes represent one byte of data.
But for base-64, we are dealing with 6-bits, and two of those won’t fit into an 8-bit byte. Instead, four base-64 6-bit values are merged together to make a 3-byte 24-bit value.
- Base-64 “ABCD” (xx000000 xx000001 xx000010 xx000011) -> A is 0 which is 000000 in binary. B is 1 which is 000001 in binary. C is 2 which is 000010 in binary. D is 3 which is 000011 in binary. These values are merged together (removing the unused 2-bits in each one) and stored in 3 bytes as:
+- Byte 1 --+- Byte 2 --+- Byte 3 --+ | 000000|00 | 0001|0000 | 10|000011 | | \__A__/\___B___/ \___C___/ \_D__/
- Byte 1 contains 6 bits of base-64 value A and 2 bits of base-64 value B.
- Byte 2 contains 4 bits of base-64 value B and 4-bits of base-64 value C.
- Byte 3 contains 2 bits of base-64 value C and 6 bits of base-64 value D.
Well that’s a mess. Moving bits around like that is super easy under languages like C, but a bit more work in BASIC.
We will start with encoding a simple ASCII string into base-64 using a web tool:
If you go to that link, you can type something in and then encode it into base-64. I typed:
Greetings from Sub-Etha Software! Do you know where your towel is?
And that gets encoded into this:
Each character represents a 6-bit (0-63) value which we will have to combine into 8-bit values and decode.
An easy way to decode the characters used by base-64 encoding is with a string:
We can use Extended BASIC’s INSTR() function to match a character from the encoded string with a character in that string, and the position it is found in will the the value it represents (well, minus 1, since INSTR returns a base-1 value).
Here is an example that will display the bytes of the encoded string:
0 REM base64-1.bas 10 Z$="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/" 20 READ A$:PRINT A$ 30 FOR A=1 TO LEN(A$) 40 PRINT INSTR(Z$,MID$(A$,A,1))-1; 50 NEXT 1000 REM BASE-64 DATA 1010 DATA R3JlZXRpbmdzIGZyb20gU3ViLUV0aGEgU29mdHdhcmUhIERvIHlvdSBrbm93IHdoZXJlIHlvdXIgdG93ZWwgaXM/
Running that shows me this:
If A is 0, then R should be 17, and that is what it prints first. Now we know we can get the values for each character in a base-64 encoded string.
Next we have to turn four 6-bit base-64 values into three bytes (24-bits). I am not sure what a good way to do this is, so I’ll just brute-force it and see how that works out.
First, I know that I need four base-64 values to make my 3 8-bit values, so I’ll modify my loop to skip every four values, and then add an inner loop to process the individual four base-64 values.
Inside that inner loop it will process the next four base-64 6-bit values and convert them into 3 8-bit values.
Here is what I came up with:
0 REM base64.bas 5 POKE65395,0 10 Z$="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/" 20 READ A$ 30 FOR A=1 TO LEN(A$) STEP 4 35 REM GET 4 6-BIT VALUES 40 FOR B=0 TO 3:B(B)=INSTR(Z$,MID$(A$,A+B,1))-1 50 IFB(B)<0 THEN B(B)=0 60 NEXT 65 REM CONVERT TO 3 8-BIT 70 C1=INT(B(0)*INT(2^2)) OR INT(B(1)/INT(2^4)) 80 C2=(B(1) AND &HF)*INT(2^4) OR B(2)/INT(2^2) 90 C3=(B(2) AND &H3)*INT(2^6) OR B(3) 100 PRINT CHR$(C1);CHR$(C2);CHR$(C3); 110 NEXT 120 END 1000 REM BASE-64 DATA 1010 DATA R3JlZXRpbmdzIGZyb20gU3ViLUV0aGEgU29mdHdhcmUhIERvIHlvdSBrbm93IHdoZXJlIHlvdXIgdG93ZWwgaXM/
I figured out all the 6-bit to 8-bit stuff (lines 70-90) with alot of trial and error, so I expect there is a faster and easier way to do this. But, then end results is a program that will print out the expected message, albeit really slowly.
One unexpected problem was with the powers of two — (2^2) and such. They produce rounding errors which caused some bits to be lost. I had to use INT() round them. That took me hours to figure out, but it’s just part of the inaccuracies of floating point values, especially limited ones like a 1970s BASIC used.
PROBLEM: Since the goal here is to put more data in DATA statements, the base-64 decode routine needs to be small. If it is 100 bytes larger than just using HEX, you have to save 100 bytes in DATA before you break even. The routine I give is not small and not fast. It would probably not be useful in the 10 LINE contest I mentioned. Maybe one of you can help improve it.
Now that we have a simple base-64 decoder, the next step will be making an encoder to turn DATA statement values into a base-64 string.
Until next time…