Compressing BASIC DATA with Base-64 – part 2

See also: part 1, part 2, part 3 and part 4.

Today we will explore writing a standard base-64 converter in BASIC, and then see if we can make a smaller and faster (and nonstandard) Color-BASIC-specific one.

When we last left off, we were looking at ways to get as much encoded data on to a DATA statement as possible. Instead of using integer numbers (base-10) or hex values (base-16), we began exploring if we could increase the base and use more typeable character to encode the data.

Although it seems we could create a weird base-90 format using every typeable character except for quote (which we’d need to start a DATA line else we couldn’t use comma), the decoder would be much larger and have to do much more work, and we actually wouldn’t benefit since we really need numbers that round to specific numbers of bits:

Base-8 (octal) values can be represented by 3-bits (111). (Extended BASIC supports octal when you use &Oxx or just &xx.)
Base-16 (hexadecimal) values can be represented as 4-bits (1111). (Extended BASIC supports hexadecimal when you use &Hxx.)
Base-32 values would be represented as 5-bits (11111).
Base-64 values would be represented as 6-bits (111111).
Base-128 values would be represented as 7-bits (111111).

As you can see, a base-90 value isn’t a large enough range to give us an extra bit over base-64. We need to use bases that are nice multiples of the power of 2. Because of this, we’ll ignore a made-up base-90 and look at something a bit more standard, such as base-64 encoding.

Pump up the base

As previously discussed, natively, you can represent a number in a DATA statement as a base-10 value, or a hexadecimal value. Both of these are the value 32:

100 DATA 32,&H20

BASIC will READ them the same way, though hex values are much faster for BASIC to read and parse. Using native hex values like “&H20” is the fastest way to load DATA, but it is also the largest since every value has two extra characters (“&H”) in front.

A recent tip was given by Shaun Bebbington about how you can represent zero just by leaving it out between commas. It saves space, and the parse gets zero from this faster than if you put a zero there:

100 DATA 8,6,7,5,3,,9

But since we are trying to get as much DATA in there as possible, we don’t want to separate numbers by commas. We can pack all the 2-digit hex values together in a string then read that entire string and parse out the individual 2-digit hex values. That is more work, and slower, but gets more data per DATA line. Here are the values 0 to 15 in hex (00 to 0f):

100 DATA 000102030405060708090A0B0C0D0E0F

As previously demonstrated, this is the most efficient way to store HEX values. Even when we pad a low 0-15 value to make it two digits (1 represented by 01), it stills saves space over comma delimited values since no commas are used.

But each hex value is wasting 50% of the bits it takes to represent it. HEX values of 0-15 could be represented by four bits (0000 to 1111). We are storing them as one 8-bit character and thus achieving 50% storage efficiency.

We can do better by using a higher base-x value that can use those wasted bits. We want the highest value we can represent with typeable characters, which is 64 (since the next higher would be 128 and we don’t have a way to type 128 different characters on the CoCo).

Base-64

The standard Base-64 encoding uses the following 64 characters to represent values of 0 to 63:

ABCDEFGHIJKLMNOPQRSTUVWZYZabcdefghijklmnopqrstuvwxyz01234567890+/

Each base-64 character needs 6-bits to be represented (000000-111111).

Representing values that way only wastes 2 bits per character, rather than 4-bits like hex base-16 does:

ASCII HEX Chars.:    ASCII Base-64 Chars.:
      0    15              0    63
     "0"   "F"            "A"   "/"
     /       \            /       \
xxxx0000  xxxx1111   xx000000   xx111111

But, converting to and from base-64 is much trickier. Hex base-16 is as simple as this:

Hex “F0” -> F is 15 which is 1111 in binary. 0 is 0000 in binary. Thus the first character becomes the left four bits, and the second character becomes the right four bits. Super easy. Barely an inconvenience. Two ASCII bytes represent one byte of data.

But for base-64, we are dealing with 6-bits, and two of those won’t fit into an 8-bit byte. Instead, four base-64 6-bit values are merged together to make a 3-byte 24-bit value.

Base-64 “ABCD” (xx000000 xx000001 xx000010 xx000011) -> A is 0 which is 000000 in binary. B is 1 which is 000001 in binary. C is 2 which is 000010 in binary. D is 3 which is 000011 in binary. These values are merged together (removing the unused 2-bits in each one) and stored in 3 bytes as:

+- Byte 1 --+- Byte 2 --+- Byte 3 --+
| 000000|00 | 0001|0000 | 10|000011 |
| \__A__/\___B___/ \___C___/ \_D__/

Byte 1 contains 6 bits of base-64 value A and 2 bits of base-64 value B.
Byte 2 contains 4 bits of base-64 value B and 4-bits of base-64 value C.
Byte 3 contains 2 bits of base-64 value C and 6 bits of base-64 value D.

Well that’s a mess. Moving bits around like that is super easy under languages like C, but a bit more work in BASIC.

Encode this!

We will start with encoding a simple ASCII string into base-64 using a web tool:

https://www.base64encode.org

If you go to that link, you can type something in and then encode it into base-64. I typed:

Greetings from Sub-Etha Software! Do you know where your towel is?

And that gets encoded into this:

R3JlZXRpbmdzIGZyb20gU3ViLUV0aGEgU29mdHdhcmUhIERvIHlvdSBrbm93IHdoZXJlIHlvdXIgdG93ZWwgaXM/

Each character represents a 6-bit (0-63) value which we will have to combine into 8-bit values and decode.

An easy way to decode the characters used by base-64 encoding is with a string:

10 Z$="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

We can use Extended BASIC’s INSTR() function to match a character from the encoded string with a character in that string, and the position it is found in will the the value it represents (well, minus 1, since INSTR returns a base-1 value).

Here is an example that will display the bytes of the encoded string:

0 REM base64-1.bas
10 Z$="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
20 READ A$:PRINT A$
30 FOR A=1 TO LEN(A$)
40 PRINT INSTR(Z$,MID$(A$,A,1))-1;
50 NEXT
1000 REM BASE-64 DATA
1010 DATA R3JlZXRpbmdzIGZyb20gU3ViLUV0aGEgU29mdHdhcmUhIERvIHlvdSBrbm93IHdoZXJlIHlvdXIgdG93ZWwgaXM/

Running that shows me this:

If A is 0, then R should be 17, and that is what it prints first. Now we know we can get the values for each character in a base-64 encoded string.

Next we have to turn four 6-bit base-64 values into three bytes (24-bits). I am not sure what a good way to do this is, so I’ll just brute-force it and see how that works out.

First, I know that I need four base-64 values to make my 3 8-bit values, so I’ll modify my loop to skip every four values, and then add an inner loop to process the individual four base-64 values.

Inside that inner loop it will process the next four base-64 6-bit values and convert them into 3 8-bit values.

Here is what I came up with:

0 REM base64.bas
5 POKE65395,0
10 Z$="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
20 READ A$
30 FOR A=1 TO LEN(A$) STEP 4
35 REM GET 4 6-BIT VALUES
40 FOR B=0 TO 3:B(B)=INSTR(Z$,MID$(A$,A+B,1))-1
50 IFB(B)<0 THEN B(B)=0
60 NEXT
65 REM CONVERT TO 3 8-BIT
70 C1=INT(B(0)*INT(2^2)) OR INT(B(1)/INT(2^4))
80 C2=(B(1) AND &HF)*INT(2^4) OR B(2)/INT(2^2)
90 C3=(B(2) AND &H3)*INT(2^6) OR B(3)
100 PRINT CHR$(C1);CHR$(C2);CHR$(C3);
110 NEXT
120 END
1000 REM BASE-64 DATA
1010 DATA R3JlZXRpbmdzIGZyb20gU3ViLUV0aGEgU29mdHdhcmUhIERvIHlvdSBrbm93IHdoZXJlIHlvdXIgdG93ZWwgaXM/

I figured out all the 6-bit to 8-bit stuff (lines 70-90) with alot of trial and error, so I expect there is a faster and easier way to do this. But, then end results is a program that will print out the expected message, albeit really slowly.

A successful, but slow, decode of a base-64 encoded message.

One unexpected problem was with the powers of two — (2^2) and such. They produce rounding errors which caused some bits to be lost. I had to use INT() round them. That took me hours to figure out, but it’s just part of the inaccuracies of floating point values, especially limited ones like a 1970s BASIC used.

PROBLEM: Since the goal here is to put more data in DATA statements, the base-64 decode routine needs to be small. If it is 100 bytes larger than just using HEX, you have to save 100 bytes in DATA before you break even. The routine I give is not small and not fast. It would probably not be useful in the 10 LINE contest I mentioned. Maybe one of you can help improve it.

Now that we have a simple base-64 decoder, the next step will be making an encoder to turn DATA statement values into a base-64 string.

Until next time…

5 thoughts on “Compressing BASIC DATA with Base-64 – part 2”

MiaM November 19, 2020 at 5:26 am

I would had written out the 2 exponent values directly as 4, 16 and 64 rather than using INT(2^2).

The decdode-4-“chars”-to-3-bytes parts could use a modified base 64 thingy where you have the first 6 bits of three bytes in a row, and the fourth “char” contains the upper two bits for the previous three “chars”/bytes. Or the other way around, start with the “char” that contains the upper two bits for the following three chars and put that in a numerical variable. Then have a loop that runs three times. Each time first shift the “upper two bits” variable two steps left, i.e. multiply by 4, and then read a “char” and to that char OR the result of the “upper two bit” variable ANDed with 192 (=%110000).

That format would be incompatible with the standardized representation used by BASE 64, but it would indeed be a format with the same density and using the same characters.

Btw you could use the “BASE 90” as a way to slightly compress some data, by just having the values >63 represent two instances of the actual value minus 64. That might not save much, but perhaps worth investigating.

Loading...

Reply ↓
William Astle November 30, 2020 at 1:23 pm

For anyone wondering, the exponentiation operator actually calculates using some form of taylor series or something like that which means multiple multiplications and additions. MiaM’s suggestion of replacing the power calculations with constants will definitely result in a markedly faster end result.

Also, you might end up with something slightly faster by using multiple scalar variables instead of arrays. Parsing array subscripts isn’t as fast as you might hope due to the multiplication required to find the right element. Doing several extra linear scan steps on the scalar variable table may actually be faster and also you’ll save a couple of bytes and avoid parsing the array index as a floating point number. The speed gain would be most notable if the variables you use in the decoder are the first variables initialized by your program. Whether that improves the result substantially (especially from a size perspective) once you account for the INSTR bits is unclear. But you may possibly get a better result using VARPTR to obtain the address of the string data and using PEEK to obtain the characters in the string. At the very least that would make it faster due to a lot less expression evaluation for obtaining each character.

Loading...

Reply ↓
1. Allen Huffman Post authorNovember 30, 2020 at 1:28 pm
  
  When I looked at MID$ it appeared to make new strings each step, so I should benchmark that versus VARPTR.
  
  Loading...
  
  Reply ↓
Pingback: Compressing BASIC DATA with Base-64 – part 3 | Sub-Etha Software
Idioms January 9, 2022 at 3:07 am

Hi Allen,
I found your article very helpful and informative.
Thanks for share.

Loading...

Reply ↓