See also: part 1, part 2, part 3, part 4, part 5, part 6, part 7, part 8 and part 9.
Size Matters. Or Space Matters. You decide.
Sometimes we want to optimize for code space, and other times for variable and string space. For example, if you want to create a 32 character string like this:
…you could either declare it as a static string:
Or build it programatically like this:
The second version takes about 16 bytes less of program space because the string is generated dynamically in string memory rather than being stored in the tokenized BASIC program.
Doing it the second way seems like a good idea, but keep in mind when you make this string, somewhere in string memory will be those 32 characters, PLUS you still have the BASIC statements that created it. It’s actually larger, overall, to do it this way.
BUT, any temporary strings like that might make sense to create on-the-fly as you need them since that memory can be reused by other strings.
10 A$="<------------------------------->" 20 PRINT A$:PRINT "MAIN MENU":PRINT A$ 30 INPUT "COMMAND";C$
In the above example, A$ points to that sequence of characters INSIDE the BASIC program itself. It is always there. But, if you generated the string only when needed, the memory used by A$ could be used for other purposes:
10 A$="<"+STRING$(30,"-")+">" 20 PRINT A$:PRINT "MAIN MENU":PRINT A$ 30 INPUT "COMMAND";A$
Above, A$ is allocated and turned in to the long 32 character string, printed, and then the memory used by A$ can be reused by INPUT. I suppose just setting it to A$=”” might give it back, too.
This would come with a speed penalty since the creation and destruction of strings takes more CPU time than just using a static string.
I think I may have also mentioned that, even if a string is part of the BASIC program, if you do anything to it, it has to duplicate it in string memory which creates a second copy of it:
10 A$="<------------------------------->" 20 A$=A$+"HELLO"
Above, A$ initially starts out pointing inside the program itself, taking up none of that string memory. At line 20, the entire A$ gets copied in to string memory and then the extra characters are added to it. At that point, that string is now using over twice the memory (program space plus string space).
Let’s try to prove that. The CLEAR command is used to reserve memory for strings. By default, 200 bytes are reserved. We can change that by doing CLEAR 0. Here is a program that has no string memory, yet it works because the string is inside the program space:
10 CLEAR 0 20 A$="<------------------------------->"
If you run this, you can PRINT A$ and prove it exists, but the moment you try to declare a second string like B$=”HELLO” or even manipulate A$ like A$=A$+”” you will get an Out of String Space errors (?OS ERROR):
Sometimes you choose speed over size, and sometimes you choose size over speed. Thus, you can optimize for speed (which we have been doing so far), or optimize for size.
But I digress.
Elementary, my dear DATA
Today I want to discuss DATA statements. In my assembly language series, I showed how the lwasm assembler can generate a small BASIC program that has the assembly code in DATA statements, and a small loader which will READ them and POKE them in to memory:
10 READ A,B 20 IF A=-1 THEN 70 30 FOR C = A TO B 40 READ D:POKE C,D 50 NEXT C 60 GOTO 10 70 END 80 DATA 16128,16167,142,63,14,166,128,39,6,173,159,160,2,32,246,57,84,104,105,115,32,105,115,32,97,32,115,101,99,114,101,116,32,109,101,115,115,97,103,101,46,0,-1,-1
DATA statements can contain base-10 numbers, base-16 hexadecimal numbers, or strings (and I guess base-8 octal numbers too, but who would do that?). This means you could have the data stored as numbers:
100 DATA 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 100 DATA &H0,&H2,&H3,&H4,&H5,&H6,&H7,&H8,&H9,&HA,&HB,&HC,&HD,&HE,&HF
…or as strings like “FE” or “1F” that you could READ and convert to hex numbers in the loader:
100 DATA 59,4F,55,20,4D,55,53,54,20,42,45,20,42,4F,52,45,44
When it comes to a size, hexadecimal numbers without the “&H” in front are always smaller than their base-10 equivalent. Single-digit decimal values 0-9 are single digit 0-9 in hex. Double-digit decimal values 10-15 are represented by single digit hexadecimal values A-F. Every time a value from 10-15 appears, representing it in decimal takes up twice as much space. And for three digit decimal values 100-255, those are two digit hex values 64-FF.
If you store the data as strings, like this:
100 DATA 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F 101 DATA 10,11,12,13,14,15,16,17,18,19,1A,1B,1C,1D,1E,1F
…you can read each string in, and convert it to a number by adding “&H” to the start and using the VAL() function:
For the decimal and hexadecimal versions, you just read it as a number:
The smallest version would be the string approach, since three digit numbers can be represented with two digits. But, doing the string conversion with VAL() makes it slower.
The fastest version would be using hexadecimal numbers since BASIC can parse hex values faster than base-10 numbers. But, this is the largest version since 255 in decimal (3 characters) or FF as a hex string (2 characters) would be represented as &HFF as a hex number (4 characters). Those numbers would take up twice as much space as the string version!
In the middle is base-10 numbers. It’s not the largest, or the smallest, or the fastest or the slowest. It makes an ideal compromise.
Let’s do a test. I have DATA statements representing values from 0 to 255. I have three versions: the first will use base-10 numbers, the second will use hexadecimal numbers, and the third will use strings that are just the hex part of the “&H” number.
Base 10 Numbers
0 REM DATADEC.BAS 10 TIMER=0:TM=TIMER 20 FOR A=0 TO 255 30 READ B 40 NEXT 50 PRINT TIMER-TM 60 END 100 DATA 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 101 DATA 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 102 DATA 32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47 103 DATA 48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63 104 DATA 64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79 105 DATA 80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95 106 DATA 96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111 107 DATA 112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127 108 DATA 128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143 109 DATA 144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159 110 DATA 160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175 111 DATA 176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191 112 DATA 192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207 113 DATA 208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223 114 DATA 224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239 115 DATA 240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
Hexadecimal Base-16 Numbers
0 REM DATAHEX.BAS 10 TIMER=0:TM=TIMER 20 FOR A=0 TO 255 30 READ B 40 NEXT 50 PRINT TIMER-TM 60 END 100 DATA &H0,&H1,&H2,&H3,&H4,&H5,&H6,&H7,&H8,&H9,&HA,&HB,&HC,&HD,&HE,&HF 101 DATA &H10,&H11,&H12,&H13,&H14,&H15,&H16,&H17,&H18,&H19,&H1A,&H1B,&H1C,&H1D,&H1E,&H1F 102 DATA &H20,&H21,&H22,&H23,&H24,&H25,&H26,&H27,&H28,&H29,&H2A,&H2B,&H2C,&H2D,&H2E,&H2F 103 DATA &H30,&H31,&H32,&H33,&H34,&H35,&H36,&H37,&H38,&H39,&H3A,&H3B,&H3C,&H3D,&H3E,&H3F 104 DATA &H40,&H41,&H42,&H43,&H44,&H45,&H46,&H47,&H48,&H49,&H4A,&H4B,&H4C,&H4D,&H4E,&H4F 105 DATA &H50,&H51,&H52,&H53,&H54,&H55,&H56,&H57,&H58,&H59,&H5A,&H5B,&H5C,&H5D,&H5E,&H5F 106 DATA &H60,&H61,&H62,&H63,&H64,&H65,&H66,&H67,&H68,&H69,&H6A,&H6B,&H6C,&H6D,&H6E,&H6F 107 DATA &H70,&H71,&H72,&H73,&H74,&H75,&H76,&H77,&H78,&H79,&H7A,&H7B,&H7C,&H7D,&H7E,&H7F 108 DATA &H80,&H81,&H82,&H83,&H84,&H85,&H86,&H87,&H88,&H89,&H8A,&H8B,&H8C,&H8D,&H8E,&H8F 109 DATA &H90,&H91,&H92,&H93,&H94,&H95,&H96,&H97,&H98,&H99,&H9A,&H9B,&H9C,&H9D,&H9E,&H9F 110 DATA &HA0,&HA1,&HA2,&HA3,&HA4,&HA5,&HA6,&HA7,&HA8,&HA9,&HAA,&HAB,&HAC,&HAD,&HAE,&HAF 111 DATA &HB0,&HB1,&HB2,&HB3,&HB4,&HB5,&HB6,&HB7,&HB8,&HB9,&HBA,&HBB,&HBC,&HBD,&HBE,&HBF 112 DATA &HC0,&HC1,&HC2,&HC3,&HC4,&HC5,&HC6,&HC7,&HC8,&HC9,&HCA,&HCB,&HCC,&HCD,&HCE,&HCF 113 DATA &HD0,&HD1,&HD2,&HD3,&HD4,&HD5,&HD6,&HD7,&HD8,&HD9,&HDA,&HDB,&HDC,&HDD,&HDE,&HDF 114 DATA &HE0,&HE1,&HE2,&HE3,&HE4,&HE5,&HE6,&HE7,&HE8,&HE9,&HEA,&HEB,&HEC,&HED,&HEE,&HEF 115 DATA &HF0,&HF1,&HF2,&HF3,&HF4,&HF5,&HF6,&HF7,&HF8,&HF9,&HFA,&HFB,&HFC,&HFD,&HFE,&HFF
String HEX Numbers
0 REM DATASTR.BAS 10 TIMER=0:TM=TIMER 20 FOR A=0 TO 255 30 READ B$:B=VAL("&H"+A$) 40 NEXT 50 PRINT TIMER-TM 60 END 100 DATA 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F 101 DATA 10,11,12,13,14,15,16,17,18,19,1A,1B,1C,1D,1E,1F 102 DATA 20,21,22,23,24,25,26,27,28,29,2A,2B,2C,2D,2E,2F 103 DATA 30,31,32,33,34,35,36,37,38,39,3A,3B,3C,3D,3E,3F 104 DATA 40,41,42,43,44,45,46,47,48,49,4A,4B,4C,4D,4E,4F 105 DATA 50,51,52,53,54,55,56,57,58,59,5A,5B,5C,5D,5E,5F 106 DATA 60,61,62,63,64,65,66,67,68,69,6A,6B,6C,6D,6E,6F 107 DATA 70,71,72,73,74,75,76,77,78,79,7A,7B,7C,7D,7E,7F 108 DATA 80,81,82,83,84,85,86,87,88,89,8A,8B,8C,8D,8E,8F 109 DATA 90,91,92,93,94,95,96,97,98,99,9A,9B,9C,9D,9E,9F 110 DATA A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,AA,AB,AC,AD,AE,AF 111 DATA B0,B1,B2,B3,B4,B5,B6,B7,B8,B9,BA,BB,BC,BD,BE,BF 112 DATA C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,CA,CB,CC,CD,CE,CF 113 DATA D0,D1,D2,D3,D4,D5,D6,D7,D8,D9,DA,DB,DC,DD,DE,DF 114 DATA E0,E1,E2,E3,E4,E5,E6,E7,E8,E9,EA,EB,EC,ED,EE,EF 115 DATA F0,F1,F2,F3,F4,F5,F6,F7,F8,F9,FA,FB,FC,FD,FE,FF
If we look at the size JUST the DATA statement lines take up (lines 100-115), here is the size breakdown:
- DATADEC.BAS – speed 78, size 1010
- DATAHEX.BAS – speed 49, size 1360
- DATASTR.BAS – speed 109, size 848
As you can see, using hex values is over twice as fast as using string versions and converting them to hex.
For size, using strings is about 15% smaller in my test program than using decimal values.
If load time is important, use hex. If program space is important, use strings. Otherwise, normal decimal values are a good compromise between speed and size.
One more thing… If we are going to use strings anyway, we could save more space by making the hex strings long, and parsing through them to pull out the individual hex values. Every number has to be two characters (00, 01, 02 … 0E, 0F) and this additional string parsing makes it even slower, but if code size is most important, try this:
0 REM DATASTR2.BAS 10 TIMER=0:TM=TIMER 20 FOR A=0 TO 15 30 READ B$:FOR I=1 TO 32 STEP 2:B=VAL("&H"+MID$(B$,I,2)):NEXT 40 NEXT 50 PRINT TIMER-TM 60 END 100 DATA 000102030405060708090A0B0C0D0E0F 101 DATA 101112131415161718191A1B1C1D1E1F 102 DATA 202122232425262728292A2B2C2D2E2F 103 DATA 303132333435363738393A3B3C3D3E3F 104 DATA 404142434445464748494A4B4C4D4E4F 105 DATA 505152535455565758595A5B5C5D5E5F 106 DATA 606162636465666768696A6B6C6D6E6F 107 DATA 707172737475767778797A7B7C7D7E7F 108 DATA 808182838485868788898A8B8C8D8E8F 109 DATA 909192939495969798999A9B9C9D9E9F 110 DATA A0A1A2A3A4A5A6A7A8A9AAABACADAEAF 111 DATA B0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF 112 DATA C0C1C2C3C4C5C6C7C8C9CACBCCCDCECF 113 DATA D0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF 114 DATA E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF 115 DATA F0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF
- DATASTR2.BAS – speed 172, size 624
By removing all those commas, it’s the smallest data size yet. And, since the longest line you can type* in BASIC is 249 characters…
…you could really back some data in to it.
Side Note: *The BASIC editor allows for 249 characters, but when you press ENTER, the line is tokenized. Keywords like PRINT get reduced to smaller tokens. You may have typed a five character keyword (taking up part of that 249 byte buffer), but when you press ENTER, that five characters may be converted to a one byte token. This means it’s possible for a BASIC line to contain more valid code than you could actually type. There have been utilities for BASIC (such as Carl England‘s CRUNCH) that do this, packing program lines as big as they can be, and making them un-editable since the moment you try, they get detokenized and you lose anything past 249 characters. We’ll have to discuss this in a later installment.
With that in mind, we could pack any type of DATA in to fewer lines and save a bit. Each line number takes up 6 bytes, so every line we can eliminate makes our program smaller.
Through some trail-and-error experimentation, I got this:
0 REM DATASTR3.BAS 10 TIMER=0:TM=TIMER 20 FOR A=0 TO 15 30 READ B$:IF B$="*" THEN 50 35 FOR I=1 TO LEN(B$) STEP 2:B=VAL("&H"+MID$(B$,I,2)):NEXT 40 NEXT 50 PRINT TIMER-TM 60 END 100 DATA000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F7071727374757677 107 DATA78797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEF 108 DATAF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF,*
- DATASTR2.BAS – speed 167, size 532
As you can see, this is slightly faster than the previous combined hex string version because it does less READs. It is also slightly smaller because it has less line numbers. And, I think, it could even be packed a bit more, but because I am loading these test programs as ASCII files in to XRoar, the lines cannot exceed 249 characters (the same as typing them in) so this was as much as I could fit on them (even though using EDIT on these lines shows I could still type about 6 more characters, but it only seemed to show me 5 more after I re-listed it).
Fun with DATA, eh?
Until next time, I leave you with this:
A virtual cookie goes to the first person that finds them.
There’s more to it than space–there’s time. Depending on how strings are represented internally, concatenation may require repeated scanning of the string as it’s built up. It’s true for NUL-terminated strings in C, and for the Haskell String type, which is a list of characters. “Can you say O(n^2)? Sure, I knew you could.”
The way around it in C is printf (or sprintf if you really need to store it rather than print it. in BASIC, if I’m just going to print it, I’ll just PRINT “” (OK, &H1E if you want :). I know printf() will keep track of where it is and not repeatedly scan, and I’d bet BASIC PRINT does as well.
Oops… The blog software made most of the PRINT statement go away. It should have had what you were assigning to A$ in the example, but with ; instead of the +.
Actually, that probably isn’t O(n^2), because you rescan once for each string, not each character in each string a la bubble sort–but at the very least it ups the constant factor, which, while it’s not considered significant for big-O calculations, still can be a significant performance hit. (That’s why, for example, a typical sort implementation switches from quicksort to a simpler sort once the hunks being sorted are sufficiently small.)
Color Basic stores strings with an 8 bit length and a pointer to the data. They’re actually binary clean as a result and there’s no need to scan the string to find the end. String concatenation is actually O(n) where n is the combined length of the two strings. (It has to copy both strings to a newly allocated string space.) String space allocation is O(1) unless garbage collection is triggered in which case it’s O(n^2) where n is the number of extant string descriptors (which may be larger than the number of string variables due to anonymous strings being on the “string stack” during expression evaluation. The string stack overflowing is what causes “string formula too complex”.).
Skipping the concatenation step and just joining with “;” (or nothing) in PRINT is faster simply because it doesn’t have to bother allocating new string space and doing the concatenation. Instead, PRINT just has to look up the various strings. Both concatenation and PRINT use an incrementing pointer to traverse strings since they both process entire strings. So, in actual fact PRINT with concatenation is O(2n) and PRINT with “;” is O(n) if you keep the constant factors since the total string length is scanned twice with the concatenation option and only once with the PRINT and “;” option.
Pingback: CoCo bouncing BASIC ball, part 5 | Sub-Etha Software