Is it possible to store any PETSCII character in a DATA statement string in Commodore BASIC?

I would like to store some binary data in a BASIC program on the Commodore 64 as DATA statements. To save space, I'd prefer to store as a string, rather than as a sequence of numbers.
Is it possible to store any character, from CHR$(0) through CHR$(255), in a DATA statement, or are certain characters impossible to represent this way? What is the complete list of characters that cannot be represented in a DATA statement (if any)?
I'm particularly wondering about CHR$(0), double quote ("), newline and carriage return. If these can be represented, how?

Short answer: no, and you already said why: a double-quote character inside a string generates an error, and there are no quote-escape characters. For every other value you might be able to POKE bytes into your DATA statement strings and then never touch those lines again with the C64 BASIC editor, but the double quotes would kill you.
The best and fastest solution I've come up with is poor man's hex. It works like this (a short Python sketch follows the steps):
Take each binary byte and split it into its two hex digits: divide by 16 for the first digit and keep the remainder as the second.
For each hex digit, add 48 to its value.
Now you have two characters from the set (0,1,2,3,4,5,6,7,8,9,:,;,<,=,>,?) that represent one byte.
Those two characters go into your data statement string.
Reverse the process to read them back and POKE them out.
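For illustration, here is a minimal Python sketch of that mapping (the encode/decode helper names are mine, not standard; on the C64 you'd do the same arithmetic with ASC, MID$ and POKE in a READ loop):
DIGITS = "0123456789:;<=>?"   # CHR$(48) through CHR$(63), all DATA-safe
def encode(data):
    # Each byte becomes two characters: one per nibble, value + 48.
    out = []
    for b in data:
        out.append(DIGITS[b // 16])   # high nibble
        out.append(DIGITS[b % 16])    # low nibble (the remainder)
    return "".join(out)
def decode(text):
    # Reverse the process: two characters back into one byte.
    data = bytearray()
    for i in range(0, len(text), 2):
        data.append((ord(text[i]) - 48) * 16 + (ord(text[i + 1]) - 48))
    return bytes(data)
assert decode(encode(bytes(range(256)))) == bytes(range(256))
Every value from 0 to 255 round-trips, and no character outside CHR$(48)-CHR$(63) ever appears in the string.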

There is a way to do this: you can POKE bytes directly into RAM. It's a bit of a long way around, though, and you need to know where you're POKEing the bytes to. You could avoid the need for lots of zeros in your DATA statement, though, like this:
0 FOR I=0 TO 7
1 READ A(I)
2 NEXT I
3 PRINT A(0), A(4)
63998 PRINT "FIN"
63999 DATA ,,,,4,,7,8
We know that 2048 is the start of the BASIC area (unless you've moved the pointers), so at a guess, one could do this:
0 DATA" "," "," "," "," "
Then POKE around 2050 or 2051 with a character code you'd recognise, and LIST it. If you see the character appear between the double quotes, then you win. Of course, you then need to calculate each position between the quotes thereafter. When you're done, renumber the line and carry on programming. I'm not sure how you'd POKE a double quote in between double quotes, as there is no notion of escaping a string in Commodore BASIC as far as I know.
I'd personally just use numbers though.

I stored the following DATA statement, each element as a string, in a C64 program. I chose characters from CHR$(169) through CHR$(190) (not all of them), plus three with code points above 255.
100 data "©","ª","«","¬"," ","®","¯","¶","¼","½","¾","™","ח","⦁"
And I ran the following code:
10 FOR X=1 TO 14
20 READ A$
30 PRINT ASC(A$)
40 NEXT X
100 data "©","ª","«","¬"," ","®","¯","¶","¼","½","¾","™","ח","⦁"
The results were mixed. I knew it would not recognize anything above 255, but CHR$(173) printed as 32 instead:
RUN
169
170
171
172
32
174
175
182
188
189
190
?SYNTAX ERROR IN 100
READY.
I re-listed the program, and my DATA statement now looks like this:
100 DATA "©","ª","«","¬"," ","®","¯","¶","¼","½","¾",""","",""
Using another, more modern BASIC dialect written in the past few years, this was my output of CHR$ for 169 to 190:
The ASCII value of A is: 65
The ASCII value of A should be 65, like it is on a PC.
If it is not 65, then a conversion table must be loaded
and the results converted to match the PC so code
CHR$ VALUES
—————————————————
CHR$(169)=© CHR$(170)=ª CHR$(171)=« CHR$(172)=¬ CHR$(173)=­
CHR$(174)=® CHR$(175)=¯ CHR$(176)=° CHR$(177)=± CHR$(178)=²
CHR$(179)=³ CHR$(180)=´ CHR$(181)=µ CHR$(182)=¶ CHR$(183)=·
CHR$(184)=¸ CHR$(185)=¹ CHR$(186)=º CHR$(187)=» CHR$(188)=¼
CHR$(189)=½ CHR$(190)=¾
For C64 BASIC, you must either use a sequence of numbers, or use hex values and store the actual characters as I did in my original C64 DATA statement.
I don't know exactly how much space you think you are going to save, but it will be minimal at best, as the C64 can't go past CHR$(255).
With the other dialect I used, SmartBASIC, however, I went out past CHR$(20480).
I hope this helps.

Related

Encoding binary strings into arbitrary alphabets

If you have a set of binary strings limited to some normally small size, such as the 256 or 512 bits some of the hashing algorithms produce, and you want to encode those 1's and 0's into, say, hex (a 16-character alphabet), then you take the whole string into memory at once and convert it into hex. At least, that's what I think it means.
I don't have this question fully formulated, but what I'm wondering is whether you can convert an arbitrarily long binary string into some alphabet without needing to read the whole string into memory. The reason this isn't a fully formed question is that I'm not exactly sure whether you typically do read the whole string into memory to create the encoded version.
So if you have something like this:
1011101010011011011101010011011011101010011011110011011110110110111101001100101010010100100000010111101110101001101101110101001101101110101001101111001101111011011011110100110010101001010010000001011110111010100110110111010100110110111010100110111100111011101010011011011101010011011011101010100101010010100100000010111101110101001101101110101001101101111010011011110011011110110110111101001100101010010100100000010111101110101001101101101101101101101111010100110110111010100110110111010100110111100110111101101101111010011001010100101001000000101111011101010011011011101010011011011101010011011110011011110110110111101001100 ... 10^50 longer
Something like the whole genetic code, or a million billion times that, would be too large to read into memory, and too slow if you have to stream the whole thing through memory before you can figure out the final encoding.
So I'm wondering three things:
1. Whether you have to read something fully in order to encode it into some other alphabet.
2. If you do, why that is the case.
3. If you don't, how it works.
The reason I'm asking is that, looking at a string like 1010101, there are a few ways I could encode it as hex:
1. One character at a time, so it would essentially stay 1010101; if the alphabet were {a, b}, it would become abababa. This is the best case, because you never need more than one character in memory to figure out the encoding, but it limits you to a 2-character alphabet. (Anything more than a 2-character alphabet and I start getting confused.)
2. By turning it into an integer, then converting that into a hex value. But this would require reading the whole value to compute the final (big)integer. So that's where I get confused.
3. The third way, I feel, would be to read partial chunks of the input bits somehow, like 1010 then 010, but that doesn't work if the encoding is integer-based, because 1010 010 would be A 2 in hex, yet hex 2 decodes to 10, not 010. So it seems you would need each chunk to begin with a 1. But then what if you wanted each chunk no longer than 10 hex characters and you had a long run of 1000 0's? Then you would need some other trick, perhaps having the encoded hex value tell you how many preceding zeroes there are, and so on. It seems to get complicated, so I'm wondering whether there are established systems that have figured out how to do this. Hence the above questions.
For example, say I wanted to encode the above binary string into an 8-bit alphabet, like ASCII. Then I might get aBc?D4*&((!... Deserializing this back into bits is one part, and serializing the bits into it is another (these characters aren't the actual characters mapped to the bit example above).
"But then what if you wanted each chunk no longer than 10 hex characters and you had a long run of 1000 0's? Then you would need some other trick... It seems to get complicated, so I'm wondering whether there are established systems that have figured out how to do this."
Yes, you're way over-complicating it. To start simple, consider bit strings whose length is by definition a multiple of 4. They can be represented in hexadecimal just by grouping the bits into fours and remapping each group to a hexadecimal digit:
raw: 11011110101011011011111011101111
group: 1101 1110 1010 1101 1011 1110 1110 1111
remap: D E A D B E E F
So 11011110101011011011111011101111 -> DEADBEEF. That all the nibbles had their top bit set is a coincidence of the example chosen. By definition the input is divided into groups of four, and every hexadecimal digit is later decoded back to a group of four bits, including leading zeroes where applicable. This is all you need for typical hash codes, whose size is a multiple of 4 bits.
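For what it's worth, the grouping step is a one-liner in Python (bits_to_hex is a hypothetical helper, shown only to make the remapping concrete):
HEX = "0123456789ABCDEF"
def bits_to_hex(bits):
    # bits: a '0'/'1' string whose length is a multiple of 4
    return "".join(HEX[int(bits[i:i + 4], 2)] for i in range(0, len(bits), 4))
print(bits_to_hex("11011110101011011011111011101111"))   # DEADBEEF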
The problems start when we want to encode bit strings that are of variable length and not necessarily a multiple of 4 bits long. Then there has to be some padding somewhere, and the decoder needs to know how much padding there was (and where, but the location is a convention that you choose). This is why your example seemed so ambiguous: it is. Extra information needs to be added to tell the decoder how many bits to discard.
For example, leaving aside the mechanism that transmits the number of padding bits, we could encode 1010101 as 55 or A5 or AA (and more!), depending on where we choose to put the padding. Whichever convention we choose, the decoder needs to know that there is 1 bit of padding. To put that back in terms of bits, 1010101 could be encoded as any of these:
x101 0101
101x 0101
1010 x101
1010 101x
where x marks the bit that is inserted by the encoder and discarded by the decoder. The value of that bit doesn't actually matter, because it is discarded, so D5 is also a fine encoding, and so on.
All of these choices of where to put the padding still allow the bit string to be encoded incrementally, without storing the whole bit string in memory, though putting the padding in the first hexadecimal digit requires knowing the length of the bit string up front.
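As a sketch of that incremental encoding (the stream_to_hex generator and its trailing pad-count digit are my own convention, assumed only for illustration):
def stream_to_hex(bits):
    # Consume an iterable of '0'/'1' characters four at a time; only the
    # current partial group is ever held in memory.
    digits = "0123456789ABCDEF"
    buf = ""
    for bit in bits:
        buf += bit
        if len(buf) == 4:
            yield digits[int(buf, 2)]
            buf = ""
    pad = (4 - len(buf)) % 4
    if buf:
        yield digits[int(buf + "0" * pad, 2)]   # pad the last group on the right
    yield str(pad)   # tell the decoder how many trailing bits to discard
print("".join(stream_to_hex("1010101")))   # AA1: "AA" with 1 padding bit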
If you are asking this in the context of Huffman coding, you wouldn't want to calculate the length of the bit string in advance, so the padding has to go at the end. Often an extra symbol is added to the alphabet to signal the end of the stream, which usually makes it unnecessary to store the number of padding bits explicitly (there might be any number of them, but since they appear after the STOP symbol, the decoder automatically disregards them).

How can I compress/decompress a string into a 32-character alphanumeric string using Excel/VBA?

I hope I'm asking the right question, but if not I'd like to clarify what I'm trying to do.
I'd like to take a string of alphabetic text and convert it into a 32-character alphanumeric string, and then convert that 32-character string back to the original text. I only care about shortening the length; I don't care about security, and I don't care about maintaining the case sensitivity of the original string.
Is something like this even possible or am I trying to create bytes of data out of thin air?
In base 62 (62 different characters), a string of 32 characters can represent 62^32 different values, which comes out to about 2.27e+57.
An arbitrary string in your source alphabet of 28 characters (A-Z plus space and period) of length 39 could be any of 2.7489271e+56 different patterns. If the length is increased to 40 characters, there are 7.6969959e+57 different patterns.
What this means: the theoretical maximum length of an arbitrary string you could store is 39 characters. In practice, it would almost certainly be less than that. Note that the example string you provided in the comments is 55 characters long.
This is for an arbitrary string. Compression algorithms often look for patterns that can be compressed further. But that depends on the nature of the string being compressed.
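A minimal Python sketch of the arithmetic (the alphabet strings and the pack helper are illustrative assumptions; note it drops leading 'A's, which carry value 0, so a real scheme would also have to encode the length):
import math
SRC = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ."   # the 28-symbol source alphabet
DST = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(int(32 * math.log(62) / math.log(28)))   # 39: longest guaranteed fit
def pack(text):
    # Interpret text as a base-28 number, then rewrite it in base 62.
    n = 0
    for ch in text:
        n = n * 28 + SRC.index(ch)
    out = ""
    while n:
        n, r = divmod(n, 62)
        out = DST[r] + out
    return out or "0"
print(pack("HELLO WORLD."))   # well under the 32-character budget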

Printing out 178-character string in 270 characters of brainf*ck

I’m trying to print out a 178-character string in brainfuck. This wouldn’t be a problem except I’m limited to using 270 characters of brainfuck. I was thinking of hashing the 178-character string using a two-way hashing function, but I've been having trouble finding a solution that works. Here is the string: "Wikipedia is the best thing ever. Anyone in the world can write anything they want about any subject, so you know you are getting the best possible information." - Michael Scott.
Running the string straight-up in some ascii->brainfuck programs is giving me about 1,409 characters, far off from my target of 270. I think I should be able to create the brainfuck code with a string of about 60 characters. So my question is, is there any way to convert the above string to a string of 60 characters that can later be decoded back to the string?
It's most probably impossible. Brainfuck isn't magic. The current record for the shortest code to print the 13-character string "Hello, World!" is a full 78 bytes:
--<-<<+[+[<+>--->->->-<<<]>]<<--.<++++++.<<-..<<.<+.>>.>>.<<<.+++.>>.>>-.<<<+.
I suggest you read the post, but the TL;DR is that the tape is first initialized using a recurrence relation, and the program then pokes around the tape to print the appropriate characters.
78 bytes for thirteen characters, on a relatively simple string. That's 6 bytes per character. Using the same metric as a rough guide (and it is an underestimate), your string would take 1068 characters at a minimum. Given that the tape initialization occurs only once, and can be surprisingly small, you may (may) be able to get it down into the high 800s or 900s. Your string also happens to be more complex, with widely differing ASCII values, something even a recurrence relation is unlikely to fix. In any case, I have no example anywhere near that small.
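To see where numbers like 1,409 come from, here is a naive ascii-to-brainfuck generator sketched in Python (hypothetical and deliberately dumb; real converters are smarter, and the record programs much smarter still):
def naive_bf(text):
    # Adjust one cell from the previous character's value to the next
    # with +/-, then print it with '.'. No loops, no recurrences.
    code, prev = [], 0
    for ch in text:
        delta = ord(ch) - prev
        code.append(("+" * delta if delta > 0 else "-" * -delta) + ".")
        prev = ord(ch)
    return "".join(code)
print(len(naive_bf("Hello, World!")))   # 366, versus the 78-byte record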

Encoding name strings into an unique number

I have a large set of names (millions of them). Each has a first name, an optional middle name, and a last name. I need to encode these names into numbers that represent them uniquely. The encoding should be one-to-one: a name should be associated with only one number, and a number with only one name.
What is a smart way of encoding this? I know it is easy to tag each letter of the name by its position in the alphabet (a -> 1, b -> 2, and so on), so a name like Deepa would become 455161, but then I cannot tell whether the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the output numeral should have a fixed number of digits for any name, i.e., independent of the name's length. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
1. Sort them. Count them. The 10th name is number 10.
2. Treat each character as a digit in a base-26 (case-insensitive, no digits), base-52 (case-sensitive, no digits), base-36 (case-insensitive with digits) or base-62 (case-sensitive with digits) number, and compute the value in an int. E.g., for a name of "abc", you'd have 0*26^2 + 1*26^1 + 2*26^0. (Sometimes Chinese names may use digits to indicate tonality.)
3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
4. This one's mostly suggested in fun: use Gödel numbering :). So "abc" would be 2^0 * 3^1 * 5^2, a product of powers of primes. Factoring the number gives you back the characters. The numbers can get quite large, though.
5. Convert to ASCII, if you aren't already using it, then treat each character's ordinal as a digit in a base-256 numbering system. So "abc" is 97*256^2 + 98*256^1 + 99*256^0.
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 will work; #1 and #3 would have problems. #5 is probably the most future-proof, though you may find you need Unicode at some point.
I believe you could handle Unicode as a variant of #5, using powers of 2^32 instead of 2^8 = 256.
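As a sketch of option #5 in Python, whose integers are arbitrary-precision (the helper names are mine; for Unicode you would use base 2^32 per code point, as suggested above):
def name_to_number(name):
    # Each ASCII byte is a digit in a base-256 number.
    n = 0
    for b in name.encode("ascii"):
        n = n * 256 + b
    return n
def number_to_name(n):
    out = bytearray()
    while n:
        n, b = divmod(n, 256)
        out.append(b)
    return bytes(reversed(out)).decode("ascii")
num = name_to_number("Deepa")
print(num, number_to_name(num))   # a unique number, and "Deepa" back again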
What you are trying to do here is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try SHA-1, for example; it is well tested and available in modern languages (see http://en.wikipedia.org/wiki/Sha1), and it seems to be good enough for git, so it might work for you.
There is, of course, a small possibility of identical hash values for two different names, but that's always the case with hashing and can be taken care of. With SHA-1 and the like you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really need IDs that are guaranteed unique, you will have to do something like NealB suggested: create the IDs yourself and connect names and IDs in a database (you could create them randomly and check for collisions, or increment them, starting at 0000000000001 or so).
(improved answer after giving it some thought and reading the first comments)
You can use BigInteger to encode arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
    score = 0
    depth = 1
    for char in value:
        score += ord(char) * depth   # earlier characters carry more weight
        depth /= 256.                # each position is worth 1/256 of the previous one
    return score
If you are unfamiliar with Python, here's what it does:
The score is initially 0 and the depth is set to 1.
For every character, the ord value times the depth is added to the score.
The ord function returns the character's ordinal value (0-255 for single-byte characters).
Finally, the depth is divided by 256.
Essentially, the way it works is that the initial characters add more to the score, while later characters contribute less and less. If you need an integer, multiply the final score by 2**64; otherwise you will have a decimal value between 0 and 256. This encoding scheme also works for binary data, as there are only 256 possible values in a byte/char.
This method works great for smaller string values; for longer strings, however, you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java you can use BigDecimal, and in Python the 'decimal' module, for added precision. A bonus of this method is that the values returned are in sorted order, so they can be searched 'efficiently'.
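A quick check of that sorted-order claim, using hash_string as defined above (it holds as long as the floating-point precision does):
names = ["Abhishek", "Deepa", "Deepak", "Zara"]
print(sorted(names) == sorted(names, key=hash_string))   # True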
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it if every character (plus blank, at least) occupies a position. Then ABC, which is 1,2,3, translates to
1*(2*26+1)^2 + 2*(2*26+1) + 3 = 1*53^2 + 2*53 + 3
This way you can encode arbitrary strings, but if the length of the input isn't limited (and why should it be?), you aren't guaranteed an upper limit on the size of the number.

Encoding a 5 character string into a unique and repeatable 32bit Integer

I've not given this much thought yet, so this might turn out to be a silly question.
How can I take a unique 5-character ASCII string and convert it into a unique and reproducible (i.e., it needs to be the same every time) 32-bit integer?
Any ideas?
Assuming it is in fact ASCII (i.e., no characters with ordinal values greater than 127), you have five characters of 7 bits each, or 35 bits of information. There is no way to generate a 32-bit code from 35 bits that is guaranteed to be unique; you're missing three bits, so each code will also represent 7 other valid ASCII strings. However, you can make it very, very unlikely that you will ever see a collision by calculating the code carefully, so that input strings that are very similar get very different codes. I see another answer has suggested CRC-32. You could also use a hash function such as MD5 or SHA-1 and keep only the first 32 bits; this is probably best, because hash functions are designed specifically for this purpose.
If you can further constrain the values of the input string (say, only alphanumeric, no lowercase, no control characters, or something of the sort), you can probably eliminate that extra data and generate guaranteed unique 32-bit codes for each string.
If they're guaranteed to be alphanumeric only, and case-insensitive ([A-Z][0-9]) you can treat it as a base-36 number.
If all five characters belong to a set of 84 or fewer distinct characters, then you can squish five of them into a longword. Convert each character into a value 0..83, then:
intvalue = ((((char4*84 + char3)*84 + char2)*84 + char1)*84 + char0)
char0 = intvalue % 84;
char1 = (intvalue / 84) % 84;
char2 = (intvalue / (84*84)) % 84;
char3 = (intvalue / (84*84L*84)) % 84;
char4 = (intvalue / (84*84L*84*84)) % 84;
BTW, I wonder if anyone uses base-84 encoding as a standard; on many platforms it could be easier to handle than base-64, and the results would be more compact.
If you need to handle extended ASCII, you are out of luck, as you would need 5 full bytes, which is 40 bits. Even with non-extended characters (top bit unused), you are still out of luck, as you are trying to encode 35 bits of ASCII data into a 32-bit integer.
Extended ASCII goes from 0-255, which takes 8 bits; in 32 bits you have room for 4 of those, not 5. So, to make it short and sweet: you can't do this.
Even if you are willing to ignore the high-order range (values 128-255) and use only ASCII characters 0-127, at 7 bits per character you are still 3 bits short (7*5 = 35, and you only have 32 available).
One way is to treat the 5 characters as numerals in base N, where N is the number of characters in your alphabet (the set of allowed characters). From there on, it's just simple base conversion.
Given that you have 32 bits available and 5 characters to store, you can have floor((2^32)^(1/5)) = 84 characters in your alphabet.
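A small Python sketch of both the bound and the conversion (the particular 84-character alphabet is an arbitrary assumption for illustration):
N = int((2 ** 32) ** (1 / 5))
print(N, N ** 5 <= 2 ** 32 < (N + 1) ** 5)   # 84 True
ALPHABET = "".join(chr(c) for c in range(33, 117))   # any 84 distinct characters
def pack5(s):
    # Treat the 5 characters as digits of a base-84 number, high digit first.
    value = 0
    for ch in s:
        value = value * 84 + ALPHABET.index(ch)
    return value   # always < 84**5 = 4,182,119,424, so it fits in 32 bits
def unpack5(value):
    out = []
    for _ in range(5):
        value, digit = divmod(value, 84)
        out.append(ALPHABET[digit])
    return "".join(reversed(out))
v = pack5("HELLO")
print(v, unpack5(v) == "HELLO")   # round-trips exactly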
Assuming you only include basic ASCII, not extended ASCII (>127), you have 7 bits of information per character, so that's a bit of a problem: there are too many possibilities to create unique values for every string. However, the first 32 characters, as well as the last one, are control characters, and if you exclude those you're down to 95 characters.
You still have to cut 11 characters, though. Wikipedia has a nice chart of the characters in ASCII which you can use to determine which characters you need.
