I’m trying to print out a 178-character string in brainfuck. This wouldn’t be a problem except I’m limited to using 270 characters of brainfuck. I was thinking of hashing the 178-character string using a two-way hashing function, but I've been having trouble finding a solution that works. Here is the string: "Wikipedia is the best thing ever. Anyone in the world can write anything they want about any subject, so you know you are getting the best possible information." - Michael Scott.
Running the string straight-up in some ascii->brainfuck programs is giving me about 1,409 characters, far off from my target of 270. I think I should be able to create the brainfuck code with a string of about 60 characters. So my question is, is there any way to convert the above string to a string of 60 characters that can later be decoded back to the string?
It's most probably impossible. Brainfuck isn't magic. The current record for the shortest code to print the 13-character string "Hello, World!" is a full 78 bytes:
--<-<<+[+[<+>--->->->-<<<]>]<<--.<++++++.<<-..<<.<+.>>.>>.<<<.+++.>>.>>-.<<<+.
I suggest you read the post, but the TL;DR is that the tape is first initialized with a recurrence relation, and the program then moves around the tape to print the appropriate characters.
78 bytes for thirteen characters, on a relatively simple string. That's 6 bytes per character. Using the same ratio as a rough guide (and this result is an underestimate), your 178-character string would take 1068 characters at a minimum. However, given that the tape initialization occurs only once and can be surprisingly small, you may (may) be able to get it down to the 800s or high 900s. Your string is also more complex and its ASCII values vary widely, something even a recurrence relation is unlikely to solve. I have no example that small, however.
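To see where figures in the thousands come from, here is a minimal Python sketch of the most naive possible converter. It is an assumption about how simple ASCII-to-brainfuck tools behave, not the code of any particular tool: it uses a single cell and no loops, and walks that cell from one character code to the next with runs of + or -. The name naive_bf is made up for illustration.

```python
def naive_bf(text: str) -> str:
    """Emit brainfuck that prints `text` using one cell and no loops."""
    code = []
    current = 0  # value of the single tape cell so far
    for ch in text:
        delta = ord(ch) - current
        # Move the cell up or down to the next character code, then print it.
        code.append(("+" if delta >= 0 else "-") * abs(delta))
        code.append(".")
        current = ord(ch)
    return "".join(code)


hello = naive_bf("Hello, World!")
print(len(hello))  # program size for the 13-character string
```

The output length is the sum of the jumps between consecutive character codes plus one . per character, so text that keeps switching between spaces (32) and lowercase letters (around 100) is expensive. Golfed programs such as the 78-byte one above avoid this by using loops and several cells.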
I have a string like this
ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8
The first part of the string is a random 18-digit number in base64 format, the second is a Unix timestamp, also in base64, and the last is an HMAC.
I want to make a model to recognize a string like this.
How may I do it?
While I did not necessarily think deeply about it, this would be what comes to my mind first.
You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for problems like this but may even be worse, depending on a given approach.
Here, an exact solution can be achieved, simply by understanding the problem.
One way people often go about matching strings with a certain structure is with so-called regular expressions (RegExp).
Regular expressions allow you to match string patterns of varying complexity.
To give a simple example in Python:
import re
your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
regexp_pattern = r"(.+)\.(.+)\.(.+)"
re.findall(regexp_pattern, your_string)
>>> [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]
Now, one problem with this is knowing where your string starts and stops. Most of the time there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that each string you wanted to match is preceded by the word Token: , you could include that in your RegExp pattern: r"Token: (.+)\.(.+)\.(.+)".
Another way to avoid mismatches is to define the pattern requirements more clearly. Right now we simply match any number of characters, with two . characters separating them into three sequences.
If you know which implementation of base64 is being used, you can limit the set of allowed characters from . (thus any) to the alphabet used by that implementation. In this example, say it is abcdefgh1234, so the pattern could be refined like this: r"([abcdefgh1234]+)\.([abcdefgh1234]+)\.(.+)".
The same applies to the HMAC code.
Furthermore, you could specify the allowed length of each substring.
For instance, you said you have 18 random digits. This would likely mean each is encoded as 1 byte, which would translate to 18*8 = 144 bits, which in base64, would translate to 24 tokens (where each encodes a sextet, thus 6 bits of information). The same could be done with the timestamp, assuming a 32 bit timestamp, this would likely necessitate 6 base64 tokens (representing 36 bits, 36 because you could not divide 32 into sextets).
With this information, you could further refine the pattern:
r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"
In addition, the same could be applied to the HMAC code.
I leave it to you to read a bit about RegExp but I'd guess it is the easiest solution and certainly more appropriate than any kind of machine learning.
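Putting the refinements above together, a minimal sketch could look like the following. The alphabet is an assumption: it uses the URL-safe base64 characters (A-Z, a-z, 0-9, _ and -) without padding, which happens to fit the sample string but is not confirmed by the question. The names B64, TOKEN_RE and parse_token are made up for illustration.

```python
import base64
import re

# Assumption: the token uses the URL-safe base64 alphabet without padding.
B64 = r"[A-Za-z0-9_-]"
# 24 characters for the 18 digits, 6 for the timestamp, then the HMAC.
TOKEN_RE = re.compile(rf"({B64}{{24}})\.({B64}{{6}})\.({B64}+)")


def parse_token(candidate: str):
    """Return the three parts if `candidate` has the expected shape, else None."""
    match = TOKEN_RE.fullmatch(candidate)
    return match.groups() if match else None


parts = parse_token("ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8")
if parts:
    # The first part should decode back to the 18 digits.
    print(base64.urlsafe_b64decode(parts[0]).decode("ascii"))
```

Using fullmatch means the whole candidate has to fit the pattern. To find tokens inside longer text, search or finditer with an anchor such as the Token: prefix mentioned earlier would be the natural variation.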
I would like to store some binary data in a BASIC program on the Commodore 64 as DATA statements. To save space, I'd prefer to store as a string, rather than as a sequence of numbers.
Is it possible to store any character, from CHR$(0) through CHR$(255), in a DATA statement, or are certain characters impossible to represent this way? What is the complete list of characters that cannot be represented in a DATA statement (if any)?
I'm particularly wondering about CHR$(0), double quote ("), newline and carriage return. If these can be represented, how?
Short answer: No. And you said why: the double-quote character inside a string generates an error, because there are no quote-escape characters. For every other value, you might be able to poke stuff into your DATA statement strings and then just never touch those lines again with the C64 BASIC editor, but the double quotes would kill you.
The best and fastest solution I've yet to think up is poor man's hex. It works like this:
Take each binary byte. Separate it into its two hex digits (divide by 16 for the first digit and keep the remainder for the second).
For each hex digit, take the binary value and add 48.
Now you have two characters in the set (0,1,2,3,4,5,6,7,8,9,:,;,<,=,>,?) that represent one byte.
Those two characters go into your data statement string.
Reverse the process to read them and poke them out.
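The steps above can be sketched in Python to check the scheme before porting it to BASIC. This is only an illustration of the encoding, not C64 code, and the function names are made up.

```python
def encode_poor_mans_hex(data: bytes) -> str:
    """Turn each byte into two characters in the range '0'..'?' (codes 48-63)."""
    out = []
    for byte in data:
        high, low = divmod(byte, 16)  # the two hex digits of the byte
        out.append(chr(48 + high))
        out.append(chr(48 + low))
    return "".join(out)


def decode_poor_mans_hex(text: str) -> bytes:
    """Reverse the encoding: two characters back into one byte."""
    return bytes(
        (ord(text[i]) - 48) * 16 + (ord(text[i + 1]) - 48)
        for i in range(0, len(text), 2)
    )


encoded = encode_poor_mans_hex(bytes([0, 255, 34]))
print(encoded)
print(list(decode_poor_mans_hex(encoded)))
```

Every output character has a code between 48 and 63, so the double quote (34) never appears in the encoded string. The cost is two characters per byte.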
There is a way to do this, you can POKE bytes directly into RAM. It's a bit of a long way around though, and you need to know where you're POKEing the bytes to. You could negate the need for lots of zeros in your DATA statement though, like this:
0 FOR I=0 TO 7
1 READ A(I)
2 NEXT I
3 PRINT A(0), A(4)
63998 PRINT "FIN"
63999 DATA ,,,,4,,7,8
We know that 2048 is the start of the BASIC area (unless you've moved the pointers), so at a guess, one could do this:
0 DATA" "," "," "," "," "
Then POKE around 2050 or 2051 with a character that you'd recognise and then list it. If you see the character added in between the double quotes then you win. Of course, then you need to calculate each position between the quotes thereafter. When you're done, renumber your line number and carry on programming. I'm not sure how you'd POKE a double quote in between a double quote as there is no notion of escaping a string in Commodore BASIC as far as I know.
I'd personally just use numbers though.
I have stored the following data statement, each element as a string, in a C64 program. I chose CHR$(172) - CHR$(190), and two above CHR$(4000).
100 data "©","ª","«","¬"," ","®","¯","¶","¼","½","¾","™","ח","⦁"
And I ran the following code:
10 FOR X=1 TO 14
20 READ A$
30 PRINT ASC(A$)
40 NEXT X
100 data "©","ª","«","¬"," ","®","¯","¶","¼","½","¾","™","ח","⦁"
The results were mixed. I knew it would not recognize anything above 255. But the CHR$(173) printed as a 32 instead:
RUN
169
170
171
172
32
174
175
182
188
189
190
?SYNTAX ERROR IN 100
READY.
I re-listed the program, and my DATA statement now looks like this:
100 DATA "©","ª","«","¬"," ","®","¯","¶","¼","½","¾",""","",""
Using another BASIC dialect, one more modern and written in the past few years, this was my output of the CHR$ for 172 to 190:
The ASCII value of A is: 65
The ASCII value of A should be 65, like it is on a PC.
If it is not 65, then a conversion table must be loaded
and the results converted to match the PC so code
CHR$ VALUES
—————————————————
CHR$(169)=© CHR$(170)=ª CHR$(171)=« CHR$(172)=¬ CHR$(173)=
CHR$(174)=® CHR$(175)=¯ CHR$(176)=° CHR$(177)=± CHR$(178)=²
CHR$(179)=³ CHR$(180)=´ CHR$(181)=µ CHR$(182)=¶ CHR$(183)=·
CHR$(184)=¸ CHR$(185)=¹ CHR$(186)=º CHR$(187)=» CHR$(188)=¼
CHR$(189)=½ CHR$(190)=¾
For C64 BASIC, you either must use the string of numbers, or you will have to use the HEX values and store the actual characters as I have done in my original C64 DATA statement.
I don't know exactly how much space you think you are going to save, but it will be minimal at best, as C64 can't go past CHR$(255).
However, in the other dialect I used, SmartBASIC, I went out past CHR$(20480).
I hope this helps.
I have a problem that I've been trying to solve for a few days now, but it seems I got stuck!
The problem is, given a piece of data, I need to generate an output that has to be:
Impossible for others to reproduce.
Unique.
Short (since it will be used in a URL)
So I've decided that I would sign the given data with a private key, and then hash the result. This way I believe the first two properties are covered.
Now, to get the final result as short as possible I started reading and came across this post, which talks about truncating a hash, and I quote:
As far as truncating a hash goes, that's fine. It's explicitly
endorsed by the NIST, and there are hash functions in the SHA-2 family
that are simple truncated variants of their full brethren:
SHA-256/224, SHA-512/224, SHA-512/256, and SHA-512/384, where SHA-x/y
denotes a full-length SHA-x truncated to y bits.
After reading a bit more about this, I decided to use SHA-256 and truncate it to 128 bits.
Since it has to be used in a URL, I encoded the final result in Base62 (Base64 uses the +, / and = characters, which have a different meaning in a URL).
The final result was a string of 24 characters, which I thought was good, but it seems it's still not good enough.
I need to get it even shorter (around 10 characters). I've been at this for a few days and I am starting to run out of ideas.
Does anyone have any suggestions? Is there something I can try?
Thank you very much!
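For reference, the hash-truncate-encode part of the approach described in this question can be sketched as follows. The signing step is left out: signed_data is assumed to be the already-signed bytes, and the Base62 digit order and the function names are assumptions made for illustration.

```python
import hashlib

# Assumed digit order for Base62; any fixed ordering of these 62 characters works.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(data: bytes) -> str:
    """Encode bytes as a Base62 string by treating them as one big integer."""
    number = int.from_bytes(data, "big")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def short_token(signed_data: bytes, bits: int = 128) -> str:
    """SHA-256 the input, keep the first `bits` bits, and Base62-encode them."""
    digest = hashlib.sha256(signed_data).digest()
    return base62_encode(digest[: bits // 8])


print(short_token(b"example signed payload"))
print(short_token(b"example signed payload", bits=56))
```

Ten Base62 characters can carry a little under 60 bits, so reaching that length would mean truncating to about 56 bits rather than 128, at the cost of a higher chance of collisions.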
I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.
For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.
I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They are either for comparing two strings or for getting some kind of storage-optimal result.
My objective, on the other hand, is to reduce the genome to its "words" instead of "letters", i.e., instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in DNA.
If I could just find that online I would skip doing it myself but I can't see this question answered before in terms I could uncover. To anyone who can point me in the right direction many thanks.
The byte pair encoding does something pretty close to what you want.
Rather than searching directly for the longest repeated string (top-down),
each pass of byte pair encoding searches for repeated byte pairs (bottom-up).
But eventually it discovers the longest repeated string(*).
gatcatcgatc
1=at g1c1cg1c
2=atc g22g2
3=gatc 2=atc 323
As you can see, it has found the longest repeated string "gatc".
(*) byte pair encoding either eventually finds the longest repeated string,
or else it stops early after making (2^8 - uniquechars(source) ) substitutions.
I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source) ) or 2^12 or 2^16.
Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.
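A minimal Python sketch of the idea follows. It is a simplified variant rather than a reference implementation of byte pair encoding: it works on a list of symbols instead of bytes, has no limit on the number of substitutions, and names its tokens #1, #2, ... as in the question. The function names are made up.

```python
from collections import Counter


def pair_encode(text: str):
    """Repeatedly replace the most frequent adjacent pair with a new token."""
    seq = list(text)
    table = {}  # token -> the pair of symbols it stands for
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more
            break
        token = f"#{len(table) + 1}"
        table[token] = pair
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(token)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, table


def expand(symbols, table) -> str:
    """Recursively expand tokens back into the original characters."""
    return "".join(
        expand(table[s], table) if s in table else s for s in symbols
    )


seq, table = pair_encode("gatcatcgatc")
print(seq)
print({token: expand([token], table) for token in table})
```

Each pass rescans the whole sequence, which is simple but slow on genome-sized input. The Stack Overflow question on optimizing byte-pair encoding listed below is about speeding this up.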
Wikipedia: byte pair encoding
Stack Overflow: optimizing byte-pair encoding
A few days ago, I asked why it's not possible to store binary data, such as a jpg file, in a string variable.
Most of the answers I got said that string is used for textual information such as what I'm writing now.
What is considered textual data though? Bytes of a certain nature represent a jpg file and those bytes could be represented by character byte values...I think. So when we say strings are for textual information, is there some sort of range or list of characters that aren't stored?
Sorry if the question sounds silly. Just trying to 'get it'
I see three major problems with storing binary data in strings:
Most systems assume a certain encoding within string variables - e.g. if it's a UTF-8, UTF-16 or ASCII string. New line characters may also be translated depending on your system.
You should watch out for restrictions on the size of strings.
If you use C style strings, every null character in your data will terminate the string and any string operations performed will only work on the bytes up to the first null.
Perhaps the most important: it's confusing - other developers don't expect to find random binary data in string variables. And a lot of code which works on strings might also get really confused when encountering binary data :)
I would prefer to store binary data as binary. You would only think of converting it to text when there's no other choice, since converting it to a textual representation wastes some bytes (not much, but it still counts). That's how attachments are put in email.
Base64 is a good textual representation of binary files.
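As a small illustration of both points (arbitrary bytes are not valid text, and Base64 gives a safe round trip), here is a sketch in Python. The six bytes are just a typical start of a JPEG file, chosen as an example.

```python
import base64

# A typical beginning of a JPEG file, used as sample binary data.
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

# Treating the bytes as UTF-8 text fails: 0xFF is never valid in UTF-8.
try:
    raw.decode("utf-8")
    decoded_as_text = True
except UnicodeDecodeError:
    decoded_as_text = False

# Base64 maps the bytes to plain ASCII characters and back without loss.
as_text = base64.b64encode(raw).decode("ascii")
restored = base64.b64decode(as_text)

print(decoded_as_text, len(raw), len(as_text), restored == raw)
```

The encoded form is about a third larger than the original (4 characters for every 3 bytes), which is the wasted space mentioned above.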
I think you are referring to the binary-to-text encoding issue. (Translating a jpg into a string would require that sort of pre-processing.)
Indeed, in that article, some characters are mentioned as not always supported, and others can be confusing:
Some systems have a more limited character set they can handle; not only are they not 8-bit clean, some can't even handle every printable ASCII character.
Others have limits on the number of characters that may appear between line breaks.
Still others add headers or trailers to the text.
And a few poorly-regarded but still-used protocols use in-band signaling, causing confusion if specific patterns appear in the message. The best-known is the string "From " (including trailing space) at the beginning of a line used to separate mail messages in the mbox file format.
Whoever told you you can't put 'binary' data into a string was wrong. A string simply represents an array of bytes that you most likely plan on using for textual data... but there is nothing stopping you from putting any data in there you want.
I do have to be careful though, because I don't know what language you are using... and in some languages \0 ends the string.
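In C#, you can put any data into a string... example:
byte[] myJpegByteArray = GetBytesFromSomeImage();
// Caveat: Encoding.ASCII is 7-bit, so any byte above 0x7F is decoded as '?'
// and the original data cannot be recovered from this string.
string myString = Encoding.ASCII.GetString(myJpegByteArray);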
Before internationalization, it didn't make much difference. ASCII characters are all bytes, so strings, character arrays and byte arrays ended up having the same implementation.
These days, though, strings are a lot more complicated, in order to deal with thousands of foreign language characters and the linguistic rules that go with them.
Sure, if you look deep enough, everything is just bits and bytes, but there's a world of difference in how the computer interprets them. The rules for "text" make things look right when it's displayed to a human, but the computer is free to monkey with the internal representation. For example,
In Unicode, there are many encoding systems. Changing between them makes every byte different.
Some languages have multiple characters that are linguistically equivalent. These could switch back and forth when you least expect it.
There are different ways to end a line of text. Unintended translations between CRLF and LF will break a binary file.
Deep down everything is just bytes.
Things like strings and pictures are defined by rules about how to order bytes.
C strings, for example, end in a byte with value 0 (the NUL terminator)
JPEGs don't
Depends on the language. For example, in Python 2 string types (str) are really byte arrays, so they can indeed be used for binary data. Python 3 separates text (str) from binary data (bytes).
In C the NUL byte is used for string termination, so a string cannot be used for arbitrary binary data, since binary data could contain NUL bytes.
In C# a string is an array of chars, and since a char is basically an alias for a 16-bit integer, you can probably get away with storing arbitrary binary data in a string. You might get errors when you try to display the string (because some values might not actually correspond to a legal Unicode character), and some operations like case conversions will probably fail in strange ways.
In short, it might be possible in some languages to store arbitrary binary data in strings, but they are not designed for this use, and you may run into all kinds of unforeseen trouble. Most languages have a byte-array type for storing arbitrary binary data.
I agree with Jacobus' answer:
In the end all data structures are made up of bytes. (Well, if you go even deeper: of bits). With some abstraction, you could say that a string or a byte array are conventions for programmers, on how to access them.
In this regard, the string is an abstraction for data interpreted as text. Text was invented for communication among humans; computers or programs do not communicate very well using text. SQL is textual, but it is an interface for humans to tell a database what to do.
So in general, textual data, and therefore strings, are primarily for human-to-human or human-to-machine interaction (say, for the content of a message box). Using them for something else (e.g. reading or writing binary image data) is possible, but carries a lot of risk because you are using the data type for something it was not designed to handle. This makes it much more error-prone. You may be able to store binary data in strings, but just because you are able to shoot yourself in the foot doesn't mean you should.
Summary: You can do it. But you'd better not.
Your original question (c# - What is string really good for?) made very little sense. So the answers didn't make sense, either.
Your original question said "For some reason though, when I write this string out to a file, it doesn't open." Which doesn't really mean much.
Your original question was incomplete, and the answers were misleading and confusing. You CAN store anything in a String. Period. The "strings are for text" answers were there because you didn't provide enough information in your question to determine what's going wrong with your particular bit of C# code.
You didn't provide a code snippet or an error message. That's why it's hard to 'get it' -- you're not providing enough details for us to know what you don't get.