Sentence that uses every base64 character - base64

I am trying to construct a sentence or letter combination whose Base64 encoding contains every Base64 character, for use in unit testing, but I have failed to find one.
The unit tests I have so far fail to hit the lines that handle the + and / characters. While I can sling them at the encoder/decoder directly, it would be nice to have a human-readable source (the Base64 equivalent of 'the quick brown fox').

Here is a Base64 encoded test string that includes all 64 possible Base64 symbols:
char base64_encoded_test[] =
"U28/PHA+VGhpcyA0LCA1LCA2LCA3LCA4LCA5LCB6LCB7LCB8LCB9IHRlc3RzIEJhc2U2NCBlbmNv"
"ZGVyLiBTaG93IG1lOiBALCBBLCBCLCBDLCBELCBFLCBGLCBHLCBILCBJLCBKLCBLLCBMLCBNLCBO"
"LCBPLCBQLCBRLCBSLCBTLCBULCBVLCBWLCBXLCBYLCBZLCBaLCBbLCBcLCBdLCBeLCBfLCBgLCBh"
"LCBiLCBjLCBkLCBlLCBmLCBnLCBoLCBpLCBqLCBrLCBsLCBtLCBuLCBvLCBwLCBxLCByLCBzLg==";
char base64url_encoded_test[] =
"U28_PHA-VGhpcyA0LCA1LCA2LCA3LCA4LCA5LCB6LCB7LCB8LCB9IHRlc3RzIEJhc2U2NCBlbmNv"
"ZGVyLiBTaG93IG1lOiBALCBBLCBCLCBDLCBELCBFLCBGLCBHLCBILCBJLCBKLCBLLCBMLCBNLCBO"
"LCBPLCBQLCBRLCBSLCBTLCBULCBVLCBWLCBXLCBYLCBZLCBaLCBbLCBcLCBdLCBeLCBfLCBgLCBh"
"LCBiLCBjLCBkLCBlLCBmLCBnLCBoLCBpLCBqLCBrLCBsLCBtLCBuLCBvLCBwLCBxLCByLCBzLg==";
It decodes to a string composed entirely of relatively human-readable text:
char test_string[] = "So?<p>"
"This 4, 5, 6, 7, 8, 9, z, {, |, } tests Base64 encoder. "
"Show me: @, A, B, C, D, E, F, G, H, I, J, K, L, M, "
"N, O, P, Q, R, S, T, U, V, W, X, Y, Z, [, \\, ], ^, _, `, "
"a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s.";
This decoded string contains only characters in the limited range of isprint()'able 7-bit ASCII characters (space through '~').
Since I did it, I would argue that it is possible :-).
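
One way to double-check the claim is to decode the encoded string and see which alphabet symbols it is missing. The following is a minimal Python sketch, not part of the original answer; the helper name missing_symbols is made up for illustration, and the encoded text is copied from the C array above.

```python
import base64
import string

# Encoded test string copied from the answer above.
ENCODED = (
    "U28/PHA+VGhpcyA0LCA1LCA2LCA3LCA4LCA5LCB6LCB7LCB8LCB9IHRlc3RzIEJhc2U2NCBlbmNv"
    "ZGVyLiBTaG93IG1lOiBALCBBLCBCLCBDLCBELCBFLCBGLCBHLCBILCBJLCBKLCBLLCBMLCBNLCBO"
    "LCBPLCBQLCBRLCBSLCBTLCBULCBVLCBWLCBXLCBYLCBZLCBaLCBbLCBcLCBdLCBeLCBfLCBgLCBh"
    "LCBiLCBjLCBkLCBlLCBmLCBnLCBoLCBpLCBqLCBrLCBsLCBtLCBuLCBvLCBwLCBxLCByLCBzLg=="
)

# The 64 symbols of the standard Base64 alphabet (padding '=' excluded).
STANDARD_ALPHABET = set(
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def missing_symbols(encoded: str) -> set:
    """Return the alphabet symbols that never occur in the encoded text."""
    return STANDARD_ALPHABET - set(encoded.rstrip("="))


decoded = base64.b64decode(ENCODED)
print(decoded.decode("ascii"))           # the human-readable source text
print(missing_symbols(ENCODED))          # set() when all 64 symbols are present
print(base64.urlsafe_b64encode(decoded)[:8])  # b'U28_PHA-' (base64url variant)
```

The leading "So?<p>" is what produces the / and + symbols, so a test built on this string exercises those branches as well.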

You probably can't do that.
/ in base64 encodes 111111 (6 '1' bits).
As all ASCII characters (which are the type-able and printable characters) are in the range 0-127 (i.e. between 00000000 and 01111111), the only ASCII character that could be encoded using '/' is the ASCII character with the code 127, which is the non-printable DEL character.
If you allow values higher than 127, you could have a printable but non-typeable string.

When testing encode/decode, this is the one place where I break the rule of unit testing a single method at a time. You can have separate methods for encoding and decoding, but the only way to tell if you're doing it correctly is to use both encoding and decoding in a single assert. I would use the following pseudocode.
Generate a random string using Path.GetRandomFileName(); this string is cryptographically strong
Pass the string to the encode method
Pass the output of the encode to the decode method
Assert.AreEqual(input from GetRandomFileName, output from Decode)
You can loop over this as many times as you want in order to say it's tested. You can also cover some specific cases; however, since encoding (sometimes) differs based on the positioning of the letters, you're better off going with a random string and just calling encode/decode about 50 or so times.
If you find that encoding/decoding fails in accepted scenarios, create unit tests for those and filter out the strings that contain those characters/character combinations. Also, document those failures in XMLDocs comments, code comments, and any documentation your app has.
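
The round-trip idea above can be sketched in Python as follows. This is only an illustration under a few assumptions: the standard library's base64 module stands in for the encoder/decoder under test, os.urandom stands in for Path.GetRandomFileName() as the source of random input, and the function names are hypothetical.

```python
import base64
import os


def round_trip(data: bytes) -> bytes:
    """Encode then decode; a correct codec returns the input unchanged."""
    return base64.b64decode(base64.b64encode(data))


def check_round_trips(trials: int = 50) -> None:
    """Round-trip random inputs of every length from 0 to trials - 1."""
    for length in range(trials):
        original = os.urandom(length)  # cryptographically strong random bytes
        assert round_trip(original) == original, original.hex()


check_round_trips()
print("all round trips matched")
```

Stepping through consecutive lengths covers inputs that need two, one, and no padding characters.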

What I came up with may prove useful. It needs to be entered exactly as is. I include a link to a screenshot showing all the usually invisible characters below, as well as the Base64 data string to which it converts and a table of statistics for each of the 64 characters therein.
<HTML><HEAD></HEAD><BODY><PRE>
Did
THE
THE QUICK BROWN FOX
jump
over
the
lazy
dogs
or
was
he
pushed
?
</PRE><B>hmm.</B></BODY><HTML>
ÿß®Þ~c*¯/
This encodes to the Base64 string:
PEhUTUw+PEhFQUQ+PC9IRUFEPjxCT0RZPjxQUkU+DQpEaWQJDQoNCiBUSEUJDQoNCiAgVEhFIFFVSUNLIEJST1dOIEZPWAkNCg0KICAganVtcAkNCg0KICAgIG92ZXIJDQoNCiAgICAgdGhlCQ0KDQogICAgICBsYXp5CQ0KDQogICAgICAgZG9ncwkNCg0KICAgICAgICBvcgkNCg0KICAgICAgICAgd2FzCQ0KDQogICAgICAgICAgaGUJDQoNCiAgICAgICAgICAgcHVzaGVkCQ0KDQogICAgICAgICAgICA/CQ0KDQo8L1BSRT48Qj5obW0uPC9CPjwvQk9EWT48SFRNTD4NCg0KDQoNCg0KDQoNCg//367efmMqry/==
which contains
5--/'s
4--+'s
3--='s
14--0's
3--1's
3--2's
2--3's
4--4's
3--5's
2--6's
2--7's
4--8's
6--9's
5--a's
27--A's
2--b's
5--B's
5--c's
4--C's
4--d's
14--D's
2--e's
10--E's
2--f's
8--F's
36--g's
6--G's
5--h's
2--H's
5--i's
30--I's
5--j's
6--J's
8--k's
12--K's
2--l's
3--L's
2--m's
4--M's
3--n's
14--N's
13--o's
2--O's
3--p's
9--P's
2--q's
24--Q's
2--r's
5--R's
2--s's
6--S's
2--t's
7--T's
2--u's
1--U's
3--v's
6--V's
4--w's
5--W's
3--x's
6--X's
2--y's
4--Y's
3--z's
5--Z's

Related

base64.decode: Invalid encoding before padding

I'm working on a Flutter project and I'm currently getting an error with some of the strings I try to decode using the base64.decode() method. I've created a short Dart program that reproduces the problem with a specific string:
import 'dart:convert';
void main() {
final message = 'RU5UUkVHQUdSQVRJU1==';
print(utf8.decode(base64.decode(message)));
}
I'm getting the following error message:
Uncaught Error: FormatException: Invalid encoding before padding (at character 19)
RU5UUkVHQUdSQVRJU1==
I've tried decoding the same string with JavaScript and it works fine. I would be glad if someone could explain why I am getting this error, and possibly show me a solution. Thanks.
Base64 encoding takes binary data three full bytes (24 bits) at a time, breaks each group into four 6-bit segments, and represents each segment as a printable ASCII character. It does that in essentially two steps.
The first step is to break the binary string down into 6-bit blocks. Base64 only uses 6 bits (corresponding to 2^6 = 64 characters) to ensure encoded data is printable. Apart from + and /, none of the special characters available in ASCII are used.
The 64 characters (hence the name Base64) are 10 digits, 26 lowercase characters, 26 uppercase characters as well as the Plus sign (+) and the Forward Slash (/). There is also a 65th character known as a pad, which is the Equal sign (=). This character is used when the final group of input contains fewer than 24 bits, that is, one or two bytes instead of three.
So RU5UUkVHQUdSQVRJU1== doesn't follow the encoding pattern.
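
To make the 6-bit grouping concrete, here is a small Python sketch that encodes one group of up to three bytes by hand. It is an illustration only; encode_group is a hypothetical helper, not a library function, and the result is compared against Python's own base64 module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_group(chunk: bytes) -> str:
    """Encode 1 to 3 bytes into a single 4-character Base64 group."""
    if not 1 <= len(chunk) <= 3:
        raise ValueError("a group holds between one and three bytes")
    # Write the bytes as one bit string, then zero-fill to a multiple of 6.
    bits = "".join(format(byte, "08b") for byte in chunk)
    bits += "0" * (-len(bits) % 6)
    # Map each 6-bit block to its alphabet symbol.
    symbols = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Pad with '=' so every group is four characters long.
    return symbols + "=" * (4 - len(symbols))


print(encode_group(b"ENT"))  # RU5U
print(encode_group(b"S"))    # Uw==
print(encode_group(b"S") == base64.b64encode(b"S").decode("ascii"))  # True
```

Note that the zero-fill step is what makes the final symbol of a short group end in zero bits.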
Use Underline character "_" as Padding Character and Decode With Pad Bytes Deleted
dart:convert's base64.decode rejects this string when it is padded with =, reporting the "invalid encoding before padding" error. This happens even if you use the package's own padding method base64.normalize, which pads the string with the correct padding character =.
= is indeed the correct padding character for base64 encoding. It is used to fill out base64 strings when fewer than 24 bits are available in the input group. See RFC 4648, Section 4.
RFC 4648 Section 5, the URL- and filename-safe variant, does not change the padding character; it replaces + and / with - and _. base64.decode accepts that alphabet too, so it treats the underline character _ as an ordinary symbol (value 63) rather than as padding.
Filling the string out to a multiple of four with _ therefore lets base64.decode run without error.
Each appended _ yields one extra byte at the end of the decoded list. To further decode the bytes as UTF-8, you will need to delete those bytes or you will get an "Invalid UTF-8 byte" error.
See the code below. Here is the same code as a working dartpad.dev example.
import 'dart:convert';
void main() {
  // String message = 'RU5UUkVHQUdSQVRJU1=='; // as of dart 2.18.2 this will generate an "invalid encoding before padding" error
  // String message = base64.normalize('RU5UUkVHQUdSQVRJU1'); // will also generate same error
  String message = 'RU5UUkVHQUdSQVRJU1';
  print("Encoded String: $message");
  print("Decoded String: ${decodeB64ToUtf8(message)}");
}
String decodeB64ToUtf8(String message) {
  final String padded = padBase64(message); // pad with underline => ('RU5UUkVHQUdSQVRJU1__')
  final List<int> dec = base64.decode(padded);
  // remove the bytes produced by the appended underlines (one byte each);
  // count only the appended characters, since '_' may also occur as data
  final int padCount = padded.length - message.length;
  return utf8.decode(dec.sublist(0, dec.length - padCount));
}
String padBase64(String rawBase64) {
  // append '_' until the length is a multiple of four
  return (rawBase64.length % 4 > 0)
      ? rawBase64 + '_' * (4 - (rawBase64.length % 4))
      : rawBase64;
}
The string RU5UUkVHQUdSQVRJU1== is not a compliant base 64 encoding according to RFC 4648, which in section 3.5, "Canonical Encoding," states:
The padding step in base 64 and base 32 encoding can, if improperly
implemented, lead to non-significant alterations of the encoded data.
For example, if the input is only one octet for a base 64 encoding,
then all six bits of the first symbol are used, but only the first
two bits of the next symbol are used. These pad bits MUST be set to
zero by conforming encoders, which is described in the descriptions
on padding below. If this property do not hold, there is no
canonical representation of base-encoded data, and multiple base-
encoded strings can be decoded to the same binary data. If this
property (and others discussed in this document) holds, a canonical
encoding is guaranteed.
In some environments, the alteration is critical and therefore
decoders MAY chose to reject an encoding if the pad bits have not
been set to zero. The specification referring to this may mandate a
specific behaviour.
(Emphasis added.)
Here we will manually go through the base 64 decoding process.
Taking your encoded string RU5UUkVHQUdSQVRJU1== and performing the mapping from the base 64 character set (given in "Table 1: The Base 64 Alphabet" of the aforementioned RFC), we have:
R      U      5      U      U      k      V      H      Q      U      d      S      Q      V      R      J      U      1      =      =
010001 010100 111001 010100 010100 100100 010101 000111 010000 010100 011101 010010 010000 010101 010001 001001 010100 110101 ______ ______
(using ______ to represent the padding characters).
Now, grouping these by 8 instead of 6, we get
01000101 01001110 01010100 01010010 01000101 01000111 01000001 01000111 01010010 01000001 01010100 01001001 01010011 0101____ ________
E        N        T        R        E        G        A        G        R        A        T        I        S        P
The important part is at the end, where there are some non-zero bits followed by padding. The Dart implementation is correctly determining that the padding provided doesn't make sense, given that the last four bits of the previous character do not decode to zeros.
As a result, the decoding of RU5UUkVHQUdSQVRJU1== is ambiguous. Is it ENTREGAGRATIS or ENTREGAGRATISP? It's precisely this reason why the RFC states, "These pad bits MUST be set to zero by conforming encoders."
In fact, because of this, I'd argue that an implementation that decodes RU5UUkVHQUdSQVRJU1== to ENTREGAGRATIS without complaint is problematic, because it's silently discarding non-zero bits.
The RFC-compliant encoding of ENTREGAGRATIS is RU5UUkVHQUdSQVRJUw==.
The RFC-compliant encoding of ENTREGAGRATISP is RU5UUkVHQUdSQVRJU1A=.
This further highlights the ambiguity of your input RU5UUkVHQUdSQVRJU1==, which matches neither.
I suggest you check your encoder to determine why it's providing you with non-compliant encodings, and make sure you're not losing information as a result.
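
The pad-bit check described above can be reproduced with a short Python sketch. This is an illustration rather than how Dart implements it; leftover_pad_bits is a hypothetical helper that returns the bits of the last symbol which a decoder would throw away.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def leftover_pad_bits(encoded: str) -> str:
    """Return the trailing bits that do not fill a whole byte."""
    symbols = encoded.rstrip("=")
    bits = "".join(format(ALPHABET.index(ch), "06b") for ch in symbols)
    # Everything past the last complete byte is a pad bit.
    return bits[len(bits) - len(bits) % 8:]


print(leftover_pad_bits("RU5UUkVHQUdSQVRJU1=="))  # 0101 -> non-zero, not canonical
print(leftover_pad_bits("RU5UUkVHQUdSQVRJUw=="))  # 0000 -> canonical
print(base64.b64encode(b"ENTREGAGRATIS"))         # b'RU5UUkVHQUdSQVRJUw=='
print(base64.b64encode(b"ENTREGAGRATISP"))        # b'RU5UUkVHQUdSQVRJU1A='
```

A strict decoder in the spirit of RFC 4648 section 3.5 would reject any input for which the leftover bits contain a 1.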

Cannot decode using utf-8 after encoding with utf-8

I had to store data as utf-8, and now when I fetch the data and call decode('utf-8') on it, it simply does not work. Consider the line below as an example:
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
You can simply copy the line below to convert the string above to the human readable format:
b"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87".decode("utf-8")
However, I could not find a way to convert the string to a bytestring without corrupting it. I tried the following methods, but all of them failed:
.decode("utf-8")
.decode()
.bytes()
Up to this point I could not find a solution on SO or elsewhere. I'd appreciate any help.
\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87
b'\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87'
The above lines (both given in the question) are particular instances of String and Bytes literals (respectively):
\xhh Character with hex value hh (2, 3)
2 Unlike in Standard C, exactly two hex digits are
required.
3 In a bytes literal, hexadecimal and octal escapes denote
the byte with the given value. In a string literal, these escapes
denote a Unicode character with the given value.
Let's check the string defined in such a way (inside Python prompt):
>>> xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
>>> xstr
'\r\nساÙ\x82Û\x8câ\x80\x8cÙ\x86اÙ\x85Ù\x87'
>>> print( xstr)
ساÙÛâÙاÙ
Ù
>>>
Apparently, the print(xstr) output does not resemble a word in any known language. However, all its characters belong (by definition) to the Unicode range r'[\u0000-\u00ff]', i.e. the first 256 characters of Unicode, and voilà: that is exactly iso-8859-1, aka 'latin1'.
We need to get an encoded version of the xstr string as a bytes object, e.g. using the str.encode method or the built-in bytes() function, and then decode those bytes as UTF-8:
print( bytes(xstr,'latin1').decode()); print(xstr.encode("latin1").decode())
ساقی‌نامه
ساقی‌نامه
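
The same repair can be wrapped in two small helpers, sketched below. The function names are made up for illustration. The second helper covers a case the answer does not discuss and is an assumption about how the data may have been stored: the text holds literal backslash characters (for example, after being read from a file) instead of the characters they describe.

```python
def fix_mojibake(text: str) -> str:
    """Reinterpret a str whose code points are really UTF-8 byte values."""
    return text.encode("latin1").decode("utf-8")


def decode_literal_escapes(text: str) -> str:
    """Handle text that contains literal backslash-x escape sequences."""
    # unicode_escape turns each \xhh sequence into the code point hh ...
    unescaped = text.encode("ascii").decode("unicode_escape")
    # ... which leaves the same mojibake as above, so repair it the same way.
    return fix_mojibake(unescaped)


xstr = "\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"
raw = r"\x0d\x0a\xd8\xb3\xd8\xa7\xd9\x82\xdb\x8c\xe2\x80\x8c\xd9\x86\xd8\xa7\xd9\x85\xd9\x87"

print(fix_mojibake(xstr))
print(decode_literal_escapes(raw) == fix_mojibake(xstr))  # True
```

Which helper applies depends on whether len() of the stored value counts one character per byte (first case) or four characters per byte (second case).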

String Index Error (Julia)

I'm a Julia newbie. When I was testing out the language, I got this error.
First of all, I define the String b as "he§y".
Julia seems to behave strangely when I have "special" characters in a String...
When I try to get the third character of b (it's supposed to be '§'), everything is OK.
However, when I try to get the fourth character of b (it's supposed to be 'y'), a "StringIndexError" is thrown.
I don't believe the compiler could throw you the error. Do you mean a runtime error?
I know nothing about the Julia language, but the symptoms suggest that string indexing is based not on code points but on some encoding.
The Julia documentation seems to support my hypothesis:
https://docs.julialang.org/en/stable/manual/strings/
The built-in concrete type used for strings (and string literals) in Julia is String. This supports the full range of Unicode characters via the UTF-8 encoding. (A transcode function is provided to convert to/from other Unicode encodings.)
...
Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
Edit: The following is quoted from the Julia documentation; the example demonstrates the exact "problem" you are facing.
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
Whether these Unicode characters are displayed as escapes or shown as
special characters depends on your terminal's locale settings and its
support for Unicode. String literals are encoded using the UTF-8
encoding. UTF-8 is a variable-width encoding, meaning that not all
characters are encoded in the same number of bytes. In UTF-8, ASCII
characters – i.e. those with code points less than 0x80 (128) – are
encoded as they are in ASCII, using a single byte, while code points
0x80 and above are encoded using multiple bytes – up to four per
character. This means that not every byte index into a UTF-8 string is
necessarily a valid index for a character. If you index into a string
at such an invalid byte index, an error is thrown:
julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> s[2]
ERROR: StringIndexError("∀ x ∃ y", 2)
[...]
julia> s[3]
ERROR: StringIndexError("∀ x ∃ y", 3)
Stacktrace:
[...]
julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
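
To see why index 4 of "he§y" is invalid, it helps to list the byte offsets at which each character starts in UTF-8. The sketch below does this in Python purely as an illustration of the concept; it is not Julia code, and char_start_byte_indices is a hypothetical helper.

```python
def char_start_byte_indices(text: str) -> list:
    """Return the 1-based UTF-8 byte offset at which each character starts."""
    starts = []
    offset = 1  # Julia indexes from 1, so mirror that here
    for ch in text:
        starts.append(offset)
        offset += len(ch.encode("utf-8"))  # 1 to 4 bytes per character
    return starts


print(char_start_byte_indices("he§y"))     # [1, 2, 3, 5]
print(char_start_byte_indices("∀ x ∃ y"))  # [1, 4, 5, 6, 7, 10, 11]
```

'§' takes two bytes, so byte index 4 falls in the middle of it and 'y' starts at byte index 5, which matches the error seen in the question.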

Representing a non-numeric string as an integer in Julia

I'm looking for a simple and invertible way to represent a Julia string by an integer (e.g. for cryptography). To be clear, I'm not considering string representations of integers like "123", but arbitrary strings like "Hello". The representation doesn't need to be human-readable, but it needs to be easily invertible back to a unique string (so not a hash). It doesn't need to be efficient; I'm just looking for something as simple as possible. (Also, it's fine if it only works on a small character set, e.g. lowercase Roman letters.)
One naive way would be to collect the string into a vector of chars, parse(Int, _) each char to an integer, and concatenate the integers. But this seems cumbersome, and I suspect that there's a built-in Julia function (or small composition of functions) that will get the job done more easily.
If your strings only use the numbers 0-9 and letters a-z and A-Z, then you can parse the string directly as a base 62 BigInt:
julia> s = randstring(123)
"RFXkzD6VpWcwvbsxOtdTxS4DGcgciKgDXECa9fEK0Djcdkcj5N75vIHEMVyuH9mcYgvFbLhbPdrKyPIO4JsK1DKgZIacov6WKDZdIpGJ5iJ15dpjmcCBCybMmxB"
julia> i = parse(BigInt, s, base=62)
12798646956721889529517502411501433963894611324020956397632780092623456213685688389093681112679380669903728068303911743800989012987014660454736389459814982802097607808640628339365945710572579898457023165244164689548286133
julia> string(i, base=62)
"RFXkzD6VpWcwvbsxOtdTxS4DGcgciKgDXECa9fEK0Djcdkcj5N75vIHEMVyuH9mcYgvFbLhbPdrKyPIO4JsK1DKgZIacov6WKDZdIpGJ5iJ15dpjmcCBCybMmxB"
I created a (somewhat complicated) implementation that works for ASCII strings:
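# Treat each ASCII character as one base-128 digit (Int overflows beyond 9 characters)
stringToInt(str::String) = sum(i -> Int(str[end-i]) * 128^i, 0:length(str)-1)
function intToString(m::Int)
    chars = Char[]
    while m > 0
        m, d = divrem(m, 128)       # peel off the lowest base-128 digit
        pushfirst!(chars, Char(d))  # digits come out least significant first
    end
    String(chars)
end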
Let me know if you can think of a better one.
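
For comparison, the same positional idea can be sketched in Python, where integers are arbitrary precision and the UTF-8 bytes of the string serve as base-256 digits. This is an illustration of the technique, not a Julia solution, and the function names are hypothetical. One stated limitation: leading NUL characters are lost, just as leading zeros are in any positional notation.

```python
def string_to_int(text: str) -> int:
    """Interpret the UTF-8 bytes of text as one big-endian integer."""
    return int.from_bytes(text.encode("utf-8"), "big")


def int_to_string(number: int) -> str:
    """Invert string_to_int by writing the integer back out as bytes."""
    length = (number.bit_length() + 7) // 8  # bytes needed to hold the value
    return number.to_bytes(length, "big").decode("utf-8")


n = string_to_int("Hello")
print(hex(n))            # 0x48656c6c6f
print(int_to_string(n))  # Hello
```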

How to replace several characters in a string using Julia

I'm essentially trying to solve this problem: http://rosalind.info/problems/revc/
I want to replace all occurrences of A, C, G, T with their complements T, G, C, A. In other words, all A's will be replaced with T's, all C's with G's, and so on.
I had previously used the replace() function to replace all occurrences of 'T' with 'U' and was hoping that the replace function would take a list of characters to replace with another list of characters but I haven't been able to make it work, so it might not have that functionality.
I know I could solve this easily using the BioJulia package and have done so using the following:
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
using Bio.Seq
s = dna"AAAACCCGGT"
t = reverse(complement(s))
println("$t")
But I'd like to not have to rely on the package.
Here's the code I have so far, if someone could steer me in the right direction that'd be great.
# creating complementary strand of DNA
# reverse the string
# find the complementary nucleotide
s = open("nt.txt") # open file containing sequence
t = reverse(s) # reverse the sequence
final = replace(t, r'[ACGT]', '[TGCA]') # this is probably incorrect
# replace characters ACGT with TGCA
println("$final")
It seems that replace doesn't yet do translations quite like, say, tr in Bash. So instead, here are a couple of approaches that use a dictionary mapping (the BioJulia package also appears to make similar use of dictionaries):
compliments = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
Then if str = "AAAACCCGGT", you could use join like this:
julia> join([compliments[c] for c in str])
"TTTTGGGCCA"
Another approach could be to use a function and map:
function translate(c)
compliments[c]
end
Then:
julia> map(translate, str)
"TTTTGGGCCA"
Strings are iterable objects in Julia; each of these approaches reads one character in turn, c, and passes it to the dictionary to get back the complementary character. A new string is built up from these complementary characters.
Julia's strings are also immutable: you can't swap characters around in place, rather you need to build a new string.
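
The dictionary approach, combined with the reversal that the Rosalind problem also asks for, can be sketched in Python as follows. This is an illustration of the same technique in another language, not Julia code; the names are made up for the example.

```python
# Complement of each nucleotide.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(dna: str) -> str:
    """Build a new string by looking up each base in the mapping."""
    return "".join(COMPLEMENT[base] for base in dna)


def reverse_complement(dna: str) -> str:
    """Complement the strand, then reverse it."""
    return complement(dna)[::-1]


print(complement("AAAACCCGGT"))          # TTTTGGGCCA
print(reverse_complement("AAAACCCGGT"))  # ACCGGGTTTT
```

As in the Julia versions, no string is modified in place; each step produces a new string.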

Resources