IMAP search command with UTF-8 charset in C#

C# Imap search command with special characters like á,é
I am trying to implement the logic described in the post above in C# to run non-ASCII searches against Gmail. After logging in successfully to imap.gmail.com, I have the following exchange with the server:
(C -> S) Encoding.Default.GetBytes("A4 UID SEARCH CHARSET UTF-8 TEXT {4}\r\n");
(C <- S) "+ go ahead\r\n"
(C -> S) Encoding.Default.GetBytes("αβγδ\r\n");
(C <- S) "* SEARCH 72\r\nA2 OK SEARCH completed (Success)"
However, the email denoted by the server's response is completely irrelevant to the search term I provided. This only happens when the keywords contain non-ASCII characters, so I believe I have something wrong with the encoding.
I have also tried using Encoding.ASCII, but then I get search results that are even further off target.
What is the proper way to send the string literal "αβγδ\r\n"?

For the search term, you are using a so-called literal. The length of a literal has to be specified in octets, which is not the case in your example: the string "αβγδ" encoded in UTF-8 takes eight octets (two per character), not four.
So you should encode the search term first and then send the length of the encoded bytes to the server.
I don't know much about C#, so here is an example in Python:
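# send() and read_until() are placeholders for your own connection helpers (not shown)
search_term = 'Grüße'
encoded_search_term = search_term.encode('UTF-8')  # encode first
length = str(len(encoded_search_term)).encode('ascii')  # length in octets, not characters
send(b'. UID SEARCH CHARSET UTF-8 TEXT {' + length + b'}\r\n')
read_until(br'^\+ .*$')  # wait for the continuation request
send(encoded_search_term + b'\r\n')
read_until(br'^\. OK .*$')  # wait for the tagged completion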
With this code, the search command returns the UIDs of the emails with the text "Grüße":
C: b'. UID SEARCH CHARSET UTF-8 TEXT {7}\r\n'
S: b'+ Ready for literal data\r\n'
C: b'Gr\xc3\xbc\xc3\x9fe\r\n'
S: b'* SEARCH 1 3 4\r\n'
S: b'. OK UID SEARCH completed\r\n'
If I use the length in characters (len(search_term)) instead of the encoded length in octets (len(encoded_search_term)), the IMAP server reports an error:
C: b'. UID SEARCH CHARSET UTF-8 TEXT {5}\r\n'
S: b'+ Ready for literal data\r\n'
C: b'Gr\xc3\xbc\xc3\x9fe\r\n'
S: b'. BAD expected end of data instead of "\\237e"\r\n'
Note, I didn't use Gmail for my tests.
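Applied to the search term from the question, the same rule can be sketched as a small helper. This is only an illustration: build_search is a made-up name, the tag A4 is copied from the question's transcript, and nothing here talks to a real server.

```python
def build_search(tag, term):
    """Return (command line, literal payload) for a UID SEARCH with a UTF-8 literal."""
    payload = term.encode("utf-8")  # encode first ...
    # ... then announce the length in octets, not in characters
    command = f"{tag} UID SEARCH CHARSET UTF-8 TEXT {{{len(payload)}}}\r\n".encode("ascii")
    return command, payload + b"\r\n"


command, literal = build_search("A4", "αβγδ")
print(command)  # the announced length is 8: each Greek letter takes two octets
print(literal)
```

The command line goes out first; the payload is sent only after the server answers with its "+" continuation line.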

Related

How do I decode a dictionary of bytes to utf-8?

I'm trying to figure out how to convert the values of a dictionary from bytes to strings as the backend only supports primitive types.
oledata = {
'macros': macros,
'data': analysis
}
s = str(oledata)
save_data_to_s3(json.dumps(s), ['olevba3'])
As you can see, the values of this dict are bytes. This code runs without errors on my test sample, but the output has the b' prefix in front of the values (data), which will break the database. Dicts also have no decode() method, which is why I used str(), but that must be the wrong approach since the values still come out with the b' prefix. Which leads to my general question: how do you decode the values of a dictionary to UTF-8 strings?

my_str = b"Hello" # the b prefix marks a bytes object
new_str = my_str.decode('utf-8') # Decode using the utf-8 encoding
print(new_str)
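To apply this to every value of a dictionary, a comprehension is one option. A minimal sketch, assuming the values are UTF-8 encoded bytes; decode_values and the sample data are made up for illustration:

```python
import json


def decode_values(d, encoding="utf-8"):
    """Return a copy of d in which every bytes value is decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in d.items()
    }


oledata = {"macros": b"Sub AutoOpen()", "data": b"analysis result"}  # sample values
decoded = decode_values(oledata)
print(json.dumps(decoded))  # plain strings, no b'...' prefixes
```

Passing the decoded dict straight to json.dumps, instead of wrapping it in str() first, also avoids serializing the dict's repr as one big string.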

String Index Error (Julia)

I'm a Julia newbie. When I was testing out the language, I got this error.
First of all, I define a String b as "he§y".
Julia seems to behave strangely when I have "special" characters in a String...
When I try to get the third character of b (it's supposed to be '§'), everything is OK.
However, when I try to get the fourth character of b (it's supposed to be 'y'), a "StringIndexError" is thrown.

I don't believe the compiler would give you this error. Do you mean a runtime error?
I know nothing about the Julia language, but the symptoms suggest that string indexing is based not on code points but on the bytes of some encoding.
The Julia documentation seems to support my hypothesis:
https://docs.julialang.org/en/stable/manual/strings/
The built-in concrete type used for strings (and string literals) in Julia is String. This supports the full range of Unicode characters via the UTF-8 encoding. (A transcode function is provided to convert to/from other Unicode encodings.)
...
Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings.
Edit: Quoted from the Julia documentation, here is an example demonstrating the exact "problem" you are facing.
julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"
Whether these Unicode characters are displayed as escapes or shown as
special characters depends on your terminal's locale settings and its
support for Unicode. String literals are encoded using the UTF-8
encoding. UTF-8 is a variable-width encoding, meaning that not all
characters are encoded in the same number of bytes. In UTF-8, ASCII
characters – i.e. those with code points less than 0x80 (128) – are
encoded as they are in ASCII, using a single byte, while code points
0x80 and above are encoded using multiple bytes – up to four per
character. This means that not every byte index into a UTF-8 string is
necessarily a valid index for a character. If you index into a string
at such an invalid byte index, an error is thrown:
julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)
julia> s[2]
ERROR: StringIndexError("∀ x ∃ y", 2)
[...]
julia> s[3]
ERROR: StringIndexError("∀ x ∃ y", 3)
Stacktrace:
[...]
julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
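The byte-offset arithmetic behind this can be reproduced outside Julia. The following Python sketch (the function name is made up) lists the 1-based byte offset at which each character of a UTF-8 string starts; these are the positions the quoted documentation describes as valid indices.

```python
def char_start_indices(s):
    """1-based byte offsets at which each character of s starts in UTF-8."""
    starts = []
    position = 1
    for ch in s:
        starts.append(position)
        position += len(ch.encode("utf-8"))  # 1 to 4 bytes per character
    return starts


print(char_start_indices("∀ x ∃ y"))  # [1, 4, 5, 6, 7, 10, 11]
print(char_start_indices("he§y"))     # [1, 2, 3, 5]
```

For "he§y", the '§' occupies bytes 3 and 4, so index 4 falls in the middle of a character, which would explain the error, and 'y' starts at byte 5.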

Lua gmatch odd characters (Slovak alphabet)

I am trying to extract the characters from a string of a word in Slovak. For example, the word for "TURTLE" is "KORYTNAČKA". However, it skips over the "Č" character when I try to extract it from the string:
local str = "KORYTNAČKA"
for c in str:gmatch("%a") do print(c) end
--result: K,O,R,Y,T,N,A,K,A
I am reading this page and I have also tried just pasting in the string itself as a set, but it comes up with something weird:
local str = "KORYTNAČKA"
for c in str:gmatch("["..str.."]") do print(c) end
--result: K,O,R,Y,T,N,A,Ä,Œ,K,A
Anyone know how to solve this?

Lua is 8-bit clean: a Lua string is a sequence of bytes, and the string library treats each byte as one character. The pattern "%a" matches a single one-byte character, so the bytes that encode "Č" are skipped and the result is not what you expected.
The pattern "["..str.."]" matches because a Unicode character may consist of more than one byte; the pattern puts all of those bytes into the set, so each byte of the character matches individually.
If UTF-8 is used, you can use the pattern "[\0-\x7F\xC2-\xF4][\x80-\xBF]*" to match a single UTF-8 byte sequence in Lua 5.2, like this:
local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
print(c)
end
In Lua 5.1 (which is the version Corona SDK is using), use this:
local str = "KORYTNAČKA"
for c in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
print(c)
end
For details about this pattern, see Equivalent pattern to “[\0-\x7F\xC2-\xF4][\x80-\xBF]*” in Lua 5.1.
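To see what the pattern does independently of Lua, here is the same byte-level expression in Python's re module, applied to the UTF-8 bytes of the word. It is only a cross-check of the idea (one lead byte followed by any continuation bytes), not something to use in Corona:

```python
import re

# Lead byte (ASCII, or 0xC2-0xF4) followed by any number of continuation bytes
UTF8_CHAR = re.compile(rb"[\x00-\x7F\xC2-\xF4][\x80-\xBF]*")

data = "KORYTNAČKA".encode("utf-8")
chars = [match.decode("utf-8") for match in UTF8_CHAR.findall(data)]
print(chars)  # the two bytes of "Č" stay together as one item
```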

Lua has no built-in support for Unicode strings. You can see that Ä,Œ are the two bytes of the UTF-8 encoding of the Č character.
Yu Hao already provided a sample solution, but for more details here is a good source.
I've tested this solution and found it working properly in Lua 5.1, reserve link. You could extract individual characters using the utf8sub function, see sample.

string.gmatch(str, "[%z\1-\127\192-\253][\128-\191]*")

Use utf8 plugin. Then replace string.gmatch with utf8.gmatch.
Example (tested on Win7, it works for me)
yourfilename.lua
local utf8 = require( "plugin.utf8" )
for c in utf8.gmatch( "KORYTNAČKA", "%a" ) do print(c) end
and
build.settings
settings =
{
plugins =
{
["plugin.utf8"] =
{
publisherId = "com.coronalabs"
},
},
}
Read more:
Introducing the UTF-8 string plugin,
Documentation for utf8 plugin,
Lua String Manipulation,
string.gmatch().
Have a nice day:)

Erlang howto make a list from this binary <<"a,b,c">>

I have a binary <<"a,b,c">> and I would like to extract the information from this binary.
So I would like to have something like A=a, B=b and so on.
I need a general approach on this because the binary string always changes.
So it could be <<"aaa","bbb","ccc">>...
I tried to generate a list
erlang:binary_to_list(<<"a","b","c">>)
but I get string as a result.
"abc"
Thank you.

You did use the right method.
binary_to_list(Binary) -> [char()]
Returns a list of integers which correspond to the bytes of Binary.
There is no string type in Erlang: http://www.erlang.org/doc/reference_manual/data_types.html#id63119. The console just displays the lists in string representation as a courtesy, if all elements are in printable ASCII range.
You should read Erlang's "Bit Syntax Expressions" documentation to understand how to work on binaries.
Do not convert the whole binary into a list if you don't need it in list representation!
To extract the first three bytes you could use
<<A, B, C, Rest/binary>> = <<"aaa","bbb","ccc">>.
If you want to iterate over the binary data, you can use binary comprehension.
<< <<(F(X))>> || <<X>> <= <<"aaa","bbb","ccc">> >>.
Pattern matching is possible, too:
test(<<A, Tail/binary>>, Accu) -> test(Tail, Accu+A);
test(_, Accu) -> Accu.
882 = test(<<"aaa","bbb","ccc">>, 0).
This even works for reading one UTF-8 character at a time. So to convert a binary UTF-8 string into Erlang's "list of codepoints" format, you could use:
test(<<A/utf8, Tail/binary>>, Accu) -> test(Tail, [A|Accu]);
test(_, Accu) -> lists:reverse(Accu).
[97,97,97,600,99,99,99] = test(<<"aaa", 16#0258/utf8, "ccc">>, "").
(Note that <<"aaa","bbb","ccc">> = <<"aaabbbccc">>. Don't actually use the last code snippet but the linked method.)
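For comparison, the byte versus code point distinction in the snippets above can be checked in Python, whose bytes type also iterates as integers. This is just a cross-check of the numbers, not a replacement for the Erlang code:

```python
data = b"aaa" + b"bbb" + b"ccc"     # same bytes as <<"aaa","bbb","ccc">>
print(list(data[:3]))                # first three bytes as integers
print(sum(data))                     # sum of all byte values, like the first test/2

text = "aaa" + chr(0x0258) + "ccc"  # U+0258 takes two bytes in UTF-8
encoded = text.encode("utf-8")
code_points = [ord(ch) for ch in encoded.decode("utf-8")]
print(len(encoded), code_points)     # 8 bytes, but only 7 code points
```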

Sentence that uses every base64 character

I am trying to construct a sentence/letter combination that will return every base64 character, but failing to find a word for purposes of unit testing.
The unit tests I have so far are failing to hit the lines that handle the + and / characters. While I can sling them at the encoder/decoder directly, it would be nice to have a human-readable source (the base64 equivalent of 'the quick brown dog').

Here is a Base64 encoded test string that includes all 64 possible Base64 symbols:
char base64_encoded_test[] =
"U28/PHA+VGhpcyA0LCA1LCA2LCA3LCA4LCA5LCB6LCB7LCB8LCB9IHRlc3RzIEJhc2U2NCBlbmNv"
"ZGVyLiBTaG93IG1lOiBALCBBLCBCLCBDLCBELCBFLCBGLCBHLCBILCBJLCBKLCBLLCBMLCBNLCBO"
"LCBPLCBQLCBRLCBSLCBTLCBULCBVLCBWLCBXLCBYLCBZLCBaLCBbLCBcLCBdLCBeLCBfLCBgLCBh"
"LCBiLCBjLCBkLCBlLCBmLCBnLCBoLCBpLCBqLCBrLCBsLCBtLCBuLCBvLCBwLCBxLCByLCBzLg==";
char base64url_encoded_test[] =
"U28_PHA-VGhpcyA0LCA1LCA2LCA3LCA4LCA5LCB6LCB7LCB8LCB9IHRlc3RzIEJhc2U2NCBlbmNv"
"ZGVyLiBTaG93IG1lOiBALCBBLCBCLCBDLCBELCBFLCBGLCBHLCBILCBJLCBKLCBLLCBMLCBNLCBO"
"LCBPLCBQLCBRLCBSLCBTLCBULCBVLCBWLCBXLCBYLCBZLCBaLCBbLCBcLCBdLCBeLCBfLCBgLCBh"
"LCBiLCBjLCBkLCBlLCBmLCBnLCBoLCBpLCBqLCBrLCBsLCBtLCBuLCBvLCBwLCBxLCByLCBzLg==";
It decodes to a string composed entirely of relatively human-readable text:
char test_string[] = "So?<p>"
"This 4, 5, 6, 7, 8, 9, z, {, |, } tests Base64 encoder. "
"Show me: @, A, B, C, D, E, F, G, H, I, J, K, L, M, "
"N, O, P, Q, R, S, T, U, V, W, X, Y, Z, [, \\, ], ^, _, `, "
"a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s.";
This decoded string contains only characters in the limited range of isprint()'able 7-bit ASCII characters (space through '~').
Since I did it, I would argue that it is possible :-).
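If you want to double-check the claim, a few lines of Python will do. This sketch copies the encoded string from above, so it relies only on the standard base64 module:

```python
import base64
import string

encoded = (
    "U28/PHA+VGhpcyA0LCA1LCA2LCA3LCA4LCA5LCB6LCB7LCB8LCB9IHRlc3RzIEJhc2U2NCBlbmNv"
    "ZGVyLiBTaG93IG1lOiBALCBBLCBCLCBDLCBELCBFLCBGLCBHLCBILCBJLCBKLCBLLCBMLCBNLCBO"
    "LCBPLCBQLCBRLCBSLCBTLCBULCBVLCBWLCBXLCBYLCBZLCBaLCBbLCBcLCBdLCBeLCBfLCBgLCBh"
    "LCBiLCBjLCBkLCBlLCBmLCBnLCBoLCBpLCBqLCBrLCBsLCBtLCBuLCBvLCBwLCBxLCByLCBzLg=="
)

alphabet = set(string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/")
decoded = base64.b64decode(encoded)

print(set(encoded) - {"="} == alphabet)               # every one of the 64 symbols is used
print(all(0x20 <= byte <= 0x7E for byte in decoded))  # decoded text is printable ASCII
print(base64.b64encode(b"So?<p>"))                    # b'U28/PHA+'
```

The last line shows where the two awkward symbols come from: the 6-bit groups straddle byte boundaries, so the low bits of '?' produce '/' and the low bits of '>' produce '+'.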

You probably can't do that.
/ in base64 encodes 111111 (6 '1' bits).
As all ASCII (which are the type-able and printable characters) are in the range of 0-127 (i.e. 00000000 and 01111111), the only ASCII character that could be encoded using '/' is the ASCII character with the code 127, which is the non-printable DEL character.
If you allow values higher than 127, you could have a printable but non-typeable string.

When attempting to encode/decode, this is the one place where I break the rule of unit testing a single method at a time. You can have separate methods for encoding and decoding, but the only way to tell if you're doing it correctly is to use both encoding and decoding in a single assert. I would use the following pseudocode:
Generate a random string using Path.GetRandomFileName() (this string is cryptographically strong)
Pass the string to the encode method
Pass the output of the encode method to the decode method
Assert.AreEqual(input from GetRandomFileName, output from Decode)
You can loop over this as many times as you want in order to say it's tested. You can also cover some specific cases; however, since encoding (sometimes) differs based on the positioning of the letters, you're better off going with a random string and just calling encode/decode about 50 or so times.
If you find that encoding/decoding fails in accepted scenarios, create unit tests for those and filter out the strings that contain those characters/character combinations. Also, document those failures in XMLDocs comments, code comments, and any documentation your app has.
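The round-trip idea can be sketched in Python as follows. The encode and decode functions here are stand-ins built on the standard base64 module (swap in the implementation you are actually testing), and os.urandom plays the role of the random input:

```python
import base64
import os


def encode(data):
    """Stand-in for the encoder under test."""
    return base64.b64encode(data).decode("ascii")


def decode(text):
    """Stand-in for the decoder under test."""
    return base64.b64decode(text)


def check_round_trip(iterations=50):
    """Encode and decode random inputs and compare with the original."""
    for length in range(iterations):
        # growing lengths exercise all padding cases: none, "=", and "=="
        original = os.urandom(length)
        assert decode(encode(original)) == original


check_round_trip()
print("round trip ok")
```

Note that a round trip only shows the encoder and decoder agree with each other; comparing against a known-good encoded string catches the case where both are wrong in the same way.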

What I came up with may prove useful. It needs to be entered exactly as is. I include a link to a screenshot showing all the usually invisible characters below, as well as the Base64 data string to which it converts and a table of statistics for each of the 64 characters therein.
<HTML><HEAD></HEAD><BODY><PRE>
Did
THE
THE QUICK BROWN FOX
jump
over
the
lazy
dogs
or
was
he
pushed
?
</PRE><B>hmm.</B></BODY><HTML>
ÿß®Þ~c*¯/
This encodes to the Base64 string:
PEhUTUw+PEhFQUQ+PC9IRUFEPjxCT0RZPjxQUkU+DQpEaWQJDQoNCiBUSEUJDQoNCiAgVEhFIFFVSUNLIEJST1dOIEZPWAkNCg0KICAganVtcAkNCg0KICAgIG92ZXIJDQoNCiAgICAgdGhlCQ0KDQogICAgICBsYXp5CQ0KDQogICAgICAgZG9ncwkNCg0KICAgICAgICBvcgkNCg0KICAgICAgICAgd2FzCQ0KDQogICAgICAgICAgaGUJDQoNCiAgICAgICAgICAgcHVzaGVkCQ0KDQogICAgICAgICAgICA/CQ0KDQo8L1BSRT48Qj5obW0uPC9CPjwvQk9EWT48SFRNTD4NCg0KDQoNCg0KDQoNCg//367efmMqry/==
which contains
5--/'s
4--+'s
3--='s
14--0's
3--1's
3--2's
2--3's
4--4's
3--5's
2--6's
2--7's
4--8's
6--9's
5--a's
27--A's
2--b's
5--B's
5--c's
4--C's
4--d's
14--D's
2--e's
10--E's
2--f's
8--F's
36--g's
6--G's
5--h's
2--H's
5--i's
30--I's
5--j's
6--J's
8--k's
12--K's
2--l's
3--L's
2--m's
4--M's
3--n's
14--N's
13--o's
2--O's
3--p's
9--P's
2--q's
24--Q's
2--r's
5--R's
2--s's
6--S's
2--t's
7--T's
2--u's
1--U's
3--v's
6--V's
4--w's
5--W's
3--x's
6--X's
2--y's
4--Y's
3--z's
5--Z's
