Converting surrogate pairs to emoji - python3 - python-3.x

I found a solution to a similar question in another topic, but unfortunately it's not working for me.
Here is my problem:
I'm building a dataframe from surrogate-pair escape sequences that I'd like to search for in another file (examples: "\uD83C\uDFF3", "\u26F9", "\uD83C\uDDE6\uD83C\uDDE8"):
with open("unicodes.csv", "rt") as csvfile:
    emoticons = pd.read_csv(csvfile, names=["xy"])
emoticons = pd.DataFrame(emoticons)
emoticons = emoticons.astype(str)
Next I read my text file, in which some lines contain surrogate-pair escape sequences:
for chunk in pd.read_csv(path, names=["xy"], encoding="utf-8", chunksize=chunksize):
    spam = pd.DataFrame(chunk)
    spam = spam.astype(str)
In this for loop I check whether a line contains one of the surrogate-pair sequences and, if it does, I'd like to print that sequence as an emoji. That's why I'm encoding and decoding the value i, which is a str:
(solution from: How to work with surrogate pairs in Python?)
for i in emoticons.xy:
    if spam["xy"].str.contains(i, regex=False).any():
        print(i.encode('utf-16', 'surrogatepass').decode('utf-16'))
#printing:
#\uD83C\uDFF3
#\u26F9
#\uD83C\uDDE6\uD83C\uDDE8
So, when I run the program it still prints the surrogate-pair sequences as plain strings, not as emoji, but when I type a surrogate pair into the print function myself, it works:
print("\uD83C\uDFF3".encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass"))
#printing:
#🏳
What am I doing wrong? I tried converting i to a string, and other solutions, but it still doesn't work.
EDIT:
hexdump -C file.csv
00004b70 5c 75 44 38 33 44 5c 75 44 45 45 39 0a 5c 75 44 |\uD83D\uDEE9.\uD|
00004b80 38 33 44 5c 75 44 45 45 42 0a 5c 75 44 38 33 44 |83D\uDEEB.\uD83D|
00004b90 5c 75 44 45 45 43 0a 5c 75 44 38 33 44 5c 75 44 |\uDEEC.\uD83D\uD|
00004ba0 43 42 41 0a 5c 75 44 38 33 44 5c 75 44 45 38 31 |CBA.\uD83D\uDE81|
EDIT2:
So I've found something that sort of works, but it still needs improvement:
https://stackoverflow.com/a/54918256/4789281
The text from my other file, which I want to convert, looks like this:
"O żółtku zapomniałaś \uD83D\uDE02"
"Piękny outfit \uD83D\uDE0D"
When I do what was recommended in the other topic:
print(codecs.decode(i,encoding='unicode_escape',errors='surrogateescape').encode('utf-16', 'surrogatepass').decode('utf-16'))
I've got something like this:
O żóÅtku zapomniaÅaÅ 😂
PiÄkny outfit 😍
So my surrogate pairs are replaced, but my Polish characters are replaced with something strange.

You are on the right track.
What you are trying to do breaks because what you have in your "str" after you read the file are not "surrogate pairs". Instead, they are backslash-escaped representations of your surrogate pairs, stored as text.
That is: the sequence "5c 75 44 38 33 44" in your file is the ACTUAL ASCII characters "\uD83D" (6 characters in total), not the surrogate codepoint 0xD83D (which, when properly decoded together with the following surrogate, for example "\uDE0D", becomes a single character in your string).
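The distinction can be checked directly in Python 3. This is only a minimal sketch; the variable names are made up for illustration and do not come from the question.

```python
# Text as it is read from the file: backslash, "u" and four hex digits, twice.
escaped = "\\uD83D\\uDE0D"
# Two real surrogate code points (what the escape sequences stand for).
pair = "\ud83d\ude0d"
# The single character that this surrogate pair encodes in UTF-16.
emoji = "\U0001F60D"

print(len(escaped), len(pair), len(emoji))  # 12 2 1

# A UTF-16 round trip leaves the escaped ASCII text unchanged ...
unchanged = escaped.encode("utf-16", "surrogatepass").decode("utf-16")
print(unchanged == escaped)  # True

# ... and only merges real surrogate code points into one character.
merged = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(merged == emoji)  # True
```

This is why the loop in the question prints the text back exactly as it was read.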
Where you are on the right track: you really do have to encode this into a byte sequence and then decode it back. What needs to change is the choice of codecs. Encode the text with "latin1", a charmap encoding that maps each character to one byte and so preserves the other non-ASCII characters in the string (it will raise an error if the string contains codepoints that Latin-1 cannot represent, for example the Polish letter "ł"), and then decode the bytes with the special "unicode escape" codec. At that point, the two surrogates will be present in the Python string as two characters:
In [16]: "\\uD83D\\uDE0D".encode("latin1").decode("unicode escape", "surrogatepass")
Out[16]: '\ud83d\ude0d'
The bad news is that this is not really a valid str: surrogate characters should not exist by themselves in the internal representation. Instead, they should be combined into the desired final character. So, trying to print that out will break:
In [19]: a = "\\uD83D\\uDE0D".encode("utf-8").decode("unicode escape")
In [20]: print(a)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-20-bca0e2660b9f> in <module>
----> 1 print(a)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
Using the "surrogatepass" error policy here will be of no help: you will get an unprintable byte sequence.
Therefore the string has to be "encoded" and "decoded" a second time. This time, the characters you have in the text are actual "surrogate" codepoints that form valid UTF-16 once encoded. So the path now is to encode this sequence to UTF-16, forcing the surrogates through with "surrogatepass", and then decode it back from UTF-16, which will finally interpret the surrogate pair as a single character:
In [30]: a = "\\uD83D\\uDE0D".encode("latin1").decode("unicode escape")
In [31]: a
Out[31]: '\ud83d\ude0d'
In [32]: b = a.encode("utf-16", "surrogatepass").decode("utf-16")
In [33]: print(b)
😍
Summarising:
You read your file as UTF-8 text, so that any other non-ASCII characters are read correctly. You then encode the result as "latin1" and decode it with "unicode escape", which converts the human-readable "\uXXXX" sequences in your file into surrogate codepoints. Finally you encode to UTF-16, telling Python to pass the surrogates through "as is", and decode back from UTF-16:
def decode_surrogate_ascii(txt):
    interm = txt.encode("latin1").decode("unicode escape")
    return interm.encode("utf-16", "surrogatepass").decode("utf-16")
All you have to do is apply the above function to the columns of interest in your data frame:
emoticons["xy"] = emoticons["xy"].apply(lambda item: decode_surrogate_ascii(item) if isinstance(item, str) else item)
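Because the Latin-1 step raises an error on text such as the Polish sentences in the question, one possible variation replaces only the literal "\uXXXX" sequences with a regular expression and leaves everything else untouched. This is a sketch of a swapped-in technique, not the method described above; the function name `decode_escaped_surrogates` and the sample string are assumptions made for illustration.

```python
import re

# Matches one literal escape such as \uD83D: backslash, "u", four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_escaped_surrogates(text):
    # Step 1: replace each escape with the code unit it names. Surrogate
    # halves become lone surrogate code points; other text is left alone.
    interm = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so each high/low pair is merged
    # into one character. An unpaired surrogate makes the decode step fail.
    return interm.encode("utf-16", "surrogatepass").decode("utf-16")


sample = "Piękny outfit \\uD83D\\uDE0D"
decoded = decode_escaped_surrogates(sample)

# Show the last code point instead of printing the emoji itself, so the
# example does not depend on the console encoding.
print(len(decoded), hex(ord(decoded[-1])))  # 15 0x1f60d
```

The same function can be passed to a pandas column in place of `decode_surrogate_ascii`.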

Related

How to interpret numbers with leading zeros as decimal?

Hi all!
I need to interpret numbers written with leading zeros in Python as decimal (the input consists of numbers, not strings).
I am using Python 2; in Python 3 this problem no longer exists.
I have no idea how to do this.
Can anybody help me, please?
Example:
id = 0101
print id
# will print 65 and I need 101
id = 65
print id
# will print 65 - ok
possible solution:
id = 0101
id = oct(id).lstrip('0')
print id
# will print 101 - ok
id = 65
id = oct(id).lstrip('0')
print id
# will print 101 - wrong, need 65
This is normal behaviour for Python 2. Numbers of this kind are defined in the language grammar:
octinteger ::= "0" ("o" | "O") octdigit+ | "0" octdigit+
"0" octdigit+ (numbers that start with "0") are octal by design. You can't change this behaviour.
If you want to interpret 077 as 77, the most you can do is an ugly transformation like this:
int(str(oct(077)).lstrip('0'))
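The same idea can be sketched in Python 3, where the octal literal has to be written as 0o101. The helper name `octal_digits_as_decimal` is an assumption made for this example. Note that it cannot tell whether a value was originally written as 0101 or as 65, which is the limitation pointed out in the question.

```python
def octal_digits_as_decimal(value):
    # Write the integer out in octal, then re-read those digits as decimal.
    return int(format(value, "o"))


print(octal_digits_as_decimal(0o101))  # 0o101 == 65, prints 101
print(octal_digits_as_decimal(0o77))   # 0o77 == 63, prints 77

# If the original text is available, no transformation is needed:
# int() with base 10 simply ignores the leading zero.
print(int("0101", 10))                 # prints 101
```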
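Can you get the value as a string instead? Once the literal 0101 has been parsed it is already the integer 65, and str(0101) is '65', so the leading zero can only be handled on the original text.
For example:
def func(rawNumber):
    # rawNumber is the original text, e.g. '0101'
    digits = str(rawNumber).lstrip('0')
    # an input made only of zeros means 0
    return int(digits) if digits else 0
# then use it like this:
print(func('0101')) # will print 101
print(func('65')) # will print 65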

block cipher in CTR mode manipulation

I have a question about a block cipher in CTR mode. I think I need to find a value such that 46 XOR value = 43. I get the value (1011 1101), then I compute 0x64 (0110 0100) XOR value (1011 1101), but it does not give me 0x72 (0111 0010). My understanding is that, in order to do this, all I need to do is find a value that adds the counter (in this case zero) and XOR it with the plaintext to get the ciphertext. Did I miss something here? Thank you in advance.
You know that the 2nd and 3rd block were created by the same key stream (created by concatenating the counter values encrypted by the block cipher).
So for the first byte of the second block you'd have 46 = 43 ^ KK and 51 = P2 ^ KK where KK is the first byte of the key stream. Now KK can be easily calculated, as KK = 46 ^ 43 (KK = 05 if I'm not mistaken). Now P2 = KK ^ 51 or P2 = 05 ^ 51 = 54.
You can simply repeat that for each index into the streams and presto. You don't have to do anything with the counter itself; knowing that the same key and counter were used is enough to generate the same key stream.
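The calculation above can be sketched in Python for a single byte. The variable names are assumptions for illustration, and only the three byte values quoted in the answer are used; for whole blocks the same helper is applied to the full byte strings.

```python
def xor_bytes(a, b):
    # XOR two byte strings position by position.
    return bytes(x ^ y for x, y in zip(a, b))


known_cipher = bytes([0x46])   # ciphertext byte with a known plaintext
known_plain = bytes([0x43])    # the matching plaintext byte
target_cipher = bytes([0x51])  # ciphertext byte to decrypt

# Same key and counter means the same key stream byte KK for both blocks.
key_stream = xor_bytes(known_cipher, known_plain)
target_plain = xor_bytes(target_cipher, key_stream)

print(key_stream.hex(), target_plain.hex())  # 05 54
```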

How to slice off a certain number of bytes from a string in Python?

I am trying to write a specific number of bytes of a string to a file. In C, this would be trivial: since each character is 1 byte, I would simply write however many characters from the string I want.
In Python, however, since apparently each character/string is an object, they are of varying sizes, and I have not been able to find how to slice the string at byte-level specificity.
Things I have tried:
Bytearray:
(For $, read >>>, which messes up the formatting.)
$ barray = bytearray('a')
$ import sys
$ sys.getsizeof(barray[0])
24
So turning a character into a bytearray doesn't turn it into an array of bytes as I expected and it's not clear to me how to isolate individual bytes.
Slicing byte objects as described here:
$ value = b'a'
$ sys.getsizeof(value[:1])
34
Again, a size of 34 is clearly not 1 byte.
memoryview:
$ value = b'a'
$ mv = memoryview(value)
$ sys.getsizeof(mv[0])
34
$ sys.getsizeof(mv[0][0])
34
ord():
$ n = ord('a')
$ sys.getsizeof(n)
24
$ sys.getsizeof(n[0])
Traceback (most recent call last):
File "<pyshell#29>", line 1, in <module>
sys.getsizeof(n[0])
TypeError: 'int' object has no attribute '__getitem__'
So how can I slice a string into a particular number of bytes? I don't care if slicing the string actually leads to individual characters being preserved or anything as with C; it just has to be the same each time.
Make sure the string is encoded as a byte string (in Python 2.7 a plain str already is one).
Then just slice the string object and write the result to a file.
In [26]: s = '一二三四'
In [27]: len(s)
Out[27]: 12
In [28]: with open('test', 'wb') as f:
....: f.write(s[:2])
....:
In [29]: !ls -lh test
-rw-r--r-- 1 satoru wheel 2B Aug 24 08:41 test
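In Python 3 the encoding step has to be explicit. The following is a minimal sketch, assuming UTF-8 and a hypothetical file name `test.bin` in a temporary directory. It also shows that `len()` on a bytes object counts bytes, whereas `sys.getsizeof` reports the size of the whole Python object including its header, which is why the numbers in the question are never 1.

```python
import os
import tempfile

text = "一二三四"
data = text.encode("utf-8")    # bytes object: one element per byte

print(len(text), len(data))    # 4 12

# Hypothetical output path, created in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test.bin")
with open(path, "wb") as f:
    f.write(data[:2])          # write exactly the first two bytes

print(os.path.getsize(path))   # 2
```

Slicing in the middle of a multi-byte character leaves an incomplete UTF-8 sequence in the file, which is acceptable here because the question only requires a fixed byte count.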

Convert a string to Hex in VB

In VB, I'm reading a file line by line using IO.File.Readline() Method. Each line of the file contains a string similar to the following
":1A2C003F4EDCFE3A2F5D66\r\n"
Now, for each line I read, all I want to do is:
1. Remove the ":" and "\r\n" from the line
2. Pair the values as bytes, e.g. "1A 2C 00"... (the line would then be "1A 2C 00 3F 4E DC FE 3A 2F 5D 66")
3. Add all the bytes together and check whether the result is zero, e.g. (1A+2C+00+3F+4E+DC+FE+3A+2F+5D+66)=0?
How can I proceed?
So far I have done:
While endofstream = False
    stringReader = fileReader.ReadLine()
    If stringReader.StartsWith(":") Then
        stringReader = stringReader.Replace(vbCr, "")
        stringReader = stringReader.Replace(":", "")
        MsgBox(stringReader)
    End If
End While
Be careful, though: shouldn't the groups be 4 characters each (1A2C 003F 4EDC ...)?
Otherwise, all you have to do is convert each hex pair to a number and sum them:
Dim sum As Integer
For index As Integer = 0 To stringReader.Length - 1 Step 2
    ' we take 2 chars
    ' we use ToInt32 method http://msdn.microsoft.com/en-us/library/f1cbtwff.aspx
    sum += Convert.ToInt32(stringReader.Chars(index) & stringReader.Chars(index + 1), 16)
Next
' use sum
In my case, the result is 985
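For comparison, the same three steps can be sketched in Python, using the sample line from the question. The variable names are illustrative. The final modulo step is an assumption: it only applies if the sum is meant to wrap around at one byte, which the question does not state.

```python
line = ":1A2C003F4EDCFE3A2F5D66\r\n"

# 1. Remove the line ending and the leading ":".
payload = line.strip().lstrip(":")

# 2. Pair the hex digits into byte values.
values = bytes.fromhex(payload)

# 3. Add all the bytes together.
total = sum(values)
print(total)        # 985

# Assumption: if the check is done in 8-bit arithmetic, keep the low byte.
print(total % 256)  # 217
```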

Perl CGI: get request not working unless URL is hard-coded

I'm trying to identify the differences between the two strings during a URL get request (using LWP::Simple).
I have a URL, say http://www.example.com?param1=x&param2=y&param3=z
I make sure any blank inputs are also taken care of, but that is irrelevant at this point, because I am making sure all parameters are exactly the same.
Also, the hard-coded URL is copied and pasted from the generated URL.
This URL works when I do the following:
my $url = "http://www.example.com?param1=x&param2=y&param3=z";
my $content = get($url);
Yet, when I build the URL from parameters provided by a user, the get request does not work (Error: 500 from the site).
I have compared the two URLs by printing them out, and see zero differences. I've tried removing all of the potential invisible characters.
The output for the generated code and static string, assuming user input is the same as the static string (which is what I'm making sure to do):
http://www.example.com?param1=x&param2=y&param3=z
http://www.example.com?param1=x&param2=y&param3=z
I'm assuming printing the outputs removes characters I can't see.
I've also followed a solution at http://www.perlmonks.org/?node_id=882590 and it is pointing out differences, but I don't know why, considering I see none at all.
Has anyone run into this problem before? Please let me know if I need to clarify anything or need to provide additional information.
EDIT: Problem and Solution
So, after using mob's suggestion to identify differences, I found there was a null character in the generated URL that was not getting printed in the output. That is: http://www.example.com?param1=x&param2=y&param3=z was actually http://www.example.com?param1=x&param2=y&param3=\000z.
I used a simple regex: $url =~ s/\000//g; to remove that (and any other) null value.
Use a data serialization function to inspect your strings for hidden characters.
$url1 = "http://www.example.com?param1=x&param2=y";
$url2 = "http://www.example.com?param1=x&param2=y\0";
$url3 = "http://www.example.com?param1=x&param2=y\n";
use JSON;
print JSON->new->pretty(1)->encode( [$url1,$url2,$url3] );
# Result:
# [
# "http://www.example.com?param1=x&param2=y",
# "http://www.example.com?param1=x&param2=y\u0000",
# "http://www.example.com?param1=x&param2=y\n"
# ]
use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper($url1,$url2,$url3);
# Result:
# $VAR1 = "http://www.example.com?param1=x&param2=y";
# $VAR2 = "http://www.example.com?param1=x&param2=y\0";
# $VAR3 = "http://www.example.com?param1=x&param2=y\n";
Clearly the string you have built is different from the hard-coded one. If you write code like this
my $ss = 'http://www.example.com?param1=x&param2=y&param3=z';
print join(' ', map " $_", $ss =~ /./g), "\n";
print join(' ', map sprintf('%02X', ord), $ss =~ /./g), "\n";
then you will be able to see the hex value of each character in the string, and you can compare the two of them more accurately. For instance, the code above outputs
h t t p : / / w w w . e x a m p l e . c o m ? p a r a m 1 = x & p a r a m 2 = y & p a r a m 3 = z
68 74 74 70 3A 2F 2F 77 77 77 2E 65 78 61 6D 70 6C 65 2E 63 6F 6D 3F 70 61 72 61 6D 31 3D 78 26 70 61 72 61 6D 32 3D 79 26 70 61 72 61 6D 33 3D 7A
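The same inspection can be sketched in Python for readers who do not use Perl. The helper names `hex_dump` and `first_difference` are made up for this example, and the generated URL contains the null character described in the question's edit.

```python
def hex_dump(s):
    # Hex code of every character, separated by spaces.
    return " ".join(f"{ord(ch):02X}" for ch in s)


def first_difference(a, b):
    # Index of the first position where a and b differ, or None if equal.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


hard_coded = "http://www.example.com?param1=x&param2=y&param3=z"
generated = "http://www.example.com?param1=x&param2=y&param3=\x00z"

# repr() escapes non-printable characters, so the NUL becomes visible.
print(repr(generated))

idx = first_difference(hard_coded, generated)
print(idx, hex_dump(generated[idx:]))
```

Printing the two strings directly would look identical on most terminals, which matches what the question describes.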
