How to handle a list that contains emoji in Python 3


I've been writing a function that takes a list containing only emoji, converts each one to its UTF-8 Unicode representation, and returns the resulting list. My current code seems to pass multiple arguments somewhere and raises an error. I'm new to handling emoji. Could you give me some tips?
main.py
def encode_emoji(emoji_list):
    result = []
    for i in range(len(emoji_list)):
        emoji = str(emoji_list[i])
        d_ord = format(ord(":{}:","#08x").format(emoji))
        result.append(str(d_ord))
        break
    return result
encode_emoji(["😀","😃","😄"])
Result of above code
Traceback (most recent call last):
  File "main.py", line 11, in <module>
    encode_emoji(["😀","😃","😄"])
  File "main.py", line 5, in encode_emoji
    d_ord = format(ord(":{}:","#08x").format(emoji))
TypeError: ord() takes exactly one argument (2 given)

I have no idea how you intend to get the UTF-8 encoding of an emoji with this line:
d_ord = format(ord(":{}:","#08x").format(emoji))
As the error message says, ord takes a single argument, a one-character string, and returns an integer (the character's code point). Even if the code were rearranged so that the value returned by ord(emoji) was formatted with "#08x", the result would only be the emoji's code point written as a 0x-prefixed hexadecimal number, not the UTF-8 byte sequence for the emoji.
To encode some text as UTF-8, just call the encode method of the string itself.
Also, in Python one almost never needs the for ... in range(len(...)) pattern, because for is designed to iterate directly over any sequence or iterable.
Your code also has a stray break statement that would stop processing after the first character.
Without using list-comprehension syntax, a function to encode emoji as UTF-8 byte strings is just:
def encode_emoji(emoji_list):
    result = []
    for part in emoji_list:
        result.append(part.encode("utf-8"))
    return result
Once you get more acquainted with the language and understand comprehensions, it is just:
def encode_emoji(emoji_list):
    return [part.encode("utf-8") for part in emoji_list]
Now, given the # in your format pattern, it may be that you have misunderstood what UTF-8 means and are simply trying to write the emoji as HTML numeric character references, which will later be embedded in text that is itself encoded as UTF-8.
In that case, you do have to call ord(emoji) to get its code point, then represent the resulting number as hexadecimal and replace the leading 0x that Python's hex call yields with &#x, the prefix of an HTML hexadecimal character reference:
def encode_emoji(emoji_list):
    return [hex(ord(emoji)).replace("0x", "&#x") + ";" for emoji in emoji_list]
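To see the difference between a code point, its UTF-8 bytes, and an HTML character reference side by side, here is a minimal sketch. The function name describe_emoji is made up for illustration, and it assumes every list item is a single code point; ord() raises TypeError for multi-code-point emoji such as flags or skin-tone sequences.

```python
def describe_emoji(emoji_list):
    """Return (emoji, code point, UTF-8 bytes, HTML reference) per item.

    Hypothetical helper. Assumes each item is exactly one code point.
    """
    rows = []
    for emoji in emoji_list:
        code_point = ord(emoji)  # integer code point, e.g. 0x1F600
        rows.append((
            emoji,
            "U+{:04X}".format(code_point),   # conventional Unicode notation
            emoji.encode("utf-8"),           # the actual UTF-8 byte sequence
            "&#x{:x};".format(code_point),   # HTML hexadecimal reference
        ))
    return rows


for row in describe_emoji(["😀", "😃", "😄"]):
    print(row)
```

The code point is a single number, while the UTF-8 encoding of these emoji is four bytes long, which is why the two should not be confused.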

TypeError: ord() takes exactly one argument (2 given)
I think the error is self-explanatory. ord takes one argument, but you are passing it two:
":{}:"
"#08x"
Here are some docs to read in case you need them.

Related

Type Conversion to Integer in Python using int() [duplicate]

I got this error from my code:
ValueError: invalid literal for int() with base 10: ''.
What does it mean? Why does it occur, and how can I fix it?

The error message means that the string provided to int could not be parsed as an integer. The part at the end, after the :, shows the string that was provided.
In the case described in the question, the input was an empty string, written as ''.
Here is another example - a string that represents a floating-point value cannot be converted directly with int:
>>> int('55063.000000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '55063.000000'
Instead, convert to float first:
>>> int(float('55063.000000'))
55063
See: https://www.geeksforgeeks.org/python-int-function/

The following work fine in Python:
>>> int('5') # passing the string representation of an integer to `int`
5
>>> float('5.0') # passing the string representation of a float to `float`
5.0
>>> float('5') # passing the string representation of an integer to `float`
5.0
>>> int(5.0) # passing a float to `int`
5
>>> float(5) # passing an integer to `float`
5.0
However, passing the string representation of a float, or any other string that does not represent an integer (including, for example, an empty string like '') will cause a ValueError:
>>> int('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''
>>> int('5.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '5.0'
To convert the string representation of a floating-point number to an integer, convert it to a float first and then to an integer (as explained in @katyhuff's comment on the question):
>>> int(float('5.0'))
5

int cannot convert an empty string to an integer. If the input string could be empty, consider either checking for this case:
if data:
    as_int = int(data)
else:
    pass  # do something else
or using exception handling:
try:
    as_int = int(data)
except ValueError:
    pass  # do something else
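The two approaches above can be combined into one small function. This is only a sketch: parse_int and its default parameter are hypothetical names, and it assumes that truncating a float-looking string such as '5.0' is acceptable for your data.

```python
def parse_int(text, default=None):
    """Hypothetical helper: parse text as an int, or return default.

    Handles empty strings, plain integers ('42', ' 42 ') and
    float-looking strings ('5.0'), truncating any fractional part.
    """
    text = text.strip()
    if not text:                 # empty or whitespace-only input
        return default
    try:
        return int(text)         # the common case: a plain integer
    except ValueError:
        pass
    try:
        return int(float(text))  # fall back for strings like '5.0'
    except (ValueError, OverflowError):
        # ValueError: not a number at all (or 'nan'); OverflowError: 'inf'
        return default


print(parse_int("55063.000000"))
print(parse_int(""))
print(parse_int("abc", default=0))
```

Returning a default instead of raising is a design choice; if a bad value should stop the program, letting the ValueError propagate is usually clearer.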

If the string holds a floating-point representation, float can parse it. Calling float first and then converting the result to an int will work:
output = int(float(input))

This error occurs when trying to convert an empty string to an integer:
>>> int(5)
5
>>> int('5')
5
>>> int('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''

The reason is that int is receiving either an empty string or a string that contains non-numeric characters. Check whether the string is empty or contains alphabetic characters, and skip such values.

Given floatInString = '5.0', that value can be converted to int like so:
floatInInt = int(float(floatInString))

You've got a problem with this line:
while file_to_read != " ":
This does not find an empty string. It finds a string consisting of one space. Presumably this is not what you are looking for.
Listen to everyone else's advice. This is not very idiomatic Python code, and it would be much clearer if you iterated over the file directly, but I think this problem is worth noting as well.

My simple workaround for this problem was to wrap my code in an if statement, taking advantage of the fact that an empty string is not "truthy".
Given either of these two inputs:
input_string = "" # works with an empty string
input_string = "25" # or a number inside a string
You can safely handle a blank string using this check:
if input_string:
    number = int(input_string)
else:
    number = None # (or number = 0 if you prefer)
print(number)

I recently came across a case where none of these answers worked. I encountered CSV data where there were null bytes mixed in with the data, and those null bytes did not get stripped. So, my numeric string, after stripping, consisted of bytes like this:
\x00\x31\x00\x0d\x00
To counter this, I did:
countStr = fields[3].replace('\x00', '').strip()
count = int(countStr)
...where fields is a list of csv values resulting from splitting the line.

This can also happen when you need to map space-separated integers to a list but enter the integers line by line using input().
For example, I was solving the Bon-Appetit problem on HackerRank and got this error when running my code.
So instead of giving the input to the program line by line, read the whole line and map the space-separated integers into a list using map().
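As a minimal sketch of that suggestion (parse_int_line is a hypothetical name, and the sample line is made up), splitting on whitespace and mapping int over the tokens looks like this:

```python
def parse_int_line(line):
    """Split a line on whitespace and convert every token to int."""
    return list(map(int, line.split()))


# With interactive input this would be parse_int_line(input());
# a literal string is used here so the sketch runs unattended.
print(parse_int_line("3 10 2 9"))
```

Because str.split() with no arguments discards empty tokens, extra spaces or a trailing newline do not produce the empty strings that trigger the ValueError.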

Your code is throwing the error because of this line:
readings = int(readings)
Here you are trying to convert a string to int, but the string is not a valid base-10 integer literal. int can convert a string only if it represents a whole number in base 10, such as '42' or '-7'; otherwise it raises ValueError: invalid literal for int() with base 10.

It seems that readings is sometimes an empty string, and that is when the error crops up.
You can add an extra check to your while loop before the int(readings) call, like:
while readings != 0 and readings != '':
    readings = int(readings)

I am creating a program that reads a file and if the first line of the file is not blank, it reads the next four lines. Calculations are performed on those lines and then the next line is read.
Something like this should work:
for line in infile:
    next_lines = []
    if line.strip():
        for i in range(4):
            try:
                next_lines.append(next(infile))
            except StopIteration:
                break
        # Do your calculation with "4 lines" here

Another answer in case all of the above solutions are not working for you.
My original error was similar to the OP's: ValueError: invalid literal for int() with base 10: '52,002'
I then tried the accepted answer, int(float(variable_name)), and got this error: ValueError: could not convert string to float: '52,002'
My solution is to convert the string to a float and leave it there. I just needed to check whether the string was a numeric value so I could handle it correctly.
try:
    float(variable_name)
except ValueError:
    print("The value you entered was not a number, please enter a different number")
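If a value like '52,002' is meant to be the integer 52002, one option is to strip the grouping character before parsing. This is a sketch under that assumption only: parse_grouped_int and group_sep are hypothetical names, and in locales where ',' is the decimal mark this would give the wrong answer.

```python
def parse_grouped_int(text, group_sep=","):
    """Hypothetical helper: parse '52,002' style strings as integers.

    Assumes group_sep is purely a digit-grouping character.
    Raises ValueError if the remaining text is not an integer.
    """
    return int(text.replace(group_sep, "").strip())


print(parse_grouped_int("52,002"))
```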

Yet another person who can't figure how to use unicode in Python

I know there are tons of questions about this, but somehow I could not find a solution to my problem (in Python 3):
toto="//\udcc3\udca0"
fp = open('cool', 'w')
fp.write(toto)
I get:
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 2: surrogates not allowed
How can I make it work?
A clarification: the string "//\udcc3\udca0" is given to me and I have no control over it. '\udcc3\udca0' is supposed to represent the character 'à'.

'\udcc3\udca0' is supposed to represent the character 'à'
The proper way to write 'à' using Python Unicode escapes is '\u00E0'. Its UTF-8 encoding is b'\xc3\xa0'.
It seems that whatever process produced your string was trying to use the UTF-8 representation, but instead of properly converting it to a Unicode string, it put the individual bytes in the U+DCxx range used by Python 3's surrogateescape convention.
>>> 'à'.encode('UTF-8').decode('ASCII', 'surrogateescape')
'\udcc3\udca0'
To fix the string, invert the operations that mangled it.
toto="//\udcc3\udca0"
toto = toto.encode('ASCII', 'surrogateescape').decode('UTF-8')
# At this point, toto == '//à', as intended.
fp = open('cool', 'w')
fp.write(toto)
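The mangling and its inverse can be checked as a round trip. The sketch below wraps the fix in a hypothetical helper, repair_surrogate_escaped; it assumes the lone surrogates really came from the surrogateescape error handler and that the rest of the string is plain ASCII (a genuine non-ASCII character would make the ASCII encode step raise UnicodeEncodeError).

```python
def repair_surrogate_escaped(text, encoding="utf-8"):
    """Hypothetical helper: undo bytes that were smuggled in as U+DCxx.

    Step 1 turns each U+DC80..U+DCFF surrogate back into its original
    byte; step 2 decodes those bytes with the encoding they were
    actually written in (assumed here to be UTF-8).
    """
    raw = text.encode("ascii", "surrogateescape")
    return raw.decode(encoding)


mangled = "à".encode("utf-8").decode("ascii", "surrogateescape")
print(repr(mangled))
print(repair_surrogate_escaped("//" + mangled))
```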

How to replace hex value in a string

While importing data from a flat file, I noticed some embedded hex-values in the string (<0x00>, <0x01>).
I want to replace them with specific characters, but am unable to do so. Removing them won't work either.
What it looks like in the exported flat file: https://i.imgur.com/7MQpoMH.png
Another example: https://i.imgur.com/3ZUSGIr.png
This is what I've tried:
(and mind, <0x01> represents a non-editable entity. It's not recognized here.)
import io
with io.open('1.txt', 'r+', encoding="utf-8") as p:
    s=p.read()
# included in case it bears any significance
import re
import binascii
s = "Some string with hex: <0x01>"
s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte
s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')
or something along these lines in hopes to get a grasp of it while iterating through the whole string:
for x in s:
    try:
        base64.encodebytes(x)
        base64.decodebytes(x)
        s.strip(binascii.unhexlify(x))
        s.decode('utf-8')
        s.encode('latin1').decode('utf-8')
    except:
        pass
Nothing seems to get the job done.
I'd expect the characters to be replaceable with the methods I've dug up, but they are not. What am I missing?
NB: I have to preserve umlauts (äöüÄÖÜ)
-- edit:
Could I introduce the hex-values in the first place when exporting? If so, is there a way to avoid that?
with io.open('out.txt', 'w', encoding="utf-8") as temp:
    temp.write(s)

Judging from the images, these are actually control characters.
Your editor displays them in this greyed-out way showing you the value of the bytes using hex notation.
You don't have the characters "0x01" in your data, but really a single byte with the value 1, so unhexlify and friends won't help.
In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits.
The fragment from the first image is probably equal to the following string:
"sich z\x01 B. irgendeine"
Your attempts to remove them were close.
s = s.replace('\x01', '.') should work.
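If several different control characters turn up (the question mentions both <0x00> and <0x01>), a character class can cover them in one pass. This is a sketch, not the only way to do it: replace_control_chars and CONTROL_CHARS are hypothetical names, the sample text is invented, and the range deliberately leaves tab, line feed and carriage return alone. Umlauts are untouched because they lie far outside the range.

```python
import re

# C0 control characters except tab (\x09), line feed (\x0a)
# and carriage return (\x0d).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def replace_control_chars(text, replacement="."):
    """Hypothetical helper: swap stray control characters for replacement."""
    return CONTROL_CHARS.sub(replacement, text)


sample = "sich z\x01 B. irgendeine \x00 Prüfung"
print(replace_control_chars(sample))
```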

Why is this error appearing?

AttributeError: 'builtin_function_or_method' object has no attribute 'encode'
I'm trying to make a text to code converter as an example for an assignment and this is some code based off of some I found in my research,
import binascii
text = input('Message Input: ')
data = binascii.b2a_base64.encode(text)
text = binascii.a2b_base64.encode(data)
print (text), "<=>", repr(data)
data = binascii.b2a_uu(text)
text = binascii.a2b_uu(data)
print (text), "<=>", repr(data)
data = binascii.b2a_hqx(text)
text = binascii.a2b_hqx(data)
print (text), "<=>", repr(data)
Can anyone help me get it working? It's supposed to take an input and then convert it into hex and other encodings and display those.
I am using Python 3.6, but I am also a little out of practice.

TL;DR:
data = binascii.b2a_base64(text.encode())
text = binascii.a2b_base64(data).decode()
print (text, "<=>", repr(data))
You've hit a common problem in Python 3: the str object versus the bytes object. A bytes object contains a sequence of bytes, and each byte can hold any number from 0 to 255. Those numbers are often interpreted through the ASCII table as characters such as English letters. In Python you should generally use bytes when working with binary data.
On the other hand, a str object contains a sequence of code points. One code point usually represents one character printed on your screen when you call print. To store or transmit a str it has to be encoded as bytes; in UTF-8, for example, the Chinese character 的 is encoded as a 3-byte sequence.
Now to your problem. The function requires a bytes object as input, but you got a str object from the input function. To convert a str into bytes, call the str.encode() method on it.
data = binascii.b2a_base64(text.encode())
Your original call binascii.b2a_base64.encode(text) means: call the encode method of the object binascii.b2a_base64 with the argument text. binascii.b2a_base64 is a function, and functions have no encode attribute, hence the AttributeError.
The function binascii.b2a_base64 returns a bytes object containing the original input encoded with the base64 algorithm. To get back the original str from the encoded data, call:
# Take base64 encoded data and return it decoded as bytes object
decoded_data = binascii.a2b_base64(data)
# Convert bytes object into str
text = decoded_data.decode()
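It can be written as one line:
text = binascii.a2b_base64(data).decode()
WARNING: Your print call does not do what you expect in Python 3. print (text), "<=>", repr(data) prints only text; the other two values merely form a tuple that is discarded (the interactive console echoes that tuple, which makes it look as if it works).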
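Putting the pieces together, here is a small sketch of the whole encode/decode round trip. It covers base64 and hex only, using binascii functions that exist in current Python 3; round_trip is a hypothetical name and "hi" stands in for the user's input.

```python
import binascii


def round_trip(message):
    """Encode message as base64 and hex, then decode both back to str."""
    raw = message.encode("utf-8")      # str -> bytes

    b64 = binascii.b2a_base64(raw)     # base64 as bytes, ends with b"\n"
    hexed = binascii.hexlify(raw)      # hex digits as bytes

    from_b64 = binascii.a2b_base64(b64).decode("utf-8")   # bytes -> str
    from_hex = binascii.unhexlify(hexed).decode("utf-8")  # bytes -> str
    return b64, hexed, from_b64, from_hex


b64, hexed, from_b64, from_hex = round_trip("hi")
print(from_b64, "<=>", repr(b64))
print(from_hex, "<=>", repr(hexed))
```

Every binascii call here takes and returns bytes, so the only places where str appears are the encode at the start and the decode at the end.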

Getting error while encoding str in UTF8 format

This is working code.
line = 'line'
another_line = 'new ' + line
another_line.encode('utf-8')
output
b'new line'
Now I'm trying to figure out why the code below raises an error in Python 3, whereas in Python 2 it gives me a concatenated string.
line = 'line'
'new '+line.encode('utf-8')
TypeError: Can't convert 'bytes' object to str implicitly

As the error states, Python 3 will not implicitly convert a bytes object to a str. The + operator sees a str on the left, so it needs a str on the right as well, and you have to do the conversion explicitly.
line = 'line'
print('new '+str(line.encode('utf-8')))
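Note that this gives slightly different output, new b'line', because str() of a bytes object returns its printable representation rather than decoding it.
If you want the same output as your working code (b'new line'), encode both parts: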
line = 'line'
print('new '.encode('utf-8')+line.encode('utf-8'))
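A third option, sketched below with made-up variable names, is to keep everything as str until the last moment, or to decode the bytes side before concatenating. Which one fits depends on whether the final result should be text or bytes.

```python
line = "line"

# Option 1: stay in str and encode once at the end.
as_bytes = ("new " + line).encode("utf-8")

# Option 2: if one side is already bytes, decode it before concatenating.
encoded = line.encode("utf-8")
as_text = "new " + encoded.decode("utf-8")

print(as_bytes)
print(as_text)
```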
From the docs
"The + (addition) operator yields the sum of its arguments. The arguments must either both be numbers or both be sequences of the same type. In the former case, the numbers are converted to a common type and then added together. In the latter case, the sequences are concatenated." and "Python evaluates expressions from left to right. Notice that while evaluating an assignment, the right-hand side is evaluated before the left-hand side."
