Why are some emojis not converted back into their original representation? - python-3.x

I am working on an emoji detection module. For some emojis I am observing weird behavior: after converting them to UTF-8 encoding, they are not converted back to their original representation. I need their exact colored representation to be sent as the API response instead of a Unicode-escaped string. Any leads?
In [1]: x = "example1: 🤭 and example2: 😁 and example3: 🥺"
In [2]: x.encode('utf8')
Out[2]: b'example1: \xf0\x9f\xa4\xad and example2: \xf0\x9f\x98\x81 and example3: \xf0\x9f\xa5\xba'
In [3]: x.encode('utf8').decode('utf8')
Out[3]: 'example1: \U0001f92d and example2: 😁 and example3: \U0001f97a'
In [4]: print( x.encode('utf8').decode('utf8') )
example1: 🤭 and example2: 😁 and example3: 🥺
Update 1:
This example should make it clearer. Here, two emojis are rendered when I send the Unicode-escaped string, but the 3rd example failed to convert to the exact emoji. What should I do in such a case?

'\U0001f92d' == '🤭' is True. It is an escape code, but it is still the same character...just two ways of displaying/entering it. The former is the repr() of the string; printing calls str(). Example:
>>> s = '🤭'
>>> print(repr(s))
'\U0001f92d'
>>> print(str(s))
🤭
>>> s
'\U0001f92d'
>>> print(s)
🤭
When Python generates the repr() it uses an escape code representation if it thinks the display can't handle the character. The content of the string is still the same...the Unicode code point.
It's a debug feature. For example, is the white space spaces or tabs? The repr() of the string makes it clear by using \t as an escape code.
>>> s = 'a\tb'
>>> print(s)
a b
>>> s
'a\tb'
As to why an escape code is used for one emoji and not another, it depends on the version of Unicode supported by the version of Python used.
Python 3.6, for example, uses Unicode 9.0, and one of your emoji isn't defined at that version level:
>>> import unicodedata as ud
>>> ud.unidata_version
'9.0.0'
>>> ud.name('😁')
'GRINNING FACE WITH SMILING EYES'
>>> ud.name('🤭')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
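For the API-response goal specifically: the escaped form only shows up in debug/repr output; the string object already contains the real emoji. If the response is JSON, the escaping you see likely comes from the serializer. A minimal sketch, assuming a plain json.dumps call (your framework may serialize differently):
import json

x = "example1: 🤭 and example2: 😁 and example3: 🥺"

# By default json.dumps escapes every non-ASCII character.
print(json.dumps({"text": x}))
# {"text": "example1: \ud83e\udd2d and example2: \ud83d\ude01 and example3: \ud83e\udd7a"}

# ensure_ascii=False leaves the emoji in the output as-is.
print(json.dumps({"text": x}, ensure_ascii=False))
# {"text": "example1: 🤭 and example2: 😁 and example3: 🥺"}
Either form decodes to the same string on the client; ensure_ascii=False just sends the raw UTF-8 characters instead of \u escapes.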

Related

How to convert ISO 8859-1 characters to plain letters using Python

I'm trying to clean my SQLite database using Python. At first I loaded it using this code:
import sqlite3, pandas as pd
con = sqlite3.connect("DATABASE.db")
df = pd.read_sql_query("SELECT TITLE from DOCUMENT", con)
So I got the dirty words. For example, for "Conciliaci\363n" I want to get "Conciliacion". I used this code:
df['TITLE'] = df['TITLE'].apply(lambda x: x.decode('iso-8859-1').encode('utf8'))
I got b'' in blank cells, and got 'Conciliaci\\363n' too. So maybe I'm doing it wrong. How can I solve this problem? Thanks in advance.
It's unclear, but if your string contains a literal backslash and numbers like this:
>>> s= r"Conciliaci\363n" # A raw string to make a literal escape code
>>> s
'Conciliaci\\363n' # debug display of string shows an escaped backslash
>>> print(s)
Conciliaci\363n # printing prints the escape
Then this will decode it correctly:
>>> s.encode('ascii').decode('unicode-escape') # convert to byte string, then decode
'Conciliación'
If you want to lose the accent mark as your question shows, then decomposing the Unicode string, converting to ASCII ignoring errors, then converting back to a Unicode string will do it:
>>> s2 = s.encode('ascii').decode('unicode-escape')
>>> s2
'Conciliación'
>>> import unicodedata as ud
>>> ud.normalize('NFD',s2) # Make Unicode decomposed form
'Conciliación' # The ó is now an ASCII 'o' and a combining accent
>>> ud.normalize('NFD',s2).encode('ascii',errors='ignore').decode('ascii')
'Conciliacion' # accent isn't ASCII, so is removed
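Putting both steps together for the DataFrame column from the question, a minimal sketch (assuming the TITLE values are str objects containing literal \363-style escapes; clean_title is just an illustrative helper name):
import unicodedata as ud

def clean_title(s):
    # turn literal escapes like '\363' into the real character ('ó') ...
    decoded = s.encode('ascii').decode('unicode-escape')
    # ... then drop accents: decompose, keep only ASCII, and re-decode
    return ud.normalize('NFD', decoded).encode('ascii', errors='ignore').decode('ascii')

df['TITLE'] = df['TITLE'].apply(clean_title)
If a value can be None or already contain non-ASCII characters, guard for that before the encode('ascii') step, since it would raise there.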

Python getting a url request to work with special characters [duplicate]

If I do
url = "http://example.com?p=" + urllib.quote(query)
It doesn't encode / to %2F (breaks OAuth normalization)
It doesn't handle Unicode (it throws an exception)
Is there a better library?
Python 2
From the documentation:
urllib.quote(string[, safe])
Replace special characters in string using the %xx escape. Letters, digits, and the characters '_.-' are never quoted. By default, this function is intended for quoting the path section of the URL. The optional safe parameter specifies additional characters that should not be quoted — its default value is '/'.
That means passing '' for safe will solve your first issue:
>>> urllib.quote('/test')
'/test'
>>> urllib.quote('/test', safe='')
'%2Ftest'
About the second issue, there is a bug report about it. Apparently it was fixed in Python 3. You can work around it by encoding as UTF-8 like this:
>>> query = urllib.quote(u"Müller".encode('utf8'))
>>> print urllib.unquote(query).decode('utf8')
Müller
By the way, have a look at urlencode.
Python 3
In Python 3, the function quote has been moved to urllib.parse:
>>> import urllib.parse
>>> print(urllib.parse.quote("Müller".encode('utf8')))
M%C3%BCller
>>> print(urllib.parse.unquote("M%C3%BCller"))
Müller
In Python 3, urllib.quote has been moved to urllib.parse.quote, and it does handle Unicode by default.
>>> from urllib.parse import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
>>> quote('/El Niño/')
'/El%20Ni%C3%B1o/'
I think the requests module is much better. It's based on urllib3.
You can try this:
>>> from requests.utils import quote
>>> quote('/test')
'/test'
>>> quote('/test', safe='')
'%2Ftest'
My answer is similar to Paolo's answer.
If you're using Django, you can use urlquote:
>>> from django.utils.http import urlquote
>>> urlquote(u"Müller")
u'M%C3%BCller'
Note that changes to Python mean that this is now a legacy wrapper. From the Django 2.1 source code for django.utils.http:
A legacy compatibility wrapper to Python's urllib.parse.quote() function.
(was used for unicode handling on Python 2)
It is better to use urlencode here. There isn't much difference for a single parameter, but, IMHO, it makes the code clearer. (It looks confusing to see a function named quote_plus, especially to those coming from other languages.)
In [21]: query='lskdfj/sdfkjdf/ksdfj skfj'
In [22]: val=34
In [23]: from urllib.parse import urlencode
In [24]: encoded = urlencode(dict(p=query,val=val))
In [25]: print(f"http://example.com?{encoded}")
http://example.com?p=lskdfj%2Fsdfkjdf%2Fksdfj+skfj&val=34
Documentation: urlencode, quote_plus
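To illustrate the naming, a minimal sketch of the difference between quote and quote_plus (the sample string is made up):
>>> from urllib.parse import quote, quote_plus
>>> quote('El Niño/rain')       # path-style: space -> %20, '/' left alone
'El%20Ni%C3%B1o/rain'
>>> quote_plus('El Niño/rain')  # query-style: space -> '+', '/' escaped too
'El+Ni%C3%B1o%2Frain'
urlencode quotes each key and value with quote_plus by default, which is why it is the convenient choice for query strings.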
An alternative method using furl:
import furl
url = "https://httpbin.org/get?hello,world"
print(url)
url = furl.furl(url).url
print(url)
Output:
https://httpbin.org/get?hello,world
https://httpbin.org/get?hello%2Cworld

Read data with multi delimiter in pyspark

I have an input file which looks like this and has "|" (quote, pipe, quote) as a multi-character delimiter:
162300111000000000106779"|"2005-11-16 14:12:32.860000000"|"1660320"|"0"|"2005-11-16 14:12:32.877000000"|""|""|""|""|""|""|""|"False"|"120600111000000000106776
I can read this type of record with a UDF as below:
import re
from pyspark.sql.functions import UserDefinedFunction

inputDf = glueContext.sparkSession.read.option("delimiter", input_file_delimiter) \
    .csv("s3://" + landing_bucket_name + "/" + input_file_name)
udf = UserDefinedFunction(lambda x: re.sub('"', '', str(x)))
new_df = inputDf.select(*[udf(column).alias(column) for column in inputDf.columns])
but when I get an input file like this:
000/00"|"AE71501"|"Complaint for Attachment of Earnings Order"|"In accordance with section test of the Attachment of Test Act Test."|"Non-Test"|"Other non-test offences"|"N"|"Other Non-Test"|"Non-Test
I am getting the below exception while reading it with the same UDF; my code fails at the exact same location where I have my UDF:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 66: ordinal not in range(128)
Any help on the below would be great:
- Optimized code to read both types of files, considering "|" as the separator.
- How my existing UDF can handle the second type of input records.
This is likely caused by running in Python 2.x which has two separate types for string-like objects (unicode strings and non-unicode strings, which are nowadays simply byte sequences).
Spark will read in your data (which are bytes, as there is no such thing as plain text), and decode the lines as a sequence of Unicode strings. When you call str on a Unicode string that has a codepoint that is not in the ASCII range of codepoints, Python 2 will produce an error:
# python2.7
>>> unicode_str = u"ú"
>>> type(unicode_str)
<type 'unicode'>
>>> str(unicode_str)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 0: ordinal not in range(128)
The recommended path is that you work with Unicode strings (which is the default string object in Python 3) all throughout your program, except at the point where you either read/receive data (where you should provide a suitable encoding scheme, so that you can decode the raw bytes) and at the point where you write/send data (again, where you use an encoding to encode the data as a series of bytes). This is called “the Unicode sandwich”.
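A minimal sketch of that "Unicode sandwich" (the file names and the UTF-8 encoding are made up for illustration):
# decode at the boundary where bytes come in ...
with open("input.txt", "rb") as f:
    text = f.read().decode("utf-8")

# ... work with unicode/str objects everywhere in the middle ...
cleaned = text.replace('"', '')

# ... and encode again only where bytes go out
with open("output.txt", "wb") as f:
    f.write(cleaned.encode("utf-8"))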
Many libraries, including Spark, already decode bytes and encode unicode strings for you. If you simply remove the call to str in your user defined function, your code will likely work:
#pyspark shell using Python 2.7
>>> spark.sparkContext.setLogLevel("OFF") # to hide the big Py4J traceback that is dumped to the console, without modifying the log4j.properties file
>>> from py4j.protocol import Py4JJavaError
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> df = spark.read.csv("your_file.csv", sep="|")
>>> def strip_double_quotes_after_str_conversion(s):
... import re
... return re.sub('"', '', str(s))
...
>>> def strip_double_quotes_without_str_conversion(s):
... import re
... return re.sub('"', '', s)
...
>>> df.select(*[udf(strip_double_quotes_without_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
| _c0| _c1| _c2| _c3| _c4| _c5|_c6| _c7| _c8|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
|037/02|TH68150|Aggravated vehicl...|Contrary to secti...|Theft of motor ve...|Vehicle offences| Y|Aggravated Vehicl...|37.2|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
>>> try:
... df.select(*[udf(strip_double_quotes_after_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
... except Py4JJavaError as e:
... print("That failed. Root cause: %s" % e.java_exception.getCause().getMessage().rsplit("\n", 2)[-2])
...
That failed. Root cause: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 78: ordinal not in range(128)
So, the solution to the experienced problem is simple: don’t use str in your UDF.
Note that Python 2.x will no longer be maintained as of January 1st 2020. You’d do well transitioning to Python 3.x before that. In fact, had you executed this in a Python 3 interpreter, you would not have experienced the issue at all.

How to use Python 3 to import Excel files with CSV extensions saved as Unicode Text?

As output from another service, I have received many Excel files that have a
".csv" filename extension but are saved as "Unicode Text (*.txt)".
To my knowledge they don't have any Unicode characters, so I am not worried about data loss, and if I were producing the data I would not have saved it this way. However, I need to process hundreds of these files and I have been unable to import them using Python (specifically Python 3).
I have tried many different options, such as the csv module, pandas.read_csv(), and pyexcel.get_sheet(), to import the Excel file directly, without any success. The attempts often fail with errors such as
"... can't decode byte 0xff in position 0: invalid start byte".
I can manually save the file in Excel with a ".csv" extension and a "CSV (Comma delimited)(*.csv)" file type which can then be imported (e.g. pyexcel.get_sheet() ), but can't figure out how to do this programmatically.
I can also manually open the original file in Notepad and save it as a text file with a ".txt" extension and ANSI encoding, which allows me to import the data using numpy.loadtxt(). This isn't ideal because it is also manual. Additionally, I don't know why I need to convert to ANSI encoding and can't use UTF-8 for it to be read using
numpy.loadtxt(file_name, encoding="UTF-8"),
which results in an error such as
"... ValueError: could not convert string to float: '\ufeff ..."
and the following error using just numpy.loadtxt(file_name)
"... ValueError: could not convert string to float: ' ..."
In summary, my main goal is to find a programmatic way to change the Excel file type/encoding to something I can import into Python 3 using existing packages with CSV support. Additionally, if someone has any idea why numpy.loadtxt can't import a text file with a UTF-8 encoding (but can for ANSI encoding) that would be great too! Any knowledge to help me understand the problem (or my misconception of the problem) is appreciated.
Your file has a UTF-8 BOM at the front. Python can strip this automatically with the utf_8_sig codec:
numpy.loadtxt(file_name, encoding="utf_8_sig")
Excel's "Unicode Text" is UTF-16-encoded and tab-delimited. With the csv module, use:
import csv
with open('book1.txt', encoding='utf16') as f:
    r = csv.reader(f, delimiter='\t')
    for row in r:
        print(row)
With an Excel sheet saved as "Unicode Text" from Excel 2016 (two columns, English and Chinese), this produces (Python 3.6):
['English', 'Chinese']
['Mark', '马克']
['American', '美国人']
Pandas also works (but not with utf16...it oddly needed the hyphen):
>>> import pandas as pd
>>> pd.read_csv('book1.txt',delimiter='\t',encoding='utf-16')
    English Chinese
0      Mark      马克
1  American     美国人
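Since the main goal is a programmatic conversion, here is a minimal sketch that batch-converts the UTF-16 tab-delimited files to plain UTF-8 CSVs (the "incoming" directory and the "_utf8.csv" naming are made up; adjust to your layout):
import csv
from pathlib import Path

for src in Path("incoming").glob("*.csv"):            # hypothetical folder of received files
    dst = src.with_name(src.stem + "_utf8.csv")
    with open(src, encoding="utf16", newline="") as fin, \
         open(dst, "w", encoding="utf8", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin, delimiter="\t"):    # Excel "Unicode Text" is tab-delimited
            writer.writerow(row)
After that, pandas.read_csv() or pyexcel.get_sheet() should read the converted files without any special encoding handling.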
Default dtype is float. genfromtxt uses nan when it can't parse the string:
In [8]: np.genfromtxt(['one, 2, 3'], delimiter=',',encoding=None)
Out[8]: array([nan, 2., 3.])
loadtxt raises an error:
In [9]: np.loadtxt(['one, 2, 3'], delimiter=',',encoding=None)
....
ValueError: could not convert string to float: 'one'
In [10]: np.loadtxt(['1, 2, 3'], delimiter=',',encoding=None)
Out[10]: array([1., 2., 3.])
genfromtxt with dtype=None:
In [12]: np.genfromtxt(['one, 2, 3'], delimiter=',',encoding=None,dtype=None)
Out[12]: array(('one', 2, 3), dtype=[('f0', '<U3'), ('f1', '<i8'), ('f2', '<i8')])
from your comment:
In [13]: np.genfromtxt('1\t2\n3\t4\n'.splitlines(), delimiter='\t', encoding=None)
Out[13]:
array([[1., 2.],
       [3., 4.]])
In [21]: print('1\t2\n3\t4\n')
1 2
3 4
Using encoding='utf8' works as well.

How to display a Chinese character in code page 65001 in Python?

I am on Win7 + Python 3.3.
import os
os.system("chcp 936")
fh=open("test.ch","w",encoding="utf-8")
fh.write("你")
fh.close()
os.system("chcp 65001")
fh=open("test.ch","r",encoding="utf-8").read()
print(fh)
Äã
>>> print(fh.encode("utf-8"))
b'\xe4\xbd\xa0'
How can I display the Chinese character 你 in code page 65001?
If your terminal is capable of displaying the character directly (which it may not be due to font issues) then it should Just Work(tm).
>>> hex(65001)
'0xfde9'
>>> u"\ufde9"
'\ufde9'
>>> print(u"\ufde9")
﷩
To avoid the use of literals, note that in Python 3, at least, the chr() function will take a code point and return the associated Unicode character. So this works too, avoiding the need to do hex conversions.
>>> print(chr(65001))
﷩
