How to encode Cyrillic characters in JSON - python-3.x

I want to read a JSON file containing Cyrillic symbols.
The Cyrillic symbols are represented as \uXXXX escape sequences.
Python keeps them as '\\uXXXX' literals instead of converting them to the Cyrillic symbols.
For example, the string "\u0420\u0435\u0433\u0438\u043e\u043d" should become "Регион", but instead becomes "\\u0420\\u0435\\u0433\\u0438\\u043e\\u043d".
encode() just makes the string look like u"..." or adds another backslash.
How do I convert "\u0420\u0435\u0433\u0438\u043e\u043d" to "Регион"?

If you want json to output a string that contains non-ASCII characters, you need to pass ensure_ascii=False and then encode the result manually afterward.

Just use the json module.
import json

s = "\u0420\u0435\u0433\u0438\u043e\u043d"

# Generate a JSON file (the default ensure_ascii=True writes \uXXXX escapes).
with open('test.json', 'w', encoding='ascii') as f:
    json.dump(s, f)

# Reading it directly shows the escaped form.
with open('test.json') as f:
    print(f.read())

# Reading with the json module decodes the escapes back to Cyrillic.
with open('test.json', encoding='ascii') as f:
    data = json.load(f)
print(data)
Output:
"\u0420\u0435\u0433\u0438\u043e\u043d"
Регион
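
If instead you want the file itself to contain the raw Cyrillic characters rather than \uXXXX escapes, here is a minimal sketch of the ensure_ascii=False route mentioned above (the file name test_utf8.json is arbitrary):

import json

s = "Регион"
# ensure_ascii=False writes the characters as-is, so the file must use
# an encoding that can represent them, such as UTF-8.
with open('test_utf8.json', 'w', encoding='utf-8') as f:
    json.dump(s, f, ensure_ascii=False)

with open('test_utf8.json', encoding='utf-8') as f:
    print(f.read())  # "Регион"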

Related

Parsing a non-Unicode string with Flask-RESTful

I have a webhook developed with Flask-RESTful which receives several parameters via POST.
One of the parameters is a non-Unicode string, encoded in cp1251.
I can't find a way to correctly parse this argument using reqparse.
Here is the fragment of my code:
parser = reqparse.RequestParser()
parser.add_argument('text')
msg = parser.parse_args()
Then, I write msg to a text file, and it looks like this:
{"text": "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd !\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\n\n-- \n\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd."}
As you can see, Flask somehow replaces all Cyrillic characters with \ufffd. At the same time, non-Cyrillic characters, like ! or \n, are processed correctly.
Is there anything I can do to tell RequestParser which encoding the string uses?
Here is my code for writing the text to disk:
f = open('log_msg.txt', 'w+')
f.write(json.dumps(msg))
f.close()
I tried f = open('log_msg.txt', 'w+', encoding='cp1251') with the same result.
Then, I tried
f = open('log_msg_ascii.txt', 'w+')
f.write(ascii(json.dumps(msg)))
Also, no difference.
So, I'm pretty sure it's that RequestParser() tries to be too smart and can't handle the non-Unicode input.
Thanks!
Okay, I finally found a workaround. Thanks to #lenz for helping me with this issue. It seems that reqparse wrongly assumes that every string parameter arrives as UTF-8, so when it sees a non-UTF-8 input field (among other, valid UTF-8 fields!) it tries to decode it as UTF-8 and fails. As a result, every character becomes U+FFFD (the replacement character).
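
To see why every character becomes U+FFFD, here is a small reproduction of the effect (not Flask's actual code path, just the same failure mode): cp1251-encoded Cyrillic bytes are not valid UTF-8, so a decoder that substitutes errors replaces each offending byte with the replacement character.

data = 'Привет!'.encode('cp1251')
print(data.decode('utf-8', errors='replace'))  # '������!' (six U+FFFD plus the '!')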
So, to access that non-Unicode field, I did the following trick.
First, I load the raw data using get_data(), decode it as cp1251, and parse it with a simple regexp.
import re
from flask import request

raw_data = request.get_data()
contents = raw_data.decode('windows-1251')
# Extract the "text" field from the multipart form-data body by hand.
match = re.search(r'(?P<delim>--\w+\r?\n)Content-Disposition: form-data; name="text"\r?\n(.*?)(?P=delim)', contents, re.MULTILINE | re.DOTALL)
text = match.group(2)
Not the most beautiful solution, but it works.

base64.encodebytes fails to insert newline chars

I must be missing something obvious. The function below successfully generates a base64-encoded string from an image file, but according to the docs I expected it to have newlines every 76 characters.
import base64

def generateBase64(filein):
    # Read the whole file as bytes and return the base64-encoded bytes.
    with open(filein, 'rb') as f:
        return base64.encodebytes(f.read())
Calling it on an image file (.png) like this: print(generateBase64(imgpath)) just prints one long string. What am I doing wrong?
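
For what it's worth, encodebytes does insert the newlines; print() on a bytes object shows its repr, in which each newline appears as a literal \n. A quick check, using an arbitrary 100-byte payload as a stand-in for the image data:

import base64

encoded = base64.encodebytes(b'\x00' * 100)
print(encoded.count(b'\n'))     # 2: one after the 76th character, one at the end
print(encoded.decode('ascii'))  # decoding to str makes the line breaks actually print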

Parse string representation of binary loaded from CSV [duplicate]

I have used tweepy to store the text of tweets in a csv file using Python's csv.writer(), but I had to encode the text in utf-8 before storing it, otherwise tweepy throws a weird error.
Now, the text data is stored like this:
"b'Lorem Ipsum\xc2\xa0Assignment '"
I tried to decode this using this code (there is more data in other columns, text is in 3rd column):
import csv

with open('data.csv', 'rt', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row[3])
But it doesn't decode the text. I cannot use .decode('utf-8'), because the csv reader reads the data as strings, i.e. type(row[3]) is str, and when I try to convert it to bytes the data just gets encoded once more!
How can I decode the text data?
Edit: Here's a sample line from the csv file:
67783591545656656999,3415844,1450443669.0,b'Virginia School District Closes After Backlash Over Arabic Assignment: The Augusta County school district in\xe2\x80\xa6 | #abcde',52,18
Note: If the solution is in the encoding process, please note that I cannot afford to download the entire data again.
The easiest way is as below. Try it out.
import csv
from io import StringIO

# Decode the bytes to str first, then wrap the text in a file-like
# object that csv.reader can consume.
byte_content = b"iam byte content"
content = byte_content.decode()
file = StringIO(content)
csv_data = csv.reader(file, delimiter=",")
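
For completeness, consuming the resulting reader (the sample content above yields a single one-field row):

for row in csv_data:
    print(row)  # ['iam byte content']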
If your input file really contains strings with Python-syntax b prefixes on them, one way to work around it (even though it's not really a valid format for csv data to contain) would be to use Python's ast.literal_eval() function, as #Ry suggested, although I would use it in a slightly different manner, as shown below.
This provides a safe way to parse strings in the file which are prefixed with a b, indicating they are byte-strings; the rest are passed through unchanged.
Note that this doesn't require reading the entire CSV file into memory.
import ast
import csv

def _parse_bytes(field):
    """Convert a string in Python byte-string literal b'' syntax into a
    decoded character string; return anything else unchanged.
    """
    try:
        result = ast.literal_eval(field)
    except (SyntaxError, ValueError):
        return field  # not a Python literal at all, pass it through as-is
    return result.decode() if isinstance(result, bytes) else result

def my_csv_reader(filename, /, **kwargs):
    with open(filename, 'r', newline='') as file:
        for row in csv.reader(file, **kwargs):
            yield [_parse_bytes(field) for field in row]

reader = my_csv_reader('bytes_data.csv', delimiter=',')
for row in reader:
    print(row)
You can use ast.literal_eval to convert the incorrect fields back to bytes safely:
import ast

def _parse_bytes(bytes_repr):
    result = ast.literal_eval(bytes_repr)
    if not isinstance(result, bytes):
        raise ValueError("Malformed bytes repr")
    return result
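
As a quick usage check against the sample field quoted in the question:

field = "b'Lorem Ipsum\\xc2\\xa0Assignment '"
raw = _parse_bytes(field)   # b'Lorem Ipsum\xc2\xa0Assignment '
print(raw.decode('utf-8'))  # 'Lorem Ipsum\xa0Assignment ' (\xc2\xa0 decodes to a no-break space)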

Decode a Python string

Sorry for the generic title.
I am receiving a string from an external source: txt = external_func()
I am copying/pasting the output of various commands to make sure you see what I'm talking about:
In [163]: txt
Out[163]: '\\xc3\\xa0 voir\\n'
In [164]: print(txt)
\xc3\xa0 voir\n
In [165]: repr(txt)
Out[165]: "'\\\\xc3\\\\xa0 voir\\\\n'"
I am trying to transform that text to UTF-8 (?) to have txt = "à voir\n", and I can't see how.
How can I do transformations on this variable?
You can encode your txt to a bytes-like object using the encode method of the str class.
Then this bytes-like object can be decoded again with the unicode_escape codec.
At that point the escape sequences are parsed, but each byte of the original UTF-8 data has been mapped to a separate Latin-1 character, so you still have to encode the result with latin-1 and then decode it again with utf-8.
>>> txt = '\\xc3\\xa0 voir\\n'
>>> txt.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')
'à voir\n'
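Broken down step by step, with the intermediate value each stage produces shown in the comments:

txt = '\\xc3\\xa0 voir\\n'
step1 = txt.encode('utf-8')             # b'\\xc3\\xa0 voir\\n', still literal backslashes
step2 = step1.decode('unicode_escape')  # 'Ã\xa0 voir\n', escapes parsed, each byte read as Latin-1
step3 = step2.encode('latin-1')         # b'\xc3\xa0 voir\n', the real UTF-8 byte sequence
print(step3.decode('utf-8'))            # à voir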
The codecs module also has an undocumented function called escape_decode:
>>> import codecs
>>> codecs.escape_decode(bytes('\\xc3\\xa0 voir\\n', 'utf-8'))[0].decode('utf-8')
'à voir\n'

Python: Write to file diacritical marks as escape character sequence

I read text lines from an input file, and after cutting them up I have strings like these:
-pokaż wszystko-
–ყველას გამოჩენა–
and I must write something like this to another file:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
My python script start that:
file_input = open('input.txt', 'r', encoding='utf-8')
file_output = open('output.txt', 'w', encoding='utf-8')
Unfortunately, what gets written to the file is not what is expected.
I got a tip about why I have to change it, but I can't figure out the conversion:
With diacritic marks saved in UTF-8 ("-pokaż wszystko-"), it works correctly only if NLS_LANG = AMERICAN_AMERICA.AL32UTF8.
If the output file has the diacritics saved in escaped form ("-poka\017C wszystko-"), the script works correctly for any NLS_LANG setting.
A Python 3.6 solution: escape every character outside the ASCII range and leave the rest unchanged:
# coding: utf8
s = ['-pokaż wszystko-', '–ყველას გამოჩენა–']

def convert(s):
    # Keep ASCII characters; escape everything else as \XXXX (uppercase hex code point).
    return ''.join(x if ord(x) < 128 else f'\\{ord(x):04X}' for x in s)

for t in s:
    print(convert(t))
Output:
-poka\017C wszystko-
\2013\10E7\10D5\10D4\10DA\10D0\10E1 \10D2\10D0\10DB\10DD\10E9\10D4\10DC\10D0\2013
Note: I don't know if or how you want to handle Unicode characters outside the basic multilingual plane (BMP, code points above U+FFFF), but this code probably won't handle them. More information about your escape-sequence requirements would be needed.
