Arabic text replaced with escape sequences when creating CSV files in Python 3

I am trying to create a CSV file that contains Arabic tweets collected with tweepy for a project I am doing. Gathering the data works fine; however, when I write to the CSV file, all Arabic text is escaped with \xXX sequences, as follows:
b'#\xd8\xa7\xd9\x84\xd9\x8a\xd9\x88\xd9\x85_\xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85\xd9\x8a_\xd9\x84\xd9\x84\xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd9\x87_2017 \xd8\xa7\xd9\x84\xd8\xa5\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd8\xad\xd9\x82\xd9\x8a\xd9\x82\xd9\x8a\xd8\xa9 \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9 \xd8\xa7\xd9\x84\xd9\x81\xd9\x83\xd8\xb1 \xd9\x88\xd9\x84\xd9\x8a\xd8\xb3\xd8\xaa \xd8\xa7\xd8\xb9\xd8\xa7\xd9\x82\xd8\xa9
I looked at many previously asked questions, and all I could find were suggestions for Python 2 or answers similar to what I already have. When I was creating JSON files instead, I used ensure_ascii=False, but I couldn't find anything similar for CSV. Below is my code:
with codecs.open('tweets.csv', 'a', encoding='utf-8') as file:
    fieldnames = ['tweet', 'country']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    data = {'tweet': status.text, 'country': status.place.full_name}
    writer.writerow(data)
I tried adding .encoding='utf-8' to status.text and status.place as well, but that also didn't work. Any suggestions?

You have to make sure the Arabic text is decoded into a str before you write it. Assuming status.text is of type bytes, you should write text = status.text.decode('utf-8'). (Maybe you have to do the same for status.place.full_name.) If it is of type str instead, it won't have a decode() method; but a str is what you should be writing anyway to avoid escape sequences in your file.
Specifying an encoding of 'utf-8' for a bytes object (like the one you presumably have) won't work, because the text is already UTF-8-encoded bytes. To get readable characters into the file, you must call decode() on the bytes object first; that way the characters, not their byte escapes, are written.

Related

Encoding issues related to Python and foreign languages

Here's a problem I am facing with encoding and decoding text.
I am trying to write code that finds a string (or byte sequence) in a file and returns the path of that file.
Since the files I am opening are encoded as 'windows-1252' (also known as 'cp1252'), I have been trying to:
1. encode my string into bytes corresponding to the encoding of the file
2. match the file and get the path of that file
I have a file, say 'f', encoded as 'windows-1252'/'cp1252'. It includes text in Chinese: '[跑Online農場]'
with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text))  # encoding() is a separate function that I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...
As you can see, the 'binary' text for [跑Online農場] is [\xb6]Online\xb9A\xb3\xf5]
However, the funny thing is that if I literally convert the string into bytes, I get:
enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>
On the other hand, opening the file using
with open(os.path.join(root, filename), mode='r', encoding='cp1252') as f ...
I get:
StaticText [¶]Online¹A³õ] €?‹ Œ î...
and I am not sure how '[跑Online農場]' would get 'translated' into '[¶]Online¹A³õ]'. An answer to that may also solve the problem.
What should I do to correctly 'encode' the Chinese/foreign characters so that they match the raw bytes that Python returns in 'rb' mode?
Thank you!
Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.
Note: guessing the encoding of a given byte string is always approximate.
There's no safe way of determining the encoding for sure.
If you have a byte string like
b'[\xb6]Online\xb9A\xb3\xf5]'
and you know it must translate (be decoded) into
'[跑Online農場]'
then what you can do is trial and error with a few codecs.
I did this with the list of codecs supported by Python, looking for codecs for Chinese.
When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'
When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:
>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'
So: use CP-950 for reading the file.

How to search for specific Unicode tokens in a list in Python

I want to search for a specific string in a file; the code is as follows:
import codecs

f1 = codecs.open('brokenhindi.txt', encoding='utf-8')
for tokens in f1:
    if u"राज्य" in tokens:
        print('done_3')
but it does not find the string (राज्य); if I replace राज्य with an English token, it finds it. I cannot find the error in the code.
Your code works fine as written.
It seems, though, that your script or your text file is ASCII-encoded rather than UTF-8.
Try saving both as UTF-8.

Python 3: how to write a string into a file in its entirety

I am a newbie in Python 3.
I have a question about writing a string into a file.
The string below is what I tried to write into a file:
ÀH \x10\x08\x81\x00 (in hex, c04820108810)
When I checked the file using the xxd command, I saw a difference between the string and the file contents:
00000000: c380 4820 1008 c281 00 ..H .....
This is the code I wrote:
s = 'ÀH \x10\x08\x81\x00'
with open('test', 'w') as f:
    f.write(s)
The question is: how can I write this string to the file in its entirety?
It seems that you want to write binary data. In that case, you should use the bytes type instead of str, as this gives you full control over the binary content of the sequence.
When dealing with strings, keep in mind that a Python str is Unicode text, so when you enter something like À, the encoding used when writing the file decides which bytes actually end up on disk. You can always encode() a string to look at the bytes it produces:
>>> 'ÀH \x10\x08\x81\x00'.encode()
b'\xc3\x80H \x10\x08\xc2\x81\x00'
You can convert this to hex using the binascii module for a more readable hex string of those bytes:
>>> import binascii
>>> binascii.hexlify('ÀH \x10\x08\x81\x00'.encode())
b'c38048201008c28100'
As you can see, this is the same as what was written to your file, so Python already does the correct thing; it's just that the input is not what you want it to be.
So instead, use a bytes string and write to the file in binary mode:
# use a bytes string
s = b'\xc0\x48\x20\x10\x88\x10'
# open the file in binary mode
with open('test', 'wb') as f:
    f.write(s)
By the way, if you look at the encoded string above, you can already see that you had a different encoding in mind than Python when you entered that string. You expected À to be 0xc0, which is somewhat correct, since that is its Latin-1 representation. But if you look up its other representations, you can see that in UTF-8, which is what Python used here by default, it is 0xc3 0x80 instead, which is again the value we got when encoding it in Python.
You have to set the source encoding to UTF-8 and also use a raw string, because you have \ escape characters. So add the coding declaration and put r before your string to make it raw:
# -*- coding: utf-8 -*-
s = r'ÀH \x10\x08\x81\x00'
with open('test.txt', 'w') as f:
    f.write(s)

Decoding/Encoding using sklearn load_files

I'm following the tutorial here
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb
to learn about machine learning and text.
In my case, I'm using tweets I downloaded, with positive and negative tweets in the exact same directory structure they are using (trying to learn sentiment analysis).
Here in the IPython Notebook I load my data just like they do:
tweets_train = load_files('Path to my training Tweets')
And then I try to fit them with CountVectorizer
vect = CountVectorizer().fit(text_train)
I get
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte
Is this because my tweets have all sorts of non-standard text in them? I didn't do any cleanup of my tweets (I assume there are libraries that help with that in order to make a bag of words work?).
EDIT:
The code I use, with Twython, to download the tweets:
from twython import Twython

def get_tweets(user):
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = user_timeline[0]['id']
    lis = [lis]
    for i in range(0, 16):  ## iterate through all tweets
        ## tweet extract method with the last list item as the max_id
        user_timeline = twitter.get_user_timeline(screen_name=user,
                                                  count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  ## append tweet ids
            text = str(tweet['text']).replace("'", "")
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()
You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding.
If this means nothing to you, make sure you understand the basics of Unicode and text encoding, e.g. with the official Python Unicode HOWTO.
First, you need to find out what encoding was used to store the tweets on disk.
When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the platform-dependent default encoding was used. Check this, for example, in an interactive session:
>>> f = open('/tmp/foo', 'a')
>>> f
<_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>
Here you can see that in my local environment the default encoding is set to UTF-8. You can also inspect this default directly; for file I/O it is whatever the locale reports as preferred:
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
There are other ways to find out what encoding was used for the files.
For example, the Unix tool file is pretty good at guessing the encoding of existing files, if you happen to be working on a Unix platform.
Once you think you know what encoding was used for writing the files, you can specify this in the load_files() function:
tweets_train = load_files('path to tweets', encoding='latin-1')
... in case you find out Latin-1 is the encoding that was used for the tweets; otherwise adjust accordingly.

Python 3 UTF-8 decoding/encoding problems with data hiding

I'm trying to take the text from a file (the text is Russian), hide it in an image, and then later be able to retrieve it from the image. However, I keep getting binascii.Error: Odd-length string when I try to retrieve the data from the image I hid it in.
I feel the problem may lie in how I prepare the text for hiding. When I do someString = file.read() on the file and print someString, everything comes out fine. But when I run:
file = open(<text file path>, 'r', encoding='utf-8')
entireText = file.read()
print(codecs.encode(entireText,'utf-8'))
I get the following:
b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xbf\xd0\xb5\xd0\xb2:\n\xd0\x9e\xd1\x87\xd0\xb8 \xd1\x87\xd1\x91\xd1\x80\xd0\xbd\xd1\x8b\xd0\xb5, \xd0\xbe\xd1\x87\xd0\xb8 \xd0
That is only a piece of it, but it shows the pattern: there are colons, spaces, commas, and \n characters throughout the bytes object, which is the type codecs.encode returns. If I use codecs to decode it, I get the original text back in perfect form.
If it helps, here are the functions I use to make it happen:
import binascii

def stringToBinary(msg):
    return bin(int(binascii.hexlify(msg.encode('utf-8')), 16))[2:]

def binaryToString(bNum):
    return binascii.unhexlify('%x' % (int('0b' + bNum, 2))).decode('utf-8')
If that is not enough, the entire file is here: http://pastebin.com/f541DpzS
EDIT: I think I'm getting that issue because the image I'm trying to hide the text in didn't have enough pixels for me to hide the complete message, so it was trying to convert the binary number to a string without all of the bits, thus throwing binascii.Error: Odd-length string.
