I've got a slightly annoying issue: I'm trying to write a series of JSON data to a text document, but Python raises a UnicodeEncodeError whenever it encounters characters such as ♥.
Since the big update with Python 3, these characters print to the console just fine; the issue only appears when writing to a file:
with open("filename.txt", "a") as file:
    file.write("I ♥ ice cream")
As I'm still a newbie to Python, I haven't the slightest clue how to solve this. Any ideas?
Found out how to solve this one!
First off, I'd like to thank @JJJ for pointing me down the right track. My only criticism is that the presented solution wasn't very straightforward, and for someone with no knowledge of the difference between bytes and strings it may present quite a challenge.
Basically, the problem was the default encoding my computer uses (the OS being standard Windows 10), which is cp1252.
Running a simple bit of code in Python illustrates the issue more clearly and points to a viable solution:
text = "I ♥ IceCream"
text = text.encode("cp1252")
open('People Jobs.txt','ab').write(text)  # never reached: encode() above raises first
Running this in IDLE, we get this:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2665' in position 2: character maps to <undefined>
Ah! Now we can see our issue! The codec can't encode the character! Knowing this we can encode the string using utf-8 before writing it to the file like so:
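text = "I ♥ IceCream"
text = text.encode("utf-8")
open('People Jobs.txt','ab').write(text)  # 'ab': bytes need a file opened in binary mode
After encoding, text holds:
b'I \xe2\x99\xa5 IceCream'
These bytes can be written to the file without trouble, provided the file is opened in binary mode ('ab'); writing bytes to a text-mode file raises a TypeError in Python 3. We can turn them back into the original message using the decode method, but for my purposes we don't need to do that.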
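As an alternative to encoding by hand, here is a minimal sketch that lets open() do the encoding, assuming Python 3. It is not the method used in the answer above. The temporary directory and the name demo_path are made up for the demo.

```python
import os
import tempfile

# Hypothetical location in a temporary directory, used only for this demo.
demo_path = os.path.join(tempfile.mkdtemp(), "People Jobs.txt")

text = "I ♥ IceCream"

# Naming the encoding explicitly means the platform default
# (cp1252 on many Windows setups) is never consulted.
with open(demo_path, "a", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm the round trip.
with open(demo_path, encoding="utf-8") as f:
    read_back = f.read()

print(read_back == text)
```

With this form the string stays a str throughout, so there is no bytes object to decode later.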
Once again, I'd like to extend my thanks to those who commented on my post. Your extensive knowledge of the Python language is quite the asset and I greatly appreciate it.
Hopefully my failure to see these simple programming principles will benefit others when they come to laugh at this post.
But hey that's why I have the name!
So until next time,
Mr Incompetent
P.S. @Pratik K, thank you for the reminder of how to write this in a more compact manner, I appreciate it :) (I've been doing C++ for a while, so I've forgotten some Python.)
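I tried it out and it seems to work just fine. As a side note, instead of using with ..., you can also write this as open(filename, 'a').write(string). This opens the file and writes or appends to it in one line. Note that the file is then only closed when the file object is cleaned up (in CPython that usually happens right away), whereas the with form guarantees it.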
Just to be clear, the syntax for your solution would be:
with open("filename.txt", 'a') as file:
    file.write("I ♥ ice cream")
What does a u in front of a string literal mean? Like in:
u'Hello'
My guess is that it indicates "Unicode", is that correct?
If so, since when has it been available?
You're right, see 3.1.3. Unicode Strings.
It's been the syntax since Python 2.0.
Python 3 made them redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed them, but they were re-added in 3.3+ for compatibility with Python 2, to aid the 2 to 3 transition.
The u in u'Some String' means that your string is a Unicode string.
Q: I'm in a terrible, awful hurry and I landed here from Google Search. I'm trying to write this data to a file, I'm getting an error, and I need the dead simplest, probably flawed, solution this second.
A: You should really read Joel's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" essay on character sets.
Q: sry no time code pls
A: Fine. try str('Some String') or 'Some String'.encode('ascii', 'ignore'). But you should really read some of the answers and discussion on Converting a Unicode string and this excellent, excellent, primer on character encoding.
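To show why that quick fix is "probably flawed", here is a small sketch of what the ignore error handler does to a string containing a non-ASCII character. The sample text is borrowed from the earlier question; the replace variant is added for comparison only.

```python
text = "I ♥ ice cream"

# errors="ignore" silently drops every character ASCII cannot represent.
ascii_bytes = text.encode("ascii", "ignore")
print(ascii_bytes)

# errors="replace" substitutes "?" instead, so the loss at least stays visible.
replaced_bytes = text.encode("ascii", "replace")
print(replaced_bytes)
```

The heart is gone from the first result without any warning, which is the price of the dead-simple approach.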
My guess is that it indicates "Unicode", is it correct?
Yes.
If so, since when is it available?
Python 2.x.
In Python 3.x the strings use Unicode by default and there's no need for the u prefix. Note: in Python 3.0-3.2, the u is a syntax error. In Python 3.3+ it's legal again to make it easier to write 2/3 compatible apps.
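A quick way to check this for yourself, assuming Python 3.3 or later (the variable names are only for illustration):

```python
# In Python 3.3+ the u prefix is accepted but changes nothing.
plain = 'Hello'
prefixed = u'Hello'

same_type = type(plain) is type(prefixed)
print(same_type, plain == prefixed)

# A b prefix, by contrast, produces a bytes object rather than a str.
raw = b'Hello'
print(type(raw).__name__)
```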
I came here because I had funny-char-syndrome on my requests output. I thought response.text would give me a properly decoded string, but in the output I found funny double-chars where German umlauts should have been.
Turns out response.encoding was empty somehow and so response did not know how to properly decode the content and just treated it as ASCII (I guess).
My solution was to get the raw bytes with 'response.content' and manually apply decode('utf_8') to it. The result was schöne Umlaute.
The correctly decoded
für
vs. the improperly decoded
fĂźr
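The effect can be reproduced without any network access. This is a minimal sketch in which raw stands in for response.content. ISO-8859-2 is picked only because it produces the garbling shown above; which codec was actually applied in my case is an assumption.

```python
# Stand-in for response.content: the UTF-8 bytes a server might send.
raw = "für".encode("utf-8")

# Decoding with the wrong codec yields two characters where "ü" should be.
garbled = raw.decode("iso8859_2")

# Decoding with the codec the bytes were written in restores the text.
fixed = raw.decode("utf-8")

print(len(garbled), len(fixed))
```

With requests, the last step corresponds to response.content.decode("utf-8") as described above; setting response.encoding = "utf-8" before reading response.text should have the same effect.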
All strings meant for humans should use u"".
I found that the following mindset helps a lot when dealing with Python strings: All Python manifest strings should use the u"" syntax. The "" syntax is for byte arrays, only.
Before the bashing begins, let me explain. Most Python programs start out with using "" for strings. But then they need to support documentation off the Internet, so they start using "".decode and all of a sudden they are getting exceptions everywhere about decoding this and that - all because of the use of "" for strings. In this case, Unicode does act like a virus and will wreak havoc.
But, if you follow my rule, you won't have this infection (because you will already be infected).
I have an input file whose data I need to process. The file is in UTF-16 even though every single character in it is just a standard ascii character.
I can NOT change the input file so that it doesn't use useless double byte characters to represent 100% English language single character data. I need to convert this in python, on Windows. (Please, no non-python solutions, thank you).
I want my python program to act on these strings and output a file which is NOT double-byte. I just want standard ascii strings (one byte per character)
I've googled a lot, see all sorts of related questions, but not mine. I'm frustrated with not being able to solve this seemingly very simple question and need.
EDIT: Here is the program I got to work. It is absurd. There must be an easier way. The chr(10) references in the code are there because the input has lines and I couldn't find a non-absurd way to do simple readline/writeline calls.
with open('Unicode.txt','r') as input:
    with open('ASCII.txt','w') as output:
        for line in input.readlines():
            codelist=[code for code in line.encode('ascii','ignore') if code not in (0,10)]
            if codelist:
                output.write(''.join([chr(code) for code in codelist]+[chr(10)]))
Question solved after reading a hint from @Mark Ransom.
with open('unicode.txt','r',encoding='UTF-16') as input:
    with open('ascii.txt','w',encoding='ascii') as output:
        output.write(input.read())
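For anyone who wants to try this without an existing UTF-16 file, here is a self-contained sketch of the same idea. The temporary directory, the sample text and the names src and dst are made up for the demo, and it copies line by line rather than reading the whole file at once.

```python
import os
import tempfile

# Hypothetical working files in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "unicode.txt")
dst = os.path.join(workdir, "ascii.txt")

# Create a sample UTF-16 input that contains only ASCII characters.
with open(src, "w", encoding="utf-16") as f:
    f.write("first line\nsecond line\n")

# Decode on read, encode on write. Encoding to ASCII raises
# UnicodeEncodeError if a non-ASCII character ever shows up.
with open(src, "r", encoding="utf-16") as infile, \
        open(dst, "w", encoding="ascii") as outfile:
    for line in infile:
        outfile.write(line)

# Read the result back to confirm the text survived the conversion.
with open(dst, encoding="ascii") as f:
    converted = f.read()

print(os.path.getsize(src), os.path.getsize(dst))
```

Iterating over the file object keeps memory use flat for large inputs, and the output file comes out at roughly half the size of the input.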
EDIT: I tried the code from the linked similar questions, but those programs did not execute correctly.
I am a complete beginner working through some free online resources for self-improvement and learning. I am using the University of Waterloo's 'Python from scratch' and CS Circles courses. I have tried to answer this question and cannot seem to get it right:
Write a program that asks the user for a string and then prints the string in upper case.
I have tried:
print (str(input()).upper)
AS WELL AS
text = input()
print (text.upper)
AND
print(input().upper())
All programs run, but don't produce the correct output, so I don't know what I am missing here. It's likely obvious and I may feel foolish.
I would love to learn and move on. Thanks for any assistance!
This is 'Python from scratch' 2.11 problem 'g' (7th problem in the set).
You were very close; the following works:
input().upper()
so print(input().upper())
should work for you.
text=input()
print(text.upper())
print(input().upper())
This should have worked for you in Python 3.x
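The difference between the failing and the working attempts comes down to the parentheses after upper. Here is a small sketch, using a fixed string in place of input() so it runs without a prompt:

```python
text = "hello"  # stands in for the value returned by input()

# Without parentheses this is the method object itself, not its result,
# so printing it shows something like <built-in method upper of str ...>.
method = text.upper
print(callable(method))

# With parentheses the method is called and returns the new string.
shouted = text.upper()
print(shouted)
```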
I'm using Python 3.5 and import a text file as follows:
with open(fn) as f:
    data = f.read()
I then notice that there's a space between the minus sign and the digits of a negative number (e.g. \n\t- 2.51\t). I have tried to close the gap by writing
data.replace('- ','-'), but nothing happens. Oddly enough, this works like a charm in a Python console, but not in code. How can I solve this problem?
Is this a Unicode issue? Is it possible that the - I type on my keyboard is different from the - in the file? If so, how can I tell the two -'s apart?
Thanks in advance for your assistance
Thomas Philips
I made an elementary error, and wrote
data.replace('- ','-'),
when I should have written
data = data.replace('- ','-').
As soon as I did this, the problem solved itself.
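For anyone hitting the same thing: strings are immutable, so replace returns a new string instead of changing the old one. In an interactive console the returned value is echoed, which is probably why it looked like it worked there. A short sketch with a sample value taken from the question:

```python
data = "\n\t- 2.51\t"

# str.replace returns a new string; data itself is left untouched.
data.replace("- ", "-")
unchanged = data

# Rebinding the name keeps the result.
data = data.replace("- ", "-")

print(repr(unchanged), repr(data))
```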
I ran into an interesting issue when comparing two strings. I'm reading data from a file and everything works well. But then a co-worker sent me an input file that is just a copy-and-paste (CTRL+C, CTRL+V) of the working file, and then a miracle happened: VBA is so confused that it can't compare two simple strings, and I fell off my chair.
If you take a look at the image, you can see that the comparison passed the If condition even though the two strings are the same, which it should not. I'm a bit confused about how this can happen.
Has anyone run into something like this? I'm really starting to think about a machine revolution like in Terminator. (Both files are saved in Notepad++ and there are no strange characters or anything like that.)
Progress update
So I tried the hints from the comments below and ended up with something like this:
If CStr(Trim(rowArray(4))) <> (CStr("N/A")) Then
The content of rowArray(4) is still the "N/A" string, as in the picture above, and Excel still thinks these strings aren't the same. I also saved the file in PSPad, NetBeans, and plain Notepad, and the issue is still the same.
Use the immediate window to test the contents of the variable:
For i = 1 To Len(rowArray(4)): Print Asc(Mid(rowArray(4), i, 1)): Next
This will print the ASCII value of each character in the string; you can use this to determine which extra character(s) are causing the issue.
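The same diagnostic idea can be sketched in Python, for readers who want to inspect the file outside of VBA. The value of suspect is invented for illustration: a string that looks like "N/A" but ends with a no-break space. Nothing here is confirmed about the original file.

```python
import unicodedata

# Hypothetical value: renders like "N/A" but carries an invisible
# trailing no-break space (U+00A0).
suspect = "N/A\u00a0"

# Print the code point and Unicode name of every character.
for ch in suspect:
    print(ord(ch), unicodedata.name(ch, "UNKNOWN"))

codes = [ord(ch) for ch in suspect]
```

Any code outside the expected 78, 47, 65 for "N/A" points to the character that makes the comparison fail.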