How to fix strange encoding characters like ’, Â, � in Python - python-3.x

I tried to retrieve data from Google+ using its API. When I wrote the data into a csv file, I observed weird and strange characters like 😀😄😚😉😠’
After googling, I concluded this is an encoding issue.
To write the retrieved data to a file, I used the following code:
import csv

file = open('filename', 'a', encoding='utf-8')
writer = csv.writer(file)
writer.writerow(values)
To check the default encoding, I used:
import sys
sys.getdefaultencoding()
The output is utf-8.
I don't know where the problem is.
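A side note worth checking before blaming the encoding of the file itself: if the rows look correct in a UTF-8-aware text editor but garbled in Excel, the file is fine. Excel assumes a legacy codepage unless the file starts with a BOM, so writing with the utf-8-sig codec often makes the same data display correctly there. A minimal sketch (the sample values are placeholders):

import csv

# newline='' is the csv-module idiom; 'utf-8-sig' prepends a BOM so that
# Excel detects UTF-8 instead of guessing a legacy codepage.
with open('filename', 'a', encoding='utf-8-sig', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['😀', '’'])  # placeholder values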

Your minimal, reproducible example appears too minimal to be complete and verifiable. In any case, it looks like double mojibake:
value = "‘😀😄😚😉😠’"  # gathered from the question
print(value.encode('cp1252', 'backslashreplace').decode('utf-8', 'backslashreplace'))
‘😀😄😚😉😠’
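More generally, each layer of mojibake can usually be undone by re-encoding with the codec that was wrongly applied and re-decoding as UTF-8. A minimal sketch, assuming cp1252 was the wrong codec and the text was corrupted twice; both defaults are assumptions about how this particular text was mangled:

def fix_mojibake(text, passes=2, wrong_codec='cp1252'):
    # Undo one mojibake layer per pass.
    for _ in range(passes):
        try:
            text = text.encode(wrong_codec).decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # stop once a pass no longer round-trips cleanly
    return text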

Related

Spark encoding issue while reading a csv with the multiline=true option

I am stuck on an issue while trying to read, in Spark with the multiline=true option, a csv file that contains characters like Ř and Á. The csv is read in utf-8 format, but when we read the data with multiline=true we get characters that are not equivalent to the ones in the file. We get something like ŘÃ�, so a word stored as ZŘÁKO comes back as ZŘÃ�KO. I went through several other questions on Stack Overflow around the same issue, but none of the solutions actually works!
I tried the following encodings during the read/write operations: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', 'SJIS' and a couple more, but none of them gave the expected result. Yet multiline=false somehow generates the correct output.
I cannot read/write the file as text, because the project's current policy is built around an ingestion framework where we read the file only once and everything after that is expected to happen in-memory, and I must use multiline=true.
I would really appreciate any thoughts on this matter. Thank you!
sample data:
id|name
1|ZŘÁKO
df = spark.read.format('csv').option('header', True) \
    .option('delimiter', '|').option('multiline', True) \
    .option('encoding', 'utf-8').load()
df.show()
output:
1|Z�KO
# trying to force utf-8 encoding as below:
from pyspark.sql import functions as F
df = df.withColumn("name", F.encode("name", "utf-8"))
gives me this:
1|[22 5A c3..]
I tried the above steps with all the encodings supported by Spark.
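One way to narrow this down is to confirm, outside Spark, that the bytes on disk really are valid UTF-8; if they are, the corruption is likely coming from the multiline parsing path rather than from the file. A quick local check, assuming the file is small enough to read on the driver and using a hypothetical sample.csv path:

with open('sample.csv', 'rb') as f:  # hypothetical path to the csv above
    raw = f.read()
print(raw[:80])                  # inspect the raw bytes directly
print(raw.decode('utf-8')[:80])  # raises UnicodeDecodeError if not UTF-8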

Reading from a binary file and decoding using Python

I have a binary file from a mainframe which I'm trying to read using Python to produce a human-readable text file. I'm still gathering more information about the file. What I do know is that the file serves as input to COBOL programs.
I try to read the file into Python like this:
with open('P_MF.DAT', mode='rb') as f:
    file_content = f.read(500)
When I print(file_content) I get something like:
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00########\x00\x00###\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0\xf0####\x00\x00\x00\x00\x00\x00\x00\x00###\x00\x00\x00\x00######\xf0\xf0\xf0\xf4\x00\x00\x00\x00\x08\x02\x00\x00Q\x08c\x18\x1f\xc5###\x00\x00\x000\x00\x00\x0f\x00\x00\x00\x01\x11?\x00\x00\x10\x02F\x17o##\xd5#\xc9\xd5\xc7\xd3\xc9\xe2#\xd4\xc1\xd9\xe8#############################\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00'
Then I tried the codecs module, which also gives me gibberish:
import codecs
file_content1 = codecs.decode(file_content, 'cp500')
But I can see a few readable characters in the output when I print(file_content1):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00 000000000000000 \x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00 0004\x00\x00\x00\x00\x97\x02\x00\x00é\x97Ä\x18\x1fE \x00\x00\x00\x90\x00\x00\x0f\x00\x00\x00\x01\x11\x1a\x00\x00\x10\x02ã\x87? N INGLIS MARY \x00\x00\x00\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00'
I've been googling around for a couple of days and tried a number of things, like this: Python read a binary file and decode.
I feel like I'm getting nowhere with this problem. I also plan to ask how this file looks if read in a mainframe. I'd appreciate any info/help/advice at this point.
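For what it's worth, the readable fragment (N INGLIS MARY) that cp500 produces suggests the file mixes EBCDIC text fields with binary (COMP) and packed-decimal (COMP-3) fields, which is typical for COBOL data sets, so decoding the whole record as cp500 will never give clean text. A minimal sketch with hypothetical offsets; the real field positions would come from the COBOL copybook that describes this file:

import codecs

with open('P_MF.DAT', mode='rb') as f:
    record = f.read(500)

# Decode only the slices the copybook marks as text (PIC X); binary and
# packed-decimal fields must be unpacked separately, not decoded as cp500.
name_field = record[130:160]  # hypothetical offsets for the name field
print(codecs.decode(name_field, 'cp500'))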

Read .csv that contains commas

I have a .csv file that contains multiple columns with texts in it. These texts contain commas, which makes things messy when I try to read the file into Python.
When I tried:
import pandas as pd
directory = 'some directory'
dataset = pd.read_csv(directory)
I got the following error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 42, saw 5
After doing some research, I found the clevercsv package.
So, I ran:
import clevercsv as csv
dataset = csv.read_csv(directory)
Running this, I got the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4359705: character maps to <undefined>
To overcome this, I tried:
dataset = csv.read_csv(directory, encoding="utf8")
However, 10 hours later my computer was still working on reading it. So I expect that something went wrong there.
Furthermore, when I open the file in Excel, it splits the cells correctly. Therefore, I tried saving the .csv file as a .xlsx and then saving it back to .csv from Python with an uncommon delimiter ('~'). However, when I save the .csv file as a .xlsx file, the file gets smaller, which suggests that only part of the file is saved, and that is not what I want.
Lastly, I have tried the solutions given here and here. But neither seem to work for my problem.
Given that Excel reads in the file without problems, I do expect that it should be possible to read it into Python as well. Who can help me with this?
UPDATE:
When using dataset = pd.read_csv(directory, sep=',', error_bad_lines=False) I manage to read in the .csv, but many lines are skipped. Is there a better way to do this?
pandas should work: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Did you try something like dataset = pd.read_csv(directory, sep=',', header=None)?
Regards
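Since Excel splits the cells correctly, the comma-containing fields are probably quoted, and the UnicodeDecodeError points at a non-UTF-8 byte (0x8f). A sketch of options worth trying together; the encoding below is an assumption to be verified against the real file, and on_bad_lines needs pandas 1.3 or later:

import pandas as pd

dataset = pd.read_csv(
    directory,
    sep=',',
    quotechar='"',        # honour quotes around fields that contain commas
    encoding='latin-1',   # maps every byte, so decoding never fails;
                          # verify the true codec (e.g. cp1252) separately
    engine='python',      # slower, but stricter about quoting rules
    on_bad_lines='warn',  # report malformed rows instead of silently failing
)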

PyPDF2 encoding issues

I'm having some trouble identifying why the output doesn't match the text in the PDF when extracting it, and whether there are any tricks I could use to fix this, as it's not an isolated issue.
import PyPDF2

with open(file, 'rb') as f:
    binary = PyPDF2.pdf.PdfFileReader(f)
    text = binary.getPage(x).extractText()
    print(text)
file: "I/O filters, 292–293"
output: "I/O Þlters, 292Ð293"
The Ð seems to stand in for every '–' and Þ for every 'fi'.
I am using Windows CMD for output while testing, and I know some characters don't show up right there, but that still leaves me baffled by something like the 'fi'.
The text extraction of PyPDF2 was massively improved in versions 2.x. The whole project moved to pypdf.
I recommend you give it another try: https://pypdf.readthedocs.io/en/latest/user/extract-text.html
from pypdf import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
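If upgrading isn't immediately possible, the specific garbling above is recognisable: 0xDE is the fi ligature and 0xD0 an en dash in Mac OS Roman, which become Þ and Ð when misread as Latin-1. A hedged stopgap, assuming that is the only corruption in play, is to round-trip the extracted text:

# Re-encode the misread characters back to bytes, then decode them with
# the codec the PDF font actually used (Mac OS Roman here).
repaired = "I/O Þlters, 292Ð293".encode('latin-1').decode('mac_roman')
print(repaired)  # I/O ﬁlters, 292–293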

Python 3: writing to a file, I don't know which encoding to use?

I want to write every character to a file:
import sys

aa = 0
i = 100 / 2750
with open('chr.txt', 'w', encoding='utf-8') as f:
    for a, b, c, d in zip(range(0, 27500), range(27500, 55000),
                          range(55000, 82500), range(82500, 110000)):
        f.write(chr(a))
        f.write(chr(b))
        f.write(chr(c))
        f.write(chr(d))
        aa += 1
        sys.stdout.write('\r' + str(round(i * aa, 2)) + '%')
    f.write(str(chr(110000)))
The problem is that with the above code only some of the characters get written.
So which encoding should I use?
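A likely explanation rather than a definitive answer: code points U+D800 through U+DFFF (55296 to 57343, which fall inside the ranges above) are UTF-16 surrogates and cannot be encoded in UTF-8 or any other standard codec, so the loop stops with a UnicodeEncodeError partway through. Skipping them lets every remaining character be written, whatever encoding you pick:

with open('chr.txt', 'w', encoding='utf-8') as f:
    for cp in range(110001):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not encodable
        f.write(chr(cp))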
