Python reading from non ascii file - io

I have a text file which contains the following character:
ÿ
When reading the file in, I've tried both:
with open (file, "r") as myfile:
AND
with codecs.open(file, encoding='utf-8') as myfile:
with success. However, when I try to read the file into a string using:
file_string=myfile.read()
OR
file_string=myfile.readline()
I keep getting this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 11889: invalid start byte
Ideally I want it to ignore the character or substitute it with '' or whitespace.

I've come up with a workaround: just use Python 2 instead of Python 3. I still can't seem to get it to work in Python 3, though.
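For Python 3, one possible approach is the errors argument of the built-in open(), which controls what happens to bytes that are invalid in the chosen encoding. The following is a minimal sketch, assuming the file is mostly UTF-8 with a few stray bytes such as 0xff; the helper name read_text_lenient and the sample file are hypothetical and only for illustration.

```python
import os
import tempfile


def read_text_lenient(path, encoding="utf-8", errors="replace"):
    """Read a text file without raising on bytes that are invalid in `encoding`.

    errors="replace" substitutes U+FFFD for each bad byte;
    errors="ignore" drops the bad bytes entirely.
    """
    with open(path, "r", encoding=encoding, errors=errors) as myfile:
        return myfile.read()


# Demo with a hypothetical sample file containing the stray byte 0xff.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as f:
        f.write(b"abc\xffdef")
    replaced = read_text_lenient(sample)                  # bad byte becomes U+FFFD
    ignored = read_text_lenient(sample, errors="ignore")  # bad byte is dropped
    print(repr(replaced), repr(ignored))
```

If the file was actually written as Latin-1 or cp1252 (where 0xff is ÿ), passing encoding='latin-1' instead would decode the character rather than discard it.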

Related

Can't upload a CSV file in Jupyter using Python

I have tried
def loadCsv(filename):
lines=csv.reader(open(r'D:xxxivateNLPSearchingMaterialImplementationprojectpid.csv'))
It gives:
IndentationError: expected an indented block
Second, I tried:
import pandas as pd
import os
os.chdir('D:\mxxx\NLP\SearchingMaterial\Implementation\project')
df = pd.read_csv('pid.csv')
print(df)
it gives
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 16-17: malformed \N character escape
This can occur with Windows paths, because they use the backslash \ as the separator and Python treats a backslash inside an ordinary string literal as the start of an escape sequence. Here the \N in \NLP is read as the start of a named Unicode escape (\N{...}), which is malformed, hence the unicodeescape error. To make it work, use two backslashes:
'D:\\mxxx\\NLP\\SearchingMaterial\\Implementation\\project'
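Doubling the backslashes is one option; a raw string or forward slashes should work as well. The sketch below only compares the spellings of the path from the question and does not touch the disk; the variable names are illustrative.

```python
from pathlib import PureWindowsPath

# Three spellings of the same Windows path from the question.
escaped = 'D:\\mxxx\\NLP\\SearchingMaterial\\Implementation\\project'
raw = r'D:\mxxx\NLP\SearchingMaterial\Implementation\project'  # raw string: backslashes are literal
forward = 'D:/mxxx/NLP/SearchingMaterial/Implementation/project'

# The escaped and raw literals are the same string.
same_string = (escaped == raw)

# pathlib treats forward and backward slashes as equivalent separators on Windows paths.
same_path = (PureWindowsPath(forward) == PureWindowsPath(raw))

print(same_string, same_path)
```

Any of the three forms can be passed to os.chdir or pd.read_csv on Windows.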

How can I download an image from my local directory?

I'm trying to crawl files on my own PC.
I want to get an image from an HTML document on my PC.
I tried this:
n=0
for i in soup.find_all('div', class_='c_img'):
    with open('FILE DIRECTORY', 'r', encoding='utf-8') as f:
        r=f.read()
    with open(str(n)+'.jpg', 'wb', encoding='utf-8') as f:
        f.write(r)
    n+=1
And I got:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 5: invalid continuation byte
So I tried encoding='utf-16', but it threw:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 44-45: illegal encoding
How can I make this work? Thanks.
I believe the issue arises because you're treating a .jpg as UTF-8 text; a JPEG is binary data, not encoded text.
You've posted only a small portion of your code, and I'm not sure what the rest does, but you should open the .jpg file with 'wb' and without specifying an encoding.
If your "FILE DIRECTORY" file contains the .jpg data, open it with 'rb', again with no encoding.
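The advice above can be sketched as a small binary copy. This is only an illustration, assuming the source file really holds the raw image bytes; the helper name copy_binary and its parameters are hypothetical, not part of the original code.

```python
def copy_binary(src_path, dst_path):
    """Copy a file byte-for-byte.

    Both files are opened in binary mode ('rb' / 'wb') with no encoding
    argument, so no text decoding or encoding is attempted.
    Returns the number of bytes written.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return len(data)
```

Inside the loop from the question, a call such as copy_binary(image_path, str(n) + '.jpg') would replace the two with blocks, where image_path stands for wherever the image actually lives on disk.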

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-7: ordinal not in range(128)

I would like to rename all the files in a certain directory. The old filename with its relative path is 'full_fname', and the new name after detoxing the filename is 'full_new_fname'. I am working in a Linux environment with Python 3.6, using a Jupyter notebook.
I use the following command to rename:
os.rename(full_fname,full_new_fname)
I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-7: ordinal not in range(128)
How can I make this work? Thanks
Try this and see if it works:
os.rename(full_fname.encode('U8'), full_new_fname.encode('U8'))
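A likely explanation, stated here as an assumption, is that Python picked ASCII as the filesystem encoding (for example because the notebook runs under a C/POSIX locale), so converting a str path containing non-ASCII characters fails. Passing bytes paths skips that conversion. The sketch below wraps the suggestion in a hypothetical helper, rename_utf8, and prints the filesystem encoding so the assumption can be checked.

```python
import os
import sys


def rename_utf8(old_name, new_name):
    """Rename a file using UTF-8 encoded bytes paths.

    os.rename accepts bytes as well as str; with bytes, Python does not
    apply the filesystem encoding to the names itself.
    """
    os.rename(old_name.encode("utf-8"), new_name.encode("utf-8"))


# If this prints 'ascii', the locale is the probable cause of the error.
print(sys.getfilesystemencoding())
```

If the encoding does turn out to be ASCII, setting a UTF-8 locale (for instance via the LANG or LC_ALL environment variables) before starting Jupyter may remove the need for the workaround.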

Python error while reading large files from a folder to copy to another file

I'm trying to read the files in a folder and copy a specific part of each file to a new file using the Python code below, but I'm getting the error shown after it.
import glob
file=glob.glob("C:/Users/prasanth/Desktop/project/prgms/rank_free1/*.txt")
fp=[]
for b in file:
    fp.append(open(b,'r'))
s1=''
for f in fp:
    d=f.read().split('\t')
    rank=d[0]
    appname=d[1]
    appid=d[2]
    s1=appid+'\n'
    file=open('C:/Users/prasanth/Desktop/project/prgms/appids_file.txt','a',encoding="utf-8")
    file.write(s1)
    file.close()
I'm getting the following error message:
Traceback (most recent call last):
  File "appids.py", line 8, in <module>
    d=f.read().split('\t')
  File "C:\Users\prasanth\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 12307: character maps to <undefined>
From what I can see, one of the files you are opening contains a byte (0x8f) that is not valid in the encoding Python is using to read it (cp1252, according to the traceback), so it can't be read into a string without the correct information about its encoding.
To handle this, you can open the file for reading in binary mode and deal with the problem in your script.
You could put d=f.read().split('\t') in a try/except block and reopen the file in binary mode in the except branch, then handle the undecodable bytes yourself.
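The try/except idea could look roughly like the sketch below. It assumes the files are meant to be UTF-8 (the original post does not say); the helper name read_text_with_fallback is hypothetical. When normal decoding fails, the file is reread in binary mode and decoded with errors='replace', so each undecodable byte becomes U+FFFD instead of raising.

```python
def read_text_with_fallback(path, encoding="utf-8"):
    """Read a text file, falling back to a lenient decode on bad bytes."""
    try:
        # First attempt: strict decoding with an explicit encoding
        # (rather than the platform default such as cp1252).
        with open(path, "r", encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        # Fallback: read the raw bytes and replace anything undecodable.
        with open(path, "rb") as f:
            raw = f.read()
        return raw.decode(encoding, errors="replace")
```

In the loop from the question, d = read_text_with_fallback(b).split('\t') would then stand in for opening the file and calling f.read() directly.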

Python opening a txt file converted from pdf

I downloaded the PDF file from http://icdept.cgaux.org/pdf_files/English-Italian-Glossary-Nautical-Terms.pdf and converted it to a txt file using pdf2txt (downloaded from iTunes). I am trying to convert the contents of the file to a searchable Python dictionary (I am studying for an Italian sailing licence).
I am using the following simply to test whether I can get the text into a format that I can parse:
with open('English-Italian-Glossary-Nautical-Terms1.txt', 'r') as out_file:
    with open("nautical_glossary.txt", 'w') as in_file:
        for line in out_file:
            in_file.write(line)
but constantly get an error:
Traceback (most recent call last):
  File "/Users/admin/Desktop/untitled folder/nautical.py", line 4, in <module>
    for line in out_file:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)
I would appreciate some help understanding the error and a suggestion for resolving the problem.
Can someone also suggest an obvious way to parse this particular file into a dictionary format?
This error tells you that the encoding of the file is not the one Python expected. See Wikipedia for background on character encodings. In other words, the ASCII codec doesn't know what byte 0xfe means.
You should find the correct encoding of the file and open the file with it. I suspect it is utf-8, but I could be wrong. Did you try opening the file to see what it looks like?
Try this:
with open('English-Italian-Glossary-Nautical-Terms1.txt', 'r') as out_file:
    with open("nautical_glossary.txt", 'w') as in_file:
        for line in out_file.readlines():
            in_file.write(line)
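One way to "find the correct encoding" is to look for a byte order mark (BOM) at the start of the file. A byte 0xfe at position 0, as in the traceback, would be consistent with a UTF-16 big-endian BOM (0xFE 0xFF), though that is only a guess without seeing the file. The sketch below uses the BOM constants from the standard codecs module; the helper name guess_encoding_from_bom is hypothetical.

```python
import codecs


def guess_encoding_from_bom(path, default="utf-8"):
    """Guess a text file's encoding from its byte order mark, if it has one."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM found: fall back to the caller's default guess.
    return default
```

The result can then be passed to open(), for example open('English-Italian-Glossary-Nautical-Terms1.txt', 'r', encoding=guess_encoding_from_bom('English-Italian-Glossary-Nautical-Terms1.txt')). The 'utf-16', 'utf-32' and 'utf-8-sig' codecs consume the BOM while decoding.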
