python error while reading large files from a folder to copy to another file - python-3.x

I'm trying to read the files in a folder and copy a specific part of each file to a new file using the Python code below, but I'm getting the error shown further down.
import glob
file=glob.glob("C:/Users/prasanth/Desktop/project/prgms/rank_free1/*.txt")
fp=[]
for b in file:
    fp.append(open(b,'r'))
s1=''
for f in fp:
    d=f.read().split('\t')
    rank=d[0]
    appname=d[1]
    appid=d[2]
    s1=appid+'\n'
    file=open('C:/Users/prasanth/Desktop/project/prgms/appids_file.txt','a',encoding="utf-8")
    file.write(s1)
    file.close()
I'm getting the following error message:
Traceback (most recent call last):
  File "appids.py", line 8, in <module>
    d=f.read().split('\t')
  File "C:\Users\prasanth\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 12307: character maps to <undefined>

From what I can see, one of the files you are opening contains a byte (0x8f) that cannot be decoded with the default encoding open() uses on your system (cp1252, per the traceback), so it can't be read into a string variable without appropriate information about its encoding.
To handle this, pass the file's actual encoding to open() if you know it, or open the file for reading in binary mode and take care of the problem in your script.
You may put d=f.read().split('\t') in a try: except UnicodeDecodeError: construct and reopen the file in binary mode in the except: branch. Then handle the undecodable bytes in your script.
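A minimal sketch of that approach, assuming the files are tab-separated with the app id in the third field, as in the question. The helper names read_text and collect_app_ids are made up for illustration, and trying UTF-8 first is only a guess, since the real encoding of the files is not known here.

```python
import glob


def read_text(path, encoding="utf-8"):
    """Return the file contents as str without raising UnicodeDecodeError.

    Assumption: UTF-8 is tried first. If that fails, the same bytes are
    decoded again with errors="replace", so each undecodable byte
    becomes U+FFFD instead of stopping the script.
    """
    with open(path, "rb") as fh:  # binary mode: no implicit decoding
        raw = fh.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw.decode(encoding, errors="replace")


def collect_app_ids(pattern, out_path):
    """Append the third tab-separated field of every matching file to out_path."""
    with open(out_path, "a", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            fields = read_text(path).split("\t")
            if len(fields) > 2:  # skip files that have no third field
                out.write(fields[2] + "\n")
```

Each input file is opened and closed one at a time here, instead of keeping every file handle open in a list, which also avoids running out of file handles when the folder is large.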

Related

Piping binary data in Apache Spark

I have an RDD of binary data, which I create and pipe using:
line = sc.binaryFiles("files/Videos",10)
line.map(lambda x:x[1]).pipe("cat").take(1)
I want to pipe this data to an external program, but I get the following error:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 723, in pipe_objs
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 43: ordinal not in range(128)
Any idea how to fix this?
You didn't show us how you passed the data to an external program, but you probably want something along the lines of:
f.write(line.encode('utf8'))
You might prefer to have io.open() create that file handle f for you, using a suitable encoding=, such as 'utf8'.
Do consider moving up from Python 2 to Python 3 -- you'll get clearer diagnostics when an encode or decode is missing.
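For the io.open() suggestion, a small sketch might look like the following. The helper name write_lines_utf8 is hypothetical, and it assumes the data is text (unicode strings) rather than raw video bytes, which would need a binary-mode file instead.

```python
import io


def write_lines_utf8(path, lines):
    """Write an iterable of text lines to path, encoded as UTF-8."""
    # io.open takes encoding= on both Python 2 and Python 3;
    # on Python 3 it is the same function as the built-in open().
    with io.open(path, "w", encoding="utf8") as f:
        for line in lines:
            f.write(line + u"\n")
```

With the encoding declared on the file handle, no explicit line.encode('utf8') call is needed at each write.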

Error while trying to use pandas to read a csv

import pandas
df = pandas.read_csv("trial.csv")
The above code is used to read a simple csv file. But I keep getting the following error
  File "C:\Users\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)
  File "pandas\_libs\parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)
  File "pandas\_libs\parsers.pyx", line 989, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:12175)
  File "pandas\_libs\parsers.pyx", line 1117, in pandas._libs.parsers.TextReader._convert_column_data (pandas\_libs\parsers.c:14136)
  File "pandas\_libs\parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens (pandas\_libs\parsers.c:14972)
  File "pandas\_libs\parsers.pyx", line 1273, in pandas._libs.parsers.TextReader._convert_with_dtype (pandas\_libs\parsers.c:17119)
  File "pandas\_libs\parsers.pyx", line 1289, in pandas._libs.parsers.TextReader._string_convert (pandas\_libs\parsers.c:17347)
  File "pandas\_libs\parsers.pyx", line 1524, in pandas._libs.parsers._string_box_utf8 (pandas\_libs\parsers.c:23041)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 43: invalid continuation byte
Sorry I am so late to this. Please change your code to the following and see if that works:
import pandas
df = pandas.read_csv("trial.csv", encoding="ISO-8859-1")
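If none of the suggestions above worked, "rb" (read binary) has also been suggested:
import pandas
df = pandas.read_csv("trial.csv", "rb")
Note, however, that read_csv has no file-mode parameter: its second positional argument has historically been the separator (sep), so this call does not read the file in binary mode.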
Your parser is trying to parse utf-8 data, but your file seems to be in another encoding (or there could just be an invalid character).
Try to instruct the parser to parse as plain ascii, perhaps with some codepage (I don't know Python, so can't help with that).
Looks like you need to use the encoding parameter.
Here is the list with possible encodings.
store=pd.read_csv('Super_Store.csv', encoding='windows-1252')
We just need to tell Python the actual encoding of this file. After some trial and error, I figured out that it was in windows-1252 encoding.
This is probably because these files were saved on a Windows computer at some point and this was the default character encoding for that computer.
For details, see:
HTML Windows-1252 (ANSI) Reference
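The trial and error mentioned above can be automated roughly as follows. This is only a sketch: the function name first_working_encoding and the candidate list are assumptions, and a successful decode does not prove the guess is right (iso-8859-1 in particular accepts any byte sequence, which is why it is tried last).

```python
def first_working_encoding(path, candidates=("utf-8", "windows-1252", "iso-8859-1")):
    """Return the first candidate encoding that decodes the whole file, or None."""
    with open(path, "rb") as fh:  # read raw bytes, no decoding yet
        raw = fh.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the file, try the next one
        return name
    return None
```

The result can then be passed on, for example as pandas.read_csv("trial.csv", encoding=first_working_encoding("trial.csv")). It is still worth inspecting a few accented characters in the loaded data to confirm the choice.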

Python reading from non ascii file

I have a text file which contains the following character:
ÿ
To open the file, I've tried both:
with open (file, "r") as myfile:
AND
with codecs.open(file, encoding='utf-8') as myfile:
with success. However when I try to read the file in as a string using:
file_string=myfile.read()
OR
file_string=myfile.readline()
I keep getting this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 11889: invalid start byte
Ideally I want it to ignore the character or substitute it with '' or whitespace.
I've come up with a workaround: just use Python 2 instead of Python 3. I still can't seem to get it to work in Python 3, though.
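For Python 3, one possible way to get the "ignore or substitute" behaviour asked for is the errors parameter of the built-in open(). The sketch below assumes the file is mostly UTF-8 with a few stray bytes; the helper name read_lossy is made up for illustration.

```python
def read_lossy(path, encoding="utf-8", errors="ignore"):
    """Read a text file, dropping or replacing bytes that cannot be decoded.

    errors="ignore" silently skips undecodable bytes;
    errors="replace" substitutes U+FFFD for each of them.
    """
    with open(path, encoding=encoding, errors=errors) as myfile:
        return myfile.read()
```

Note that ignoring bytes loses data silently. If the file is really in another encoding such as latin-1 (where byte 0xff is ÿ), opening it with that encoding would keep the character instead.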

Python opening a txt file converted from pdf

I downloaded the PDF file from http://icdept.cgaux.org/pdf_files/English-Italian-Glossary-Nautical-Terms.pdf and converted it to a txt file using pdf2txt (downloaded from iTunes). I am trying to convert the contents of the file to a searchable Python dictionary (I am studying for an Italian sailing licence).
To test whether I can get the text into a format that I can parse, I am simply using:
with open('English-Italian-Glossary-Nautical-Terms1.txt', 'r') as out_file:
    with open("nautical_glossary.txt", 'w') as in_file:
        for line in out_file:
            in_file.write(line)
but I constantly get an error:
Traceback (most recent call last):
  File "/Users/admin/Desktop/untitled folder/nautical.py", line 4, in <module>
    for line in out_file:
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)
I would appreciate some help understanding the error and a suggestion to resolve the problem.
Can someone also suggest an obvious way to parse this particular file into a dictionary format?
This error tells you that the encoding of the file is not the expected one (see Wikipedia on character encodings). In other words, the ascii codec doesn't know what 0xfe means.
You should find the correct encoding of the file and open it with that. I suspect it is utf-8, but I could be wrong. Did you try opening the file to see what it looks like?
Read this and try this:
with open('English-Italian-Glossary-Nautical-Terms1.txt', 'r') as out_file:
    with open("nautical_glossary.txt", 'w') as in_file:
        for line in out_file.readlines():
            in_file.write(line)
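One more thing worth checking: byte 0xfe never occurs in valid UTF-8, and 0xfe at position 0 is consistent with a UTF-16 byte order mark (FE FF). Assuming that is the case here (it has not been verified against the actual file), a sketch that picks the codec from the BOM could look like this. The helper names are hypothetical, and UTF-32 BOMs are deliberately not handled.

```python
import codecs


def open_text_guessing_bom(path, default="utf-8"):
    """Open path for reading, choosing the codec from a leading BOM if present."""
    with open(path, "rb") as fh:  # peek at the first bytes without decoding
        head = fh.read(4)
    if head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"  # strips the UTF-8 BOM while reading
    elif head.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        encoding = "utf-16"  # this codec reads the BOM to pick the byte order
    else:
        encoding = default  # assumption: no BOM means the default encoding
    return open(path, "r", encoding=encoding)


def copy_as_utf8(src, dst):
    """Copy a text file line by line, writing the result as UTF-8."""
    with open_text_guessing_bom(src) as in_file, \
            open(dst, "w", encoding="utf-8") as out_file:
        for line in in_file:
            out_file.write(line)
```

Calling copy_as_utf8('English-Italian-Glossary-Nautical-Terms1.txt', 'nautical_glossary.txt') would then produce a UTF-8 copy that the rest of the script can read with encoding="utf-8".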

chardet in python3 and unknown file encoding

I use chardet to detect my file's encoding, but this error happened:
fh= open("file", mode="r")
sc= chardet.detect(fh)
Traceback (most recent call last):
  File "/home/alireza/test.py", line 19, in <module>
    sc= chardet.detect(fh)
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 24, in detect
    u.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 65, in feed
    aLen = len(aBuf)
TypeError: object of type '_io.TextIOWrapper' has no len()
And I can't read the file without knowing the encoding:
fh= open("file", mode="r").read()
sc= chardet.detect(fh)
Traceback (most recent call last):
  File "/home/alireza/workspacee/makecdown/test.py", line 21, in <module>
    fh= open("910.srt", mode="r").read()
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 34: invalid continuation byte
How can I use chardet without opening the file? Or is there any way to find out the file's encoding before or after opening it?
Try opening the file in binary mode, so that reading it yields bytes and nothing has to be decoded:
fh= open("file", mode="rb")
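To spell that out: chardet.detect expects bytes, not a file object, so the file has to be read after opening it in binary mode. A small sketch follows; the helper name detect_file_encoding and the sample size are made up for illustration, and the detector is passed in as a parameter so the snippet itself does not depend on chardet being installed.

```python
def detect_file_encoding(detector, path, sample_size=64 * 1024):
    """Feed the first sample_size bytes of path to detector and return its result.

    detector is any callable that accepts bytes, for example chardet.detect.
    """
    # "rb" yields bytes, which have a length. A text-mode file object does
    # not, which is what triggered the TypeError in the question.
    with open(path, "rb") as fh:
        raw = fh.read(sample_size)
    return detector(raw)
```

With chardet installed, detect_file_encoding(chardet.detect, "910.srt") should return a dict containing "encoding" and "confidence" entries, and the reported encoding can then be passed to open(..., encoding=...). Treat the result as a guess, especially when the confidence is low.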
Command Line Tool
If this does not work, try the command line tool of chardet.
Description from https://github.com/erikrose/chardet:
chardet comes with a command-line script which reports on the
encodings of one or more files:
% chardetect.py somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
Not a direct answer, but you can find a description of how it works in Python 3 here: http://getpython3.com/diveintopython3/case-study-porting-chardet-to-python-3.html. After studying that, you may find a way to detect another specific encoding.
The code was initially derived from Mozilla Seamonkey, so you may find more information there as well, or look for a more advanced Python package related to Seamonkey.
