Piping binary data in Apache Spark

So I have an RDD of binary data, which I create using:
line = sc.binaryFiles("files/Videos",10)
line.map(lambda x:x[1]).pipe("cat").take(1)
I want to pipe this data to an external program, but I get the following error:
> Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 723, in pipe_objs
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 43: ordinal not in range(128)
Any idea how to fix this?

You didn't show us how you passed the data to an external program, but you probably want something along the lines of:
f.write(line.encode('utf8'))
You might prefer to have io.open() create that file handle f for you, using a suitable encoding=, such as 'utf8'.
Do consider moving up from python2 to python3 -- you'll get clearer diagnostics about when there's a missing encode or decode.
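One possible workaround, which is not part of the answer above, is to turn each binary blob into plain ASCII before it reaches pipe(), for example with base64. The sketch below assumes the external program can decode base64 on its side; the helper names to_pipe_line and from_pipe_line are made up for illustration, and the Spark call is only shown as a comment because it needs a running SparkContext.

```python
import base64


def to_pipe_line(data):
    """Encode raw bytes as a single ASCII line that is safe to send through a text pipe."""
    # b64encode never emits newlines, so each record stays on one line.
    return base64.b64encode(data).decode("ascii")


def from_pipe_line(text):
    """Inverse of to_pipe_line, for the receiving side."""
    return base64.b64decode(text)


# Hypothetical use with the RDD from the question (not executed here):
# line.map(lambda x: to_pipe_line(x[1])).pipe("cat").take(1)
```

The cost is roughly a one-third increase in data size, which may or may not matter for large video files.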

Related

Error while trying to use pandas to read a csv

import pandas
df = pandas.read_csv("trial.csv")
The code above reads a simple CSV file, but I keep getting the following error:
File "C:\Users\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)
File "pandas\_libs\parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)
File "pandas\_libs\parsers.pyx", line 989, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:12175)
File "pandas\_libs\parsers.pyx", line 1117, in pandas._libs.parsers.TextReader._convert_column_data (pandas\_libs\parsers.c:14136)
File "pandas\_libs\parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens (pandas\_libs\parsers.c:14972)
File "pandas\_libs\parsers.pyx", line 1273, in pandas._libs.parsers.TextReader._convert_with_dtype (pandas\_libs\parsers.c:17119)
File "pandas\_libs\parsers.pyx", line 1289, in pandas._libs.parsers.TextReader._string_convert (pandas\_libs\parsers.c:17347)
File "pandas\_libs\parsers.pyx", line 1524, in pandas._libs.parsers._string_box_utf8 (pandas\_libs\parsers.c:23041)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 43: invalid continuation byte
Hi, sorry I am so late to this. Please change your code to the below and see if that works.
import pandas
df = pandas.read_csv("trial.csv", encoding="ISO-8859-1")
import pandas
df = pandas.read_csv("trial.csv", "rb")
If none of the suggestions above worked, "rb" (read binary) might do the trick.
Your parser is trying to parse utf-8 data, but your file seems to be in another encoding (or there could just be an invalid character).
Try to instruct the parser to parse as plain ascii, perhaps with some codepage (I don't know Python, so can't help with that).
Looks like you need to use the encoding parameter.
Here is the list with possible encodings.
store=pd.read_csv('Super_Store.csv', encoding='windows-1252')
We just need to tell Python the actual encoding of this file. After some trial and error, I figured out that it was in windows-1252 encoding.
This is probably because these files were saved on a Windows computer at some point and this was the default character encoding for that computer.
For details go to :
HTML Windows-1252 (ANSI) Reference
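The trial and error described above can be automated with the standard library alone. This is a minimal sketch, assuming a short list of candidate encodings; the function name first_working_encoding and the candidate list are illustrative, not part of pandas. A successful decode only shows that the bytes are valid in that encoding, not that the text is what the author intended.

```python
def first_working_encoding(raw, candidates=("utf-8", "windows-1252", "ISO-8859-1")):
    """Return the first candidate encoding that decodes raw bytes without error, or None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical use with the file from the question:
# with open("trial.csv", "rb") as fh:
#     enc = first_working_encoding(fh.read())
# df = pandas.read_csv("trial.csv", encoding=enc)
```

ISO-8859-1 accepts every byte sequence, so it only makes sense as the last candidate.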

python error while reading large files from a folder to copy to another file

I'm trying to read the files in a folder and copy a specific part of each file to a new file using the Python code below, but I'm getting an error:
import glob
file=glob.glob("C:/Users/prasanth/Desktop/project/prgms/rank_free1/*.txt")
fp=[]
for b in file:
    fp.append(open(b,'r'))
s1=''
for f in fp:
    d=f.read().split('\t')
    rank=d[0]
    appname=d[1]
    appid=d[2]
    s1=appid+'\n'
    file=open('C:/Users/prasanth/Desktop/project/prgms/appids_file.txt','a',encoding="utf-8")
    file.write(s1)
    file.close()
I'm getting the following error message:
Traceback (most recent call last):
File "appids.py", line 8, in <module>
d=f.read().split('\t')
File "C:\Users\prasanth\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 12307: character maps to <undefined>
From what I can see, one of the files you are opening contains bytes that cannot be decoded with the default encoding (cp1252, according to the traceback), so it can't be read into a string variable without the right information about its encoding.
To handle this, open the file for reading in binary mode and take care of the problem in your script.
You can put d=f.read().split('\t') in a try: except: construct and reopen the file in binary mode in the except: branch, then handle the undecodable bytes in your script.
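The try/except approach described above could look roughly like this. It is a sketch under assumptions: the helper name read_text_lenient is made up, UTF-8 is only a guess at the intended encoding, and undecodable bytes are replaced with U+FFFD rather than repaired.

```python
def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, falling back to binary mode if it cannot be decoded."""
    try:
        with open(path, "r", encoding=encoding) as fh:
            return fh.read()
    except UnicodeDecodeError:
        # Reopen in binary mode and replace each undecodable byte with U+FFFD.
        with open(path, "rb") as fh:
            return fh.read().decode(encoding, errors="replace")


# Hypothetical use inside the loop from the question, with b being a file name:
# d = read_text_lenient(b).split('\t')
```

Passing file names instead of keeping a list of open file objects also avoids leaving many files open at once.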

Python opening a txt file converted from pdf

I downloaded from http://icdept.cgaux.org/pdf_files/English-Italian-Glossary-Nautical-Terms.pdf the pdf file and converted it to a txt file using pdf2txt ( downloaded from iTunes) I am trying to convert the contents of the file to a searchable Python dictionary(I am studying for an Italian sailing licence).
I am using the following simply to test whether I can get the text into a format that I can parse:
with open('English-Italian-Glossary-Nautical-Terms1.txt', 'r') as out_file:
    with open("nautical_glossary.txt", 'w') as in_file:
        for line in out_file:
            in_file.write(line)
but constantly get an error:
Traceback (most recent call last):
File "/Users/admin/Desktop/untitled folder/nautical.py", line 4, in <module>
for line in out_file:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 0: ordinal not in range(128)
I would appreciate some help understanding the error and a suggestion to resolve the problem.
I am not sure whether someone can suggest an obvious way to parse this particular file into a dictionary format?
This error tells you that the encoding of the file is not the expected one. See Wikipedia about it. In other words, the ascii codec doesn't know what 0xfe means.
You should find the correct encoding of the file and open it with that. A byte 0xfe at position 0 is never valid in UTF-8 and often belongs to a UTF-16 byte order mark, so I suspect it is utf-16, but I could be wrong. Did you try to open the file to see what it looks like?
Read this and try this:
with open('English-Italian-Glossary-Nautical-Terms1.txt', 'r') as out_file:
    with open("nautical_glossary.txt", 'w') as in_file:
        for line in out_file.readlines():
            in_file.write(line)
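The byte 0xfe at position 0 in the traceback may be part of a byte order mark, which would point to UTF-16 rather than ASCII. The sketch below checks the first bytes of a file against the marks defined in the standard codecs module; the function name sniff_bom and the fallback to UTF-8 are assumptions, and a file without a mark cannot be identified this way.

```python
import codecs

# Byte order marks and a codec that consumes each one.
# UTF-32 is checked before UTF-16 because the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(path, default="utf-8"):
    """Guess a file's encoding from its byte order mark, or return default if there is none."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


# Hypothetical use with the file from the question:
# enc = sniff_bom('English-Italian-Glossary-Nautical-Terms1.txt')
# with open('English-Italian-Glossary-Nautical-Terms1.txt', 'r', encoding=enc) as out_file:
#     ...
```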

Segmentation fault (core dumped) while calling python script from NodeJS through spawn

I have a Python script that prints out a long list using the statistical language R (through PypeR). This Python script works absolutely fine.
Now I am trying to run this script from NodeJS through the spawn function of child_process, but it fails with the following error:
Traceback (most recent call last):
File "pyper_sample.py", line 5, in <module>
r=R()
File "/home/mehtam/pyper.py", line 582, in __init__
'prog' : Popen(RCMD, stdin=PIPE, stdout=PIPE, stderr=return_err and _STDOUT or childstderr, startupinfo=info),
File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
errread, errwrite)
File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
./temp.sh: line 1: 27500 Segmentation fault (core dumped) python pyper_sample.py o1dn01.tsv cpu_overall
child process exited with code : 139
Note: My python script is working perfectly. I already tested it manually.
> My python script is working perfectly. I already tested it manually.
The output clearly shows that OSError: No such file or directory exception happened during Popen() call.
It means that the program is not found e.g.,
>>> from subprocess import Popen
>>> p = Popen(["ls", "-l"]) # OK
>>> total 0
>>> p = Popen(["no-such-program-in-current-path"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Also, passing the whole command as a string instead of a list (shell=False by default) is a common error:
>>> p = Popen("ls -l")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Make sure that:
your (child) program can be found in the current $PATH
you use a list argument instead of a string
it still works if you run it manually from a different working directory, as a different user, etc.
Note: your Popen() call passes startupinfo that is Windows only. A string command with several arguments that would work on Windows fails with the "No such file or directory" error on Unix.
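The checklist above can be folded into a small wrapper. This is a sketch, not the PypeR code: it assumes Python 3.3 or later (shutil.which does not exist in the Python 2.6 shown in the traceback), and the helper name run_command is made up for illustration.

```python
import shutil
import subprocess
import sys


def run_command(argv):
    """Run argv (a list, not a string) and return its stdout as bytes."""
    program = argv[0]
    # Fail early with a clear message if the program cannot be found on PATH.
    if shutil.which(program) is None:
        raise FileNotFoundError("%r not found on PATH" % program)
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout


if __name__ == "__main__":
    # The interpreter itself is used here only because it is known to exist.
    print(run_command([sys.executable, "-c", "print('ok')"]))
```

Printing os.environ["PATH"] from inside the script when it is started by NodeJS is a quick way to see whether the spawned process has the same search path as your interactive shell.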

chardet in python3 and unknown file encoding

I use chardet to detect my file's encoding, but this error happened:
fh= open("file", mode="r")
sc= chardet.detect(fh)
Traceback (most recent call last):
File "/home/alireza/test.py", line 19, in <module>
sc= chardet.detect(fh)
File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 24, in detect
u.feed(aBuf)
File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 65, in feed
aLen = len(aBuf)
TypeError: object of type '_io.TextIOWrapper' has no len()
and I can't open the file without knowing the encoding:
fh= open("file", mode="r").read()
sc= chardet.detect(fh)
Traceback (most recent call last):
File "/home/alireza/workspacee/makecdown/test.py", line 21, in <module>
fh= open("910.srt", mode="r").read()
File "/usr/lib/python3.2/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 34: invalid continuation byte
How can I use chardet without opening the file? Or is there any way to find out a file's encoding before or after opening it?
Try opening the file like this
fh= open("file", mode="rb")
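Opening in binary mode is only half of the fix: chardet.detect expects the bytes themselves, not the file object, which is what the TypeError in the question is about. A minimal sketch follows, assuming the third-party chardet package is installed. The function name detect_encoding, the sample_size limit, and the optional detect parameter (which lets you swap in another detector) are illustrative choices, not part of chardet.

```python
def detect_encoding(path, detect=None, sample_size=64 * 1024):
    """Return the encoding name guessed for the file at path, or None if undetermined."""
    if detect is None:
        import chardet  # third-party package, assumed to be installed
        detect = chardet.detect
    # Read raw bytes; no decoding happens here, so no encoding is needed yet.
    with open(path, "rb") as fh:
        raw = fh.read(sample_size)
    return detect(raw)["encoding"]


# Hypothetical use with the file from the question:
# enc = detect_encoding("910.srt")
# text = open("910.srt", mode="r", encoding=enc).read()
```

The result is a guess with an associated confidence, so it is worth checking the decoded text before relying on it.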
Command Line Tool
If this does not work, try the command line tool of chardet.
Description from https://github.com/erikrose/chardet:
chardet comes with a command-line script which reports on the
encodings of one or more files:
% chardetect.py somefile someotherfile
somefile: windows-1252 with confidence 0.5
someotherfile: ascii with confidence 1.0
Not a direct answer, but you can find a description of how it works in Python 3 here: http://getpython3.com/diveintopython3/case-study-porting-chardet-to-python-3.html. After studying that, you may find a way to detect another specific encoding.
The code was initially derived from Mozilla Seamonkey, so you may find more information there as well, or look for a more advanced Python package related to Seamonkey.

Resources