Issue with displaying results in the loop after collect() - apache-spark

What is the problem?
I have a problem displaying data read from a text file. The file (yields.txt) has 3 lines, yet a fourth line is apparently being read as well, with some strange content.
File
File encoding: UTF-8 (I also checked ASCII, same issue)
EOL: Unix (LF) (I also checked Windows (CRLF), same issue)
1 -0.0873962663951055 0.0194176287820278 -0.0097985244947938 -0.0457230361016478 -0.0912513154921251 0.0448220622524235
2 0.049279031957286 0.069222988721009 0.0428232461362216 0.0720027150750844 -0.0209348305073702 -0.0641023433269808
3 0.0770763924363555 -0.0790020383071036 -0.0601622344182963 -0.0207625817307966 -0.0193570710130222 -0.0959349375686872
Bug Description
Log from console
in mapper
    return Row(ID=int(fields[0]), asset_1=float(fields[1]), asset_2=float(fields[2]), asset_3=float(fields[3]), asset_4=float(fields[4]), asset_5=float(fields[5]), asset_6=float(fields[6]))
ValueError: invalid literal for int() with base 10: b'PK\x03\x04\x14\x00\x00\x00\x08\x00AW\xef\xbf\xbdT\xef\xbf\xbdu\xef\xbf\xbdDZ\xef\xbf\xbd\x1e\x03i\x18\xef\xbf\xbd\x07'
I have also tried to find out what is inside this content; it is some strange data that does not appear in the text file at all, which I checked with the script shown below:
import os

DATA_FOLDER_PATHNAME = '\\'.join(os.path.dirname(__file__).split('\\')[:-1]) + '\\' + 'data' + '\\' + 'yields.txt'

with open(DATA_FOLDER_PATHNAME, 'r', encoding='ansi') as f:
    print(f.read())
You can see that an empty line is visible, but I do not know how to improve my code to avoid this bug.
Code
import findspark
import os

findspark.init(PATH_TO_SPARK)

from pyspark.sql import SparkSession
from pyspark.sql import Row

DATA_FOLDER_PATHNAME = '\\'.join(os.path.dirname(__file__).split('\\')[:-1]) + '\\' + 'data'  # location of data file

def mapper(line):
    fields = line.split()
    return Row(ID=int(fields[0]), asset_1=float(fields[1]), asset_2=float(fields[2]),
               asset_3=float(fields[3]), asset_4=float(fields[4]),
               asset_5=float(fields[5]), asset_6=float(fields[6]))

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
lines = spark.sparkContext.textFile(DATA_FOLDER_PATHNAME, minPartitions=2000, use_unicode=False)
assets_with_yields_rdd = lines.map(mapper)
assets_with_yields_df = spark.createDataFrame(assets_with_yields_rdd).cache()
assets_with_yields_df.createOrReplaceTempView('assets_with_yields_view')
assets_with_yields_view_df = spark.sql('select * from assets_with_yields_view')

print(80 * '-')
for asset in assets_with_yields_view_df.collect():
    print(asset)
    print(80 * '-')

spark.stop()
Question
Does anyone know what could cause such a weird issue?

The reason for this was that DATA_FOLDER_PATHNAME points at the data folder rather than at yields.txt, and several files lived in that folder; textFile read them all in turn, hence the discrepancies. The stray bytes come from one of the other files: PK\x03\x04 is the ZIP local-file-header signature, which is what an .xlsx workbook starts with.
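A minimal fix, assuming the folder layout above, is to hand textFile the file itself rather than the whole data folder (DATA_FILE_PATHNAME is a hypothetical name):

import os

# point Spark at yields.txt itself so it cannot pick up the other files
# (an .xlsx workbook is a ZIP archive, hence the PK\x03\x04 bytes above)
DATA_FILE_PATHNAME = os.path.join(DATA_FOLDER_PATHNAME, 'yields.txt')
lines = spark.sparkContext.textFile(DATA_FILE_PATHNAME, minPartitions=2000, use_unicode=False)

Alternatively, keep the folder path but move everything except yields.txt out of the data directory.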

Related

For loop into a pandas dataframe

I have the following piece of code and it works, printing out the data as it should. I'm trying (unsuccessfully) to put the results into a dataframe so I can export them to a csv file.
I am looping through a json file and the results are correct; I just need the two columns that print out to go into a dataframe instead of being printed. I took out the code that was causing the error so it will run.
import json
import requests
import re
import pandas as pd

data = {}
df = pd.DataFrame(columns=['subtechnique', 'name'])
df

RE_FOR_SUB_TECHNIQUE = r"(T\d+)\.(\d+)"

r = requests.get('https://raw.githubusercontent.com/mitre/cti/master/enterprise-attack/enterprise-attack.json', verify=False)
data = r.json()
objects = data['objects']

for obj in objects:
    ext_ref = obj.get('external_references', [])
    revoked = obj.get('revoked') or '*****'
    subtechnique = obj.get('x_mitre_is_subtechnique')
    name = obj.get('name')
    for ref in ext_ref:
        ext_id = ref.get('external_id') or ''
        if ext_id:
            re_match = re.match(RE_FOR_SUB_TECHNIQUE, ext_id)
            if re_match:
                technique = re_match.group(1)
                sub_technique = re_match.group(2)
                print('{},{}'.format(technique + '.' + sub_technique, name))
Unless there is an easier way to take the results of each row in the loop and append them to a csv file.
Any help is appreciated.
Thanks
In this instance, it's likely easier to just write the csv file directly, rather than go through Pandas:
import csv

# newline="" keeps csv.writer from inserting blank rows on Windows
with open("enterprise_attack.csv", "w", newline="") as f:
    my_writer = csv.writer(f)
    for obj in objects:
        ext_ref = obj.get('external_references', [])
        revoked = obj.get('revoked') or '*****'
        subtechnique = obj.get('x_mitre_is_subtechnique')
        name = obj.get('name')
        for ref in ext_ref:
            ext_id = ref.get('external_id') or ''
            if ext_id:
                re_match = re.match(RE_FOR_SUB_TECHNIQUE, ext_id)
                if re_match:
                    technique = re_match.group(1)
                    sub_technique = re_match.group(2)
                    print('{},{}'.format(technique + '.' + sub_technique, name))
                    my_writer.writerow([technique + "." + sub_technique, name])
It should be noted that the above will overwrite the output of any previous runs. If you wish to keep the output of multiple runs, change the file mode to "a":
with open("enterprise_attack.csv", "a", newline="") as f:
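If you do want the dataframe route the question asked about, a minimal sketch (reusing the objects list and RE_FOR_SUB_TECHNIQUE from above) is to collect the rows in a plain list and build the frame once at the end:

import re
import pandas as pd

rows = []
for obj in objects:
    name = obj.get('name')
    for ref in obj.get('external_references', []):
        re_match = re.match(RE_FOR_SUB_TECHNIQUE, ref.get('external_id') or '')
        if re_match:
            rows.append({'subtechnique': re_match.group(1) + '.' + re_match.group(2),
                         'name': name})

df = pd.DataFrame(rows, columns=['subtechnique', 'name'])
df.to_csv("enterprise_attack.csv", index=False)

Appending to a DataFrame row by row copies the frame on every append; building it once from a list avoids that.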

Converting multiple .pdf files with multiple pages into 1 single .csv file

I am trying to convert .pdf data to a spreadsheet. Based on some research, several people recommended transforming it into csv first in order to avoid errors.
So I wrote the code below, which gives me:
"TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid"
The error appears at the pd.concat call.
import tabula
import pandas as pd
import glob

path = r'C:\Users\REC.AC'
all_files = glob.glob(path + "/*.pdf")
print(all_files)

df = pd.concat(tabula.read_pdf(f1) for f1 in all_files)
df.to_csv("output.csv", index=False)
Since this might be a common issue, I am posting the solution I found.
dfs = []
for f1 in all_files:
    # read_pdf returns a list of tables for each file, so concatenate them per file first
    dfs.append(pd.concat(tabula.read_pdf(f1)))
df = pd.concat(dfs)  # then combine all files into one dataframe
Breaking the iteration into two steps works because pd.concat only accepts Series and DataFrame objects: tabula.read_pdf hands back a list of tables per file, so concatenating each file's list first produces DataFrames that the outer concat can combine.
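An equivalent one-liner, assuming each read_pdf call does return a list of DataFrames, flattens all tables before a single concat:

df = pd.concat([table for f1 in all_files for table in tabula.read_pdf(f1)])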

PyPDF2 difference resulting in 1 character per line

I'm trying to create a simple script that will show me the difference (similar to github merging) by using difflib's HtmlDiff function.
So far I've gotten my pdf files together and am able to extract and print their contents using PyPDF2 functions.
import difflib
import os
import PyPDF2
os.chdir('.../MyPythonScripts/PDFtesterDifflib')
file1 = 'pdf1.pdf'
file2 = 'pdf2.pdf'
file1RL = open(file1, 'rb')
pdfreader1 = PyPDF2.PdfFileReader(file1RL)
PageOBJ1 = pdfreader1.getPage(0)
textOBJ1 = PageOBJ1.extractText()
file2RL = open(file2, 'rb')
pdfreader2 = PyPDF2.PdfFileReader(file2RL)
PageOBJ2 = pdfreader2.getPage(0)
textOBJ2 = PageOBJ2.extractText()
difference = difflib.HtmlDiff().make_file(textOBJ1,textOBJ2,file1,file2)
diff_report = open('...MyPythonScripts/PDFtesterDifflib/diff_report.html','w')
diff_report.write(difference)
diff_report.close()
The result is an HTML diff that shows one character per line.
How can I get my lines to read normally?
it should read:
1.apples
2.oranges
3. --this line should differ--
I am running Python 3.6 on Mac.
Thanks in advance!
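For what it's worth, difflib.HtmlDiff().make_file expects sequences of lines, and a bare string is iterated character by character, which matches the one-character-per-line output. A minimal sketch of the likely fix:

# split the extracted text into lines before diffing;
# passing the raw strings makes difflib compare character by character
difference = difflib.HtmlDiff().make_file(
    textOBJ1.splitlines(), textOBJ2.splitlines(), file1, file2)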

Python changing file name

My application offers the user the ability to export its results. It exports text files named Exp_Text_1, Exp_Text_2, etc. I want it so that if a file with the same name already exists on the Desktop, counting starts from that number upwards. For example, if a file named Exp_Text_3 is already on the Desktop, the new file should be named Exp_Text_4.
This is my code:
if len(str(self.Output_Box.get("1.0", "end"))) == 1:
    self.User_Line_Text.set("Nothing to export!")
else:
    import os.path
    self.txt_file_num = self.txt_file_num + 1
    file_name = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt" + "_" + str(self.txt_file_num) + ".txt")
    file = open(file_name, "a")
    file.write(self.Output_Box.get("1.0", "end"))
    file.close()
    self.User_Line_Text.set("A text file has been exported to Desktop!")
You likely want os.path.exists:
>>> import os
>>> help(os.path.exists)
Help on function exists in module genericpath:
exists(path)
Test whether a path exists. Returns False for broken symbolic links
A very basic example would be to create a file name with a formatting mark, so the number can be inserted for multiple checks:
import os

name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
# the "{}" is a formatting mark so we can do name_to_format.format(num)
num = 1
while os.path.exists(name_to_format.format(num)):
    num += 1
new_file_name = name_to_format.format(num)
This would check each filename, starting with Exp_Txt_1.txt, then Exp_Txt_2.txt, and so on, until it finds one that does not exist.
However the format mark may cause a problem if curly brackets {} are part of the rest of the path, so it may be preferable to do something like this:
import os

def get_file_name(num):
    return os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_" + str(num) + ".txt")

num = 1
while os.path.exists(get_file_name(num)):
    num += 1
new_file_name = get_file_name(num)
EDIT: answer to "why don't we need the get_file_name function in the first example?"
First off, if you are unfamiliar with str.format you may want to look at Python doc - common string operations and/or this simple example:
text = "Hello {}, my name is {}."
x = text.format("Kotropoulos", "Tadhg")
print(x)     # Hello Kotropoulos, my name is Tadhg.
print(text)  # Hello {}, my name is {}.  (the original string is unchanged)
The path string is figured out with this line:
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
But it has {} in the place of the desired number (since we don't know what the number should be at this point), so if the path was for example:
name_to_format = "/Users/Tadhg/Desktop/Exp_Txt_{}.txt"
then we can insert a number with:
print(name_to_format.format(1))
print(name_to_format.format(2))
and this does not change name_to_format, since str objects are immutable: .format returns a new string without modifying name_to_format. However, we would run into a problem if our path was something like these:
name_to_format = "/Users/Bob{Cat}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Bobcat{}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Smiley{:/Desktop/Exp_Txt_{}.txt"
Since the formatting mark we want to use is no longer the only pair of curly brackets, we can get a variety of errors:
KeyError: 'Cat'
IndexError: tuple index out of range
ValueError: unmatched '{' in format spec
So you should only rely on str.format when you know it is safe to use. Hope this helps, have fun coding!
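As a side note, str.format does let literal braces coexist with a formatting slot if you double them, which sidesteps the collision described above (path is hypothetical):

# doubled braces are emitted literally; the single {} stays a formatting slot
name_to_format = "/Users/Bob{{Cat}}/Desktop/Exp_Txt_{}.txt"
print(name_to_format.format(1))  # /Users/Bob{Cat}/Desktop/Exp_Txt_1.txt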

Why is csvreader for Python starting good then producing NULL bytes?

Okay, so I am reading an Excel workbook. The file started out as a .csv; after debugging and doing other things below the code I am showing you, it changed to an .xlsx and I started getting "IOError: no such file or directory". I figured out why, changed FFA.csv to FFA.xlsx, and it worked error-free. Then I did more debugging, and got up this morning to the following error: "line contains NULL byte". Weird, because the code started out fine. I put in the print repr() to debug, and it does in fact now print NULL bytes. So how do I fix this and prevent it in the future? Here are the first 200 bytes:
PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00b\xee\x9dh^\x01\x00\x00\x90\x04\x00\x00\x13\x00\x08\x02[Content_Types].xml \xa2\x04\x02(\xa0\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
import csv

def readFile():
    count = 0
    print repr(open("FFA.xlsx", "rb").read(200))  # dump 1st 200 bytes
    with open("FFA.xlsx", "rb") as csvfile:
        FFAreader = csv.reader(csvfile, delimiter=",")
        for row in FFAreader:
            idd = row[0]
            name = row[1]
            pos = row[2]
            team = row[3]
            pts = row[4]
            oecr = row[5]
            oR = row[6]
            posR = row[7]
            up = row[8]
            low = row[9]
            risk = row[10]
            swing = row[11]

readFile()
The code you have posted has a small but dangerous mistake: you are leaking a file handle by opening the file twice.
1) You open the file and read 200 bytes from it, but never close it.
2) You then open the file the proper way, via a context manager, which is fine for reading from it.
Some questions that may help you to debug the problem:
Is the file you are opening stored on a networked resource (CIFS, NFS, etc.)?
Have you checked that the file is not opened by another process? lsof can help you check that.
Is this running on Windows or Linux? If it happens on Windows, can you test it under Linux, and vice versa?
I forgot to mention that you should not use the csv module for anything related to Excel, even when the file looks like CSV data-wise. Use the xlrd module (https://pypi.python.org/pypi/xlrd); it's cross-platform, and it opens and reads both XLS and XLSX files perfectly fine since version 0.8.
This little piece of code will show you how to open the workbook and parse it in a basic manner:
import xlrd

def open_excel():
    with xlrd.open_workbook('FFA.xlsx') as wb:
        sh = wb.sheet_by_name('Sheet1')
        for rownum in xrange(sh.nrows):
            pass  # [Do whatever you need here]
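For instance, the per-row field access from the csv version could be reproduced inside that loop with sheet.row_values (a sketch, assuming the same 12-column layout as FFA.csv):

row = sh.row_values(rownum)  # list of cell values for this row
idd, name, pos, team = row[0], row[1], row[2], row[3]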
I agree with Marc. I did a training exercise importing an Excel file, and I think the pandas library would help in that case: you can import pandas as pd and use pd.read_excel(file_name) as part of a data-processing function like read_file().
So this is what I did. But I am interested in learning the xlrd method; I have the module but no documentation. This works with no error messages. Still not sure why it changed from .csv to .xlsx, but it's working now. What does the script look like in xlrd?
import csv

def readFile():
    count = 0
    # print repr(open("FFA.csv", "rb").read(200))  # dump 1st 200 bytes to check if null values are produced
    with open("FFA.csv", "rb") as csvfile:
        FFAreader = csv.reader(csvfile, delimiter=",")
        for row in FFAreader:
            idd = row[0]
            name = row[1]
            pos = row[2]
            team = row[3]
            pts = row[4]
            oecr = row[5]
            oR = row[6]
            posR = row[7]
            up = row[8]
            low = row[9]
            risk = row[10]
            swing = row[11]

readFile()
