PyPDF2 difference resulting in 1 character per line - python-3.x

I'm trying to create a simple script that shows the differences between two PDF files (similar to a GitHub merge view) using difflib's HtmlDiff.
So far I have gathered my PDF files, and I am able to open them in binary mode and extract their contents using PyPDF2 functions.
import difflib
import os
import PyPDF2
os.chdir('.../MyPythonScripts/PDFtesterDifflib')
file1 = 'pdf1.pdf'
file2 = 'pdf2.pdf'
file1RL = open(file1, 'rb')
pdfreader1 = PyPDF2.PdfFileReader(file1RL)
PageOBJ1 = pdfreader1.getPage(0)
textOBJ1 = PageOBJ1.extractText()
file2RL = open(file2, 'rb')
pdfreader2 = PyPDF2.PdfFileReader(file2RL)
PageOBJ2 = pdfreader2.getPage(0)
textOBJ2 = PageOBJ2.extractText()
difference = difflib.HtmlDiff().make_file(textOBJ1,textOBJ2,file1,file2)
diff_report = open('...MyPythonScripts/PDFtesterDifflib/diff_report.html','w')
diff_report.write(difference)
diff_report.close()
The result is a report with one character per line.
How can I get my lines to read normally?
It should read:
1.apples
2.oranges
3. --this line should differ--
I am running Python 3.6 on a Mac.
Thanks in advance!
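One likely cause, sketched here rather than confirmed by the poster: HtmlDiff.make_file() expects two lists of lines, and a plain string passed in their place is iterated one character at a time. The sketch below splits each text with splitlines() before diffing. The helper name html_diff and the sample strings are made up for illustration, and it assumes the text extracted by PyPDF2 actually contains newline characters.

```python
import difflib


def html_diff(text_a, text_b, name_a="pdf1.pdf", name_b="pdf2.pdf"):
    """Build an HTML diff report from two multi-line strings.

    make_file() iterates over its first two arguments and treats every
    item as one line, so the strings are split into lists of lines first.
    """
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return difflib.HtmlDiff().make_file(lines_a, lines_b, name_a, name_b)


# Sample strings standing in for the text extracted from each PDF page.
report = html_diff(
    "1.apples\n2.oranges\n3.bananas",
    "1.apples\n2.oranges\n3.cherries",
)
```

The returned string can be written to diff_report.html exactly as in the question.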

Related

nonetype can print but cannot be written into a txt file

I run the following code in Python and it does produce a str-like result, but when I try to write that result to my text file, it either writes nothing or reports that the value is a NoneType and so cannot be written to a file. How should I convert the result back into a str and write it to a file? Thanks.
f1 = open('Documents/new_corpus/raw-corpus/seg/word_seg/pos/pos_depression33.txt','w')
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
import hanlp
hanlp.pretrained.mtl.ALL
a = []
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)
with open('Documents/new_corpus/raw-corpus/seg/word_seg/seg2_depression1.txt') as file:
    for line in file:
        HanLP([line],tasks = 'pos').pretty_print()
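A method that prints its result usually returns None, which would explain why the output shows on screen but cannot be written to a file. One generic workaround, sketched below with only the standard library, is to capture what the call prints. The function show is a made-up stand-in for a print-only method, and capture_printed is a hypothetical helper; the sketch assumes pretty_print() writes to standard output.

```python
import io
from contextlib import redirect_stdout


def show(tokens):
    """Hypothetical stand-in for a print-only method: prints, returns None."""
    print(" ".join(tokens))


def capture_printed(func, *args, **kwargs):
    """Call func and return everything it printed to stdout as one string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


text = capture_printed(show, ["tagged", "output"])
```

The captured string can then be passed to f1.write(text) inside the loop.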

Issue with displaying results in the loop after collect()

What is the problem about?
I have a problem displaying data that has been read from a text file. The file (yields.txt) has 3 lines and it looks like a fourth line is being read as well, with some strange content.
File
File encoding: UTF-8 -> I also tried ASCII, but the issue is the same
EOL: Unix (LF) -> I also tried Windows (CRLF), but the issue is the same
1 -0.0873962663951055 0.0194176287820278 -0.0097985244947938 -0.0457230361016478 -0.0912513154921251 0.0448220622524235
2 0.049279031957286 0.069222988721009 0.0428232461362216 0.0720027150750844 -0.0209348305073702 -0.0641023433269808
3 0.0770763924363555 -0.0790020383071036 -0.0601622344182963 -0.0207625817307966 -0.0193570710130222 -0.0959349375686872
Bug Description
Log from console
in mapper
return Row(ID=int(fields[0]),asset_1 = float(fields[1]), asset_2 = float(fields[2]), asset_3 = float(fields[3]),asset_4 = float(fields[4]), asset_5 = float(fields[5]), asset_6 = float(fields[6]))
ValueError: invalid literal for int() with base 10: b'PK\x03\x04\x14\x00\x00\x00\x08\x00AW\xef\xbf\xbdT\xef\xbf\xbdu\xef\xbf\xbdDZ\xef\xbf\xbd\x1e\x03i\x18\xef\xbf\xbd\x07'
I have also tried to find out where this content comes from.
It is strange data that does not appear in the text file at all, which I checked with the script shown below:
import os
DATA_FOLDER_PATHNAME = '\\'.join(os.path.dirname(__file__).split('\\')[:-1])+'\\'+'data'+'\\'+'yields.txt'
with open(DATA_FOLDER_PATHNAME, 'r', encoding='ansi') as f:
    print(f.read())
You can see that an empty line is visible but I do not know how to improve my code to avoid this bug.
Code
import findspark
import os
findspark.init(PATH_TO_SPARK)
from pyspark.sql import SparkSession
from pyspark.sql import Row
DATA_FOLDER_PATHNAME = '\\'.join(os.path.dirname(__file__).split('\\')[:-1])+'\\'+'data' # location of data file
def mapper(line):
    fields = line.split()
    return Row(ID=int(fields[0]),asset_1 = float(fields[1]), asset_2 = float(fields[2]), asset_3 = float(fields[3]),asset_4 = float(fields[4]), asset_5 = float(fields[5]), asset_6 = float(fields[6]))
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
lines = spark.sparkContext.textFile(DATA_FOLDER_PATHNAME, minPartitions = 2000, use_unicode = False)
assets_with_yields_rdd = lines.map(mapper)
assets_with_yields_df = spark.createDataFrame(assets_with_yields_rdd).cache()
assets_with_yields_df.createOrReplaceTempView('assets_with_yields_view')
assets_with_yields_view_df = spark.sql('select * from assets_with_yields_view')
print(80*'-')
for asset in assets_with_yields_view_df.collect():
    print(asset)
print(80*'-')
spark.stop()
Question
Does anyone know what could cause such a weird issue?
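The reason was that the data folder contained several files and Spark read all of them in turn, so some of the content did not match the expected format. The bytes in the error message begin with b'PK\x03\x04', the signature that starts a ZIP archive, which is consistent with a non-text file in that folder being parsed as if it were yields.txt.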
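Since the path passed to textFile() in the question points at the data folder rather than at a single file, one way to avoid picking up other files is to build the full path to yields.txt. This is a minimal sketch, not the poster's actual fix: the helper name data_file_path is made up, and it assumes the same layout as the question, with the script one directory below the project root and the file in a sibling data folder.

```python
from pathlib import Path


def data_file_path(script_path, file_name="yields.txt"):
    """Return <project>/data/<file_name> for a script in <project>/<subdir>/."""
    project_dir = Path(script_path).parent.parent
    return str(project_dir / "data" / file_name)


# In the question's script this would be called as data_file_path(__file__),
# and the result passed to spark.sparkContext.textFile() in place of the
# folder path, so that only this one file is read.
```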

Python 3: Zip multiple files from string

I need to create a zip file from multiple txt files generated from strings.
import zipfile
from io import StringIO
def zip_files(file_arr):
    # file_arr is an array of [(fname, fbuffer), ...]
    f = StringIO()
    z = zipfile.ZipFile(f, 'w', zipfile.ZIP_DEFLATED)
    for f in file_arr:
        z.writestr(f[0], f[1])
    z.close()
    return f.getvalue()
file1 = ('f1.txt', 'Question1\nQuestion2\n\nQuestion3')
file2 = ('f2.txt', 'Question4\nQuestion5\n\nQuestion6')
f_arr = [file1, file2]
return zip_files(f_arr)
This throws the error TypeError: string argument expected, got 'bytes' on writestr(). I have tried to use BytesIO instead of StringIO, but I get the same error. This is based on this answer, which is able to do this for Python 2.
I can't seem to find anything online about using zipfile for multiple files stored
Zip files are binary files, so you should use an io.BytesIO stream instead of an io.StringIO one.
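A minimal sketch of that suggestion follows. Besides switching to BytesIO, it uses a separate name for the loop variable: in the question's code the loop reuses f, which replaces the buffer with a tuple before getvalue() is called. The names buffer, archive, and zip_bytes are illustrative choices, not part of the original code.

```python
import io
import zipfile


def zip_files(file_arr):
    """Zip [(name, text), ...] in memory and return the archive as bytes."""
    buffer = io.BytesIO()  # zip data is binary, so use a bytes buffer
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in file_arr:
            # writestr() accepts str (encoded as UTF-8) or bytes
            archive.writestr(name, content)
    # The archive is closed on leaving the with block, so it is complete here.
    return buffer.getvalue()


files = [
    ("f1.txt", "Question1\nQuestion2\n\nQuestion3"),
    ("f2.txt", "Question4\nQuestion5\n\nQuestion6"),
]
zip_bytes = zip_files(files)
```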

python receive filename not contents - variable refers to (python3)

I have a script whose output I want to pass to odo. odo takes a filename as input. Because I need to tidy the CSV up first, I pass it through a script that creates a new file, which I reference with a variable.
How can I get just the filename from the variable so I can pass it as an argument to odo (from the Blaze project)?
You can see here that when this script is pasted into IPython I get the entire contents of the file.
In [8]: %paste
from odo import odo
import pandas as pd
from clean2 import clean
import os
filegiven = '20150704RHIL0.csv'
myFile = clean(filegiven)
toUse = (filegiven + '_clean.csv')
print(os.path.realpath(toUse))
## -- End pasted text --
Surfin' Safari 3 0
... Many lines later
Search Squad (NZ) 4 5
C:\Users\sayth\Repos\Notebooks\20150704RHIL0.csv_clean.csv # from print
I just need to get this name so that my script could look like the following, where myFile would give odo the filename, not the contents.
from odo import odo
import pandas as pd
from clean2 import clean
filegiven = '20150704RHIL0.csv'
myFile = clean(filegiven)
odo(myFile, pd.DataFrame)
Solution
This is how I solved it; there are likely better ways.
from odo import odo
import pandas as pd
from clean2 import clean
import os.path
filegiven = '20150704RHIL0.csv'
clean(filegiven)
fileName = os.path.basename(filegiven)
fileNameSplit = fileName.split(".")
fileNameUse = fileNameSplit[0] + '_clean.' + fileNameSplit[1]
odo(fileNameUse, pd.DataFrame)
To get the filename from a file object (assuming it is a standard file object created with open()), you can use its name attribute.
Example -
>>> f = open("a.py",'r')
>>> f.name
'a.py'
Please note that in your situation this is unnecessary: you could have clean(filegiven) return the filename instead of a file object, and then, if you really need the file object, open it in your script.
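As a rough sketch of that last suggestion, a cleaning function can write the tidied file and return its path. The body below is a made-up stand-in, since the real clean2.clean is not shown; dropping blank lines is only a placeholder for the actual tidying, and the output name follows the name_clean.ext pattern used in the poster's solution.

```python
import os


def clean(path):
    """Hypothetical stand-in for clean2.clean: write a tidied copy of the
    file and return the new file's path instead of a file object."""
    root, ext = os.path.splitext(path)
    out_path = f"{root}_clean{ext}"
    with open(path, newline="") as src, open(out_path, "w", newline="") as dst:
        for line in src:
            if line.strip():  # placeholder tidy-up: drop blank lines
                dst.write(line)
    return out_path


# The return value can then go straight to odo, for example:
#   odo(clean('20150704RHIL0.csv'), pd.DataFrame)
```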

parsing a remote pdf file with Python3 & PyPDF2

I need to parse a remote PDF file. With PyPDF2, I expected this to work with PdfReader(f), where f = urllib.request.urlopen("some-url").read(). However, f cannot be used by PdfReader, and it seems that f has to be decoded. What argument should be passed to decode(), or does some other method have to be used?
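You need to use:
f = urllib.request.urlopen("some-url").read()
read() returns bytes, so wrap them in an in-memory binary stream rather than decoding them. Add these lines after the line above (in Python 3 the class to use is io.BytesIO; the Python 2 StringIO module no longer exists):
from io import BytesIO
f = BytesIO(f)
and then read using PdfReader:
reader = PdfReader(f)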
Also, refer: Opening pdf urls with pyPdf
