Parsing a remote PDF file with Python 3 & PyPDF2

I need to parse a remote PDF file. With PyPDF2 this should be possible with PdfReader(f), where f = urllib.request.urlopen("some-url").read(). However, PdfReader does not accept f directly, and it seems that f has to be decoded first. What argument should be passed to decode(), or is a different method needed?

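Read the response body as before:
f = urllib.request.urlopen("some-url").read()
The result is a bytes object, while PdfReader expects a path or a file-like object, so wrap the bytes in an in-memory binary stream instead of decoding them. The Python 2 StringIO module no longer exists in Python 3; use io.BytesIO:
from io import BytesIO
f = BytesIO(f)
Then pass the stream to PdfReader:
reader = PdfReader(f)
Also, refer: Opening pdf urls with pyPdf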
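
Putting the steps above together, a minimal sketch might look like the following. The helper names fetch_pdf_stream and read_remote_pdf and the 30-second timeout are assumptions made for illustration, not part of PyPDF2.

```python
import io
import urllib.request


def fetch_pdf_stream(url, timeout=30):
    """Download url and return its body as a seekable in-memory binary stream."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # read() returns bytes; BytesIO gives them a file-like interface
        return io.BytesIO(response.read())


def read_remote_pdf(url):
    """Return a PdfReader for the PDF at url (requires PyPDF2 to be installed)."""
    # Imported here so that fetch_pdf_stream stays usable without PyPDF2.
    from PyPDF2 import PdfReader
    return PdfReader(fetch_pdf_stream(url))
```

The download is kept separate from the parsing so that the stream can be checked or reused before it is handed to PdfReader.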

Related

Python 3: Zip multiple files from string

I need to create a zip file from multiple txt files generated from strings.
import zipfile
from io import StringIO
def zip_files(file_arr):
    # file_arr is an array of [(fname, fbuffer), ...]
    f = StringIO()
    z = zipfile.ZipFile(f, 'w', zipfile.ZIP_DEFLATED)
    for f in file_arr:
        z.writestr(f[0], f[1])
    z.close()
    return f.getvalue()
file1 = ('f1.txt', 'Question1\nQuestion2\n\nQuestion3')
file2 = ('f2.txt', 'Question4\nQuestion5\n\nQuestion6')
f_arr = [file1, file2]
return zip_files(f_arr)
This throws the error TypeError: string argument expected, got 'bytes' on writestr(). I have tried using BytesIO instead of StringIO, but I get the same error. The code is based on this answer, which does the same thing in Python 2.
I can't seem to find anything online about using zipfile for multiple files stored

Zip files are binary files, so you should use an io.BytesIO stream instead of an io.StringIO one.
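
A sketch of the function from the question with that change applied is shown below. Note also that the question's loop variable f shadows the buffer f, so the final f.getvalue() would be called on a tuple; the variable names here are changed to avoid that.

```python
import io
import zipfile


def zip_files(file_arr):
    """Build a zip archive in memory from [(filename, text), ...] and return its bytes."""
    buffer = io.BytesIO()  # zip data is binary, so a bytes buffer is required
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in file_arr:
            # writestr accepts str (encoded as UTF-8) or bytes
            archive.writestr(name, text)
    # The archive must be closed before reading the buffer, which the
    # with-block guarantees.
    return buffer.getvalue()
```

The returned bytes can be written to disk with open(path, 'wb') or sent as a binary HTTP response.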

How to convert image which type is bytes to numpy.ndarray?

I'm trying to optimize my code.
First, I get an image as a bytes object.
Then I have to write that image to the file system:
with open('test2.jpg', 'wb') as f:
    f.write(content)
Finally, I read the image back with
from scipy import misc
misc.imread('test2.jpg')
which converts the image to an np.array.
I want to skip the step where I write the image to the file system and get the np.array directly.
P.S.: I tried np.frombuffer(). It doesn't work for me, because the two np.arrays are not the same.
Convert str to numpy.ndarray
To test, you can try it yourself:
file = open('test1.jpg', 'rb')
content = file.read()

My first answer in rap...
Wrap that puppy in a BytesIO
And away you go
So, to generate some synthetic data similar to what you get from the API:
file = open('image.jpg','rb')
content = file.read()
The content looks like this, which has all the hallmarks of a JPEG:
content = b'\xff\xd8\xff\xe0\x00\x10JFIF...
Now for the solution:
from io import BytesIO
from scipy import misc
numpyArray = misc.imread(BytesIO(content))

PyPDF2 difference resulting in 1 character per line

I'm trying to create a simple script that shows me the differences between two PDF files (similar to GitHub merging) using difflib's HtmlDiff class.
So far I've got my PDF files together and am able to print their contents in binary using PyPDF2 functions.
import difflib
import os
import PyPDF2
os.chdir('.../MyPythonScripts/PDFtesterDifflib')
file1 = 'pdf1.pdf'
file2 = 'pdf2.pdf'
file1RL = open(file1, 'rb')
pdfreader1 = PyPDF2.PdfFileReader(file1RL)
PageOBJ1 = pdfreader1.getPage(0)
textOBJ1 = PageOBJ1.extractText()
file2RL = open(file2, 'rb')
pdfreader2 = PyPDF2.PdfFileReader(file2RL)
PageOBJ2 = pdfreader2.getPage(0)
textOBJ2 = PageOBJ2.extractText()
difference = difflib.HtmlDiff().make_file(textOBJ1,textOBJ2,file1,file2)
diff_report = open('...MyPythonScripts/PDFtesterDifflib/diff_report.html','w')
diff_report.write(difference)
diff_report.close()
The result shows one character per line.
How can I get my lines to read normally?
It should read:
1.apples
2.oranges
3. --this line should differ--
I am running Python 3.6 on a Mac.
Thanks in advance!
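
One likely cause: HtmlDiff.make_file expects two lists of lines, and a plain string passed in their place is iterated one character at a time. A minimal sketch of the fix, assuming the extracted text contains newline characters (the function name make_diff_report and its default descriptions are made up for this example):

```python
import difflib


def make_diff_report(text_a, text_b, name_a="pdf1.pdf", name_b="pdf2.pdf"):
    """Return an HTML side-by-side diff of two multi-line strings."""
    # Split each string into a list of lines; passing the raw strings
    # would make difflib treat every character as its own line.
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return difflib.HtmlDiff().make_file(lines_a, lines_b, name_a, name_b)
```

In the script above this would be called as make_diff_report(textOBJ1, textOBJ2, file1, file2). If the text extracted from the PDF contains no newlines at all, splitlines() yields a single long line and the text would need to be split some other way.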

python receive filename not contents - variable refers to (python3)

I have a script whose output I want to pass to odo. odo takes a filename as input; because I need to tidy the CSV up first, I pass it through a script that creates a new file, which I reference with a variable.
How can I get just the filename from the variable so I can pass it as an argument to odo (from the Blaze project)?
You can see here that when this script is pasted into IPython, I get the entire contents of the file.
In [8]: %paste
from odo import odo
import pandas as pd
from clean2 import clean
import os
filegiven = '20150704RHIL0.csv'
myFile = clean(filegiven)
toUse = (filegiven + '_clean.csv')
print(os.path.realpath(toUse))
## -- End pasted text --
Surfin' Safari 3 0
... Many lines later
Search Squad (NZ) 4 5
C:\Users\sayth\Repos\Notebooks\20150704RHIL0.csv_clean.csv # from print
I just need to get this name so that my script could look like the following, where myFile would give odo the filename, not the contents.
from odo import odo
import pandas as pd
from clean2 import clean
filegiven = '20150704RHIL0.csv'
myFile = clean(filegiven)
odo(myFile, pd.DataFrame)
Solution
This is how I solved it; there are likely better ways.
from odo import odo
import pandas as pd
from clean2 import clean
import os.path
filegiven = '20150704RHIL0.csv'
clean(filegiven)
fileName = os.path.basename(filegiven)
fileNameSplit = fileName.split(".")
fileNameUse = fileNameSplit[0] + '_clean.' + fileNameSplit[1]
odo(fileNameUse, pd.DataFrame)

To get the filename from a file object (assuming it is a standard file object created with open()), use its name attribute.
Example:
>>> f = open("a.py", 'r')
>>> f.name
'a.py'
Note that for your situation this is unnecessary: you could have clean(filegiven) return the filename instead of a file object, and then open the file in your script if you really need the file object.
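
A minimal sketch of that second suggestion follows. Both functions are hypothetical stand-ins: the real clean2.clean is not shown in the question, so the tidy-up step here (dropping blank lines) is only a placeholder. os.path.splitext is used instead of split(".") so that filenames containing several dots are handled correctly.

```python
import os.path


def cleaned_name(path):
    """Return the path with '_clean' inserted before the extension."""
    root, ext = os.path.splitext(path)
    return root + "_clean" + ext


def clean(path):
    """Hypothetical stand-in for clean2.clean: write a tidied copy, return its path."""
    target = cleaned_name(path)
    with open(path, newline="") as src, open(target, "w", newline="") as dst:
        for line in src:
            if line.strip():  # placeholder tidy-up: skip blank lines
                dst.write(line)
    return target
```

With a clean() of this shape, the script reduces to odo(clean(filegiven), pd.DataFrame).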

Custom filetype in Python 3

How do I start creating my own file type in Python? I have a design in mind, but how do I pack my data into a file with a specific format?
For example, I would like my file format to be a mix of an archive (like zip, apk, jar and so on, which are basically all archives) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement is to do all this with the default modules of CPython, without external modules.
I know that this may take long to explain and to do, but I can't see how to start on it in Python 3.x with CPython.

Try this:
from zipfile import ZipFile
import json
data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive containing a JSON file (which is easy to read back in many languages) for your data. You can add files to the archive with myzip.write or myzip.writestr. You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can no longer process the zip file.
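
An alternative layout, sketched below under stated assumptions, keeps the settings in a small header in front of the zip data and strips that header before handing the rest to zipfile, so Python can still read the archive part. The MAGIC signature, the 4-byte length field and the function names pack and unpack are all invented for this example; whether a given archive manager opens such a file is not guaranteed.

```python
import io
import json
import struct
import zipfile

MAGIC = b"MYFT"  # hypothetical 4-byte signature identifying the custom format


def pack(settings, files):
    """Return bytes: MAGIC, header length, JSON settings, then a zip of files."""
    header = json.dumps(settings).encode("utf-8")
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    # "<I" is a little-endian unsigned 32-bit length for the settings block
    return MAGIC + struct.pack("<I", len(header)) + header + archive.getvalue()


def unpack(blob):
    """Inverse of pack: return (settings, {name: bytes})."""
    if blob[:4] != MAGIC:
        raise ValueError("not a file in the custom format")
    (size,) = struct.unpack("<I", blob[4:8])
    settings = json.loads(blob[8:8 + size].decode("utf-8"))
    # Everything after the header is an ordinary zip archive.
    with zipfile.ZipFile(io.BytesIO(blob[8 + size:])) as zf:
        files = {name: zf.read(name) for name in zf.namelist()}
    return settings, files
```

Writing the result of pack() with open(path, 'wb') and reading it back with open(path, 'rb') gives the on-disk format; only the standard library is used.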

Use this:
import base64
import gzip
import ast
def save(data):
    # repr() (via !r) keeps quotes around strings so literal_eval can parse them
    data = "[{!r}]".format(data).encode()
    data = base64.b64encode(data)
    return gzip.compress(data)
def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with a file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
An archive program that understands gzip can still decompress the file,
but it will only see base64 text, which has to be decoded to get at the data.
You can store any value that ast.literal_eval can parse (strings, bytes, numbers, tuples, lists, dicts, sets, booleans and None).
Example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# other literal types work the same way

This may not be exactly what you asked for, but I think it may help you.
I faced a similar problem and ended up creating a zip file and then renaming its extension to my custom file format. It can still be opened with WinRAR, though.
