The following code tries to edit part of text in a PDF file:
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject
in_file="input.pdf"
pdf = PdfFileReader(in_file)
#Just first page is subjected to be edited
page=pdf.pages[0]
contents=page["/Contents"]
#contents[1] is a IndirectObject of PyPDF2, so EncodedStreamObject can be obtained by get_object()
ogg=contents[1].get_object()
#obtaining byte datas
enc_data=ogg.get_data()
#decoding (in string) in order to be editable
dec_data=enc_data.decode('utf-8')
new_dec_data=dec_data.replace("old text string","new text string")
#returning to bytes format but with new text replaced
new_enc_data=new_dec_data.encode('utf-8')
#HERE is the problem !
#Looking in script lib i couldnt resolve the final step. setData() doesnt work as it should.
ogg.decodedSelf.setData( new_enc_data)
#print(ogg)
writer = PdfFileWriter()
writer.addPage(page)
with open("output.pdf", 'wb') as out_file:
writer.write(out_file)
Of course output.pdf corresponds to original input pdf file.
Just linking the interested object : https://fossies.org/dox/openslides-2.3-portable/classPyPDF2_1_1generic_1_1EncodedStreamObject.html
Has anyone else experienced the same problem ?
Maybe im not understanding actual issue.
Resolved from myself.
EncodedStreamObject's setData() doesn't prevent to edit its private attribute _data. So you can edit it externally.
ogg._data = new_enc_data
Related
So first off what im trying to do: create a pdf parser that will take ONLY tables out of any given pdf. I currently have some pdfs that are for parts manuals which contain an image of the part and then a table for details of the parts and I want to scrape and parse the table data from the pdf into a csv or similar excel style file(csv, xls etc)
What ive tried/trying: I am currently using python3 and tabula(i have no preference for either of these and open to other options) in which I have a py program that is able to scrape all the data of any pdf or directory of pdfs however it takes EVERYTHING including the image file code that has a bunch of 0 1 NaN(adding examples at the bottom). I was thinking of writing a filter function that removes these however that feels like overkill and was wondering/hoping there is a way to filter out the images with tabula or another library? (side note ive also attempted camelot however the module is not importing correctly even when it is in my pip freeze and this happens on both my mac m1 and mac m2 so assuming there is no arm support)
If anyone could help me or help guide me in a direction of a library or method of being able to iterate through all pages in a pdf and JUST grab the tables for export t csv that would be AMAZING!
current main file:
from tabula.io import read_pdf;
from traceback import print_tb;
import pandas as pd;
from tabulate import tabulate;
import os
def parser(fileName, count):
print("\nFile Number: ",count, "\nNow parsing file: ", fileName)
df = read_pdf(fileName, pages="all") #address of pdf file
for i in range(len(df)):
df[i].to_excel("./output/test"+str(i)+".xlsx")
print(tabulate(df))
print_tb(df)
def reader(type):
filecount = 1
if(type == 'f'):
file = input("\nFile(f) type selected\nplease enter full file name with path (ex. Users/Name/directory1/filename.pdf: ")
parser(file, filecount)
elif(type == 'd'):
#directory selected
location = input("\nPlease enter diectory path, if in the same folder just enter a period(.)")
print("Opening directory: ", location)
#loop through and parse directory
for filename in os.listdir(location):
f = os.path.join(location, filename)
# checking if it is a file
if os.path.isfile(f):
parser(f, filecount)
filecount + 1
else:
print('\n\n ERROR, path given does not contain a file or is not a directory type..')
else:
print("Error: please select directory(d) or file(f)")
fileType = input("\n-----> Hello!\n----> Would you like to parse a directory(d) or file(f)?").lower()
reader(fileType)
I am using the Tabula module in Python.
I am trying to output text from a PDF.
I am using this code:
pdf_read = tabula.read_pdf(
input_path = "Test File.pdf",
pages = start_page_number,
guess=False,
area=(81.735,18.55,391.285,273.61),
relative_area = False,
format="TSV",
output_path="testing_area.tsv"
)
When I go to run my code, it says "The output file is empty."
Any idea why this could be?
Edit: If I remove everything except the input_path and pages, my data is getting read into pdf_read correctly, it just does not output into an external file.
Something is wrong with this option...hmm...
Edit #2: I figured out why the area part was not working and now it is, but I still can't get this to output a file for some reason.
Edit #3: I tried looking at this: How to convert PDF to CSV with tabula-py?
But I keep getting an error message: "build_options() got an unexpected keyword argument 'spreadsheet'
Edit #4: I'm using the latest version of tabula.py, which doesn't have the spreadsheet option.
Still can't output a file with data though.
I don't know why that wasn't working above, so the output of pdf_read is a list.
I converted the list into a dataframe and then output the dataframe using to_csv.
Code is below:
import pandas as pd
df = pd.DataFrame(pdf_read,columns=["column_a"])
output_df = df.to_csv(
"alternative_attempt_1.txt",
header=True,
index=True,
sep='\t',
mode='w',
encoding="cp1252"
)
I'm trying to optimize my code.
First, I get an image, which type is bytes
Then I have to write that image to file system.
with open('test2.jpg', 'wb') as f:
f.write(content)
Finally I read this image with
from scipy import misc
misc.imread('test2.jpg')
which convert image to np.array.
I want to skip part where I write image to file system, and get np.array.
P.S.: I tried to use np.frombuffer(). It doesn't work for me, cause two np.arrays are not the same.
Convert str to numpy.ndarray
For test you can try yourself:
file = open('test1.jpg', 'rb')
content = file.read()
My first answer in rap...
Wrap that puppy in a BytesIO
And away you go
So, to generate some synthetic data similar to what you get from the API:
file = open('image.jpg','rb')
content = file.read()
That looks like this which has all the hallmarks of a JPEG:
content = b'\xff\xd8\xff\xe0\x00\x10JFIF...
Now for the solution:
from io import BytesIO
from scipy import misc
numpyArray = misc.imread(BytesIO(content))
First off i must say i am VERY new to programming (less then a week experience in total). I set out to write a program that generates a series of documents of an .odt template. I want to use a template with a specific keyword lets say "X1234X" and so on. This will then be replaced by values generated from the program. Each document is a little different and the values are entered and calculated via a prompt (dates and other things)
I wrote most of the code so far but i am stuck since 2 days on that problem. I used the ezodf module to generate a new document (with different filenames) from a template but i am stuck on how to edit the content.
I googled hard but came up empty hope someone here could help. I tried reading the documentations but i must be honest...its a bit tough to understand. I am not familiar with the "slang"
Thanks
PS: a ezodf method would be great, but any other ways will do too. The program doesnt have to be pretty it just has to work (so i can work less ^_^)
Well i figured it out. nd finished the program. I used a ezodf to create the file, then zipfile to extract and edit the content.xml and then repacked the whole thing via a nice >def thingy< from here. I tried to mess with etree...but i couldnt figure it out...
from ezodf import newdoc
import os
import zipfile
import tempfile
for s in temp2:
input2 = s
input2 = str(s)
input1 = cname[0]
file1 = '.odt'
namef = input2 + input1 + file1
odt = newdoc(doctype='odt', filename=namef, template='template.odt')
odt.save()
a = zipfile.ZipFile('template.odt')
content = a.read('content.xml')
content = str(content.decode(encoding='utf8'))
content = str.replace(content,"XXDATEXX", input2)
content = str.replace(content, 'XXNAMEXX', input1)
def updateZip(zipname, filename, data):
# generate a temp file
tmpfd, tmpname = tempfile.mkstemp(dir=os.path.dirname(zipname))
os.close(tmpfd)
# create a temp copy of the archive without filename
with zipfile.ZipFile(zipname, 'r') as zin:
with zipfile.ZipFile(tmpname, 'w') as zout:
zout.comment = zin.comment # preserve the comment
for item in zin.infolist():
if item.filename != filename:
zout.writestr(item, zin.read(item.filename))
# replace with the temp archive
os.remove(zipname)
os.rename(tmpname, zipname)
# now add filename with its new data
with zipfile.ZipFile(zipname, mode='a', compression=zipfile.ZIP_DEFLATED) as zf:
zf.writestr(filename, data)
updateZip(namef, 'content.xml', content)
How to start creating my own filetype in Python ? I have a design in mind but how to pack my data into a file with a specific format ?
For example I would like my fileformat to be a mix of an archive ( like other format such as zip, apk, jar, etc etc, they are basically all archives ) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement for this is about doing all this with the default modules for Cpython, without external modules.
I know that this can be long to explain and do, but I can't see how to start this in Python 3.x with Cpython.
Try this:
from zipfile import ZipFile
import json
data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
myzip.writestr('digest.json', data)
The file is now a zip archive with a json file (thats easy to read in again in many lannguages) for data you can add files to the archive with myzip write or writestr. You can read data back with:
with ZipFile('foo.filetype', 'r') as myzip:
json_data_read = myzip.read('digest.json')
newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
this works for winrar but python can no longer process the zipfile.
Use this:
import base64
import gzip
import ast
def save(data):
data = "[{}]".format(data).encode()
data = base64.b64encode(data)
return gzip.compress(data)
def load(data):
data = gzip.decompress(data)
data = base64.b64decode(data)
return ast.literal_eval(data.decode())[0]
How to use this with file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
This might look like this is able to be open with archive program
but it cannot because it is base64 encoded and they have to decode it to access it.
Also you can store any type of variable in it!
example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be appropriate for your question but I think this may help you.
I have a similar problem faced... but end up with some thing like creating a zip file and then renamed the zip file format to my custom file format... But it can be opened with the winRar.