I'm using the LibreOffice CLI to convert two kinds of files:
.XLSX to .PDF
.PPTX to .PDF
To do this, I'm using the code below:
import re
import subprocess

args = ['libreoffice', '--headless', '--convert-to',
        'pdf', '--outdir', 'output', 'my_file.pptx']
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=25)
# Look for the output path that LibreOffice reports on stdout
re.search('-> (.*?) using filter', process.stdout.decode())
It converts the files to PDF, but unfortunately it does not keep the same layout.
Continuation of the previous page (with the footer) and header of the next page:
As you can see in the images, the headers have different sizes, and a strange value, "x000a", also appears in the header.
I would like to know whether there is an alternative that fixes this kind of problem.
Regards,
Leonardo
You should not expect a 100% replica of Microsoft Office output from LibreOffice. Anyway, if you think this is a bug, you can report it to the LibreOffice Bugzilla with a good description and a sample document that reproduces it:
https://bugs.documentfoundation.org/
There are faster ways to convert a document to PDF, if you can have a running LibreOffice in the background. For example, see this project which is written in Python:
Python script to automate document conversions using LibreOffice/OpenOffice.org
https://github.com/mirkonasato/pyodconverter
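For the subprocess approach in the question, a minimal sketch of checking the exit code and reading the output path from LibreOffice's stdout might look like this. It assumes the stdout line has the form "convert <input> -> <output> using filter : <name>", which is what the regular expression in the question relies on. The helper names parse_output_path and convert_to_pdf, and the sample line, are illustrative and not part of LibreOffice.

```python
import re
import subprocess


def parse_output_path(stdout_text):
    """Return the path that follows '->' in LibreOffice's report, or None."""
    match = re.search(r'-> (.*?) using filter', stdout_text)
    return match.group(1) if match else None


def convert_to_pdf(input_path, outdir='output', timeout=25):
    """Run a headless conversion and return the reported PDF path (or None)."""
    args = ['libreoffice', '--headless', '--convert-to', 'pdf',
            '--outdir', outdir, input_path]
    process = subprocess.run(args, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, timeout=timeout)
    if process.returncode != 0:
        # Surface LibreOffice's own error text instead of failing silently
        raise RuntimeError(process.stderr.decode(errors='replace'))
    return parse_output_path(process.stdout.decode(errors='replace'))


# Illustrative stdout line in the assumed format, not captured from a real run
sample = ('convert /tmp/my_file.pptx -> /tmp/output/my_file.pdf '
          'using filter : impress_pdf_Export')
print(parse_output_path(sample))
```

Keeping the parsing in its own function makes it possible to check the regular expression without having LibreOffice installed.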
I have been searching the web for hours trying to find something that might help me convert a file saved in the ppt file type to the pptx file type using Python. I found "python-pptx" and was planning on using it to save the files, but this was not possible because it keeps raising this error:
Package not found at 'FileName.ppt'
I discovered another post (Convert ppt file to pptx in Python), which did not help me at all. I assume that is because my Python version (3.9) might be too recent. After reading up on getting win32com.client to work and running multiple pip and pip3 install commands, it is still not working. If anyone could assist me with this matter, I would be very thankful. My current code:
from pptx import Presentation

# Raises "Package not found": python-pptx only opens .pptx packages
prs = Presentation("FileName.ppt")
prs.save("FileName.pptx")
You can use Aspose.Slides for .NET together with the Python.NET package to convert PPT to PPTX, as shown below:
import clr
clr.AddReference('Aspose.Slides')
from Aspose.Slides import Presentation
from Aspose.Slides.Export import SaveFormat
# Instantiate a Presentation object that represents a PPT file
presentation = Presentation("presentation.ppt")
# Save the presentation as PPTX
presentation.Save("presentation.pptx", SaveFormat.Pptx)
Our web applications use our libraries and you can see conversion results here.
I work at Aspose.
I doubt python-pptx can parse a .ppt file. (It's a completely different file format.) You're better off automating PowerPoint itself - somehow - to read one and write the other.
The "somehow" depends on the platform you're running on - and the automation capabilities available to you.
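To illustrate the "completely different file format" point: a .pptx file is a ZIP archive, while a legacy .ppt file is an OLE2 compound binary file, and the two can be told apart by their first bytes. The sketch below is a hypothetical helper (sniff_presentation_format is not part of python-pptx) that checks those signatures before a file is handed to a library. The same signatures also match other Office formats, so it identifies the container type only.

```python
# First bytes of a ZIP archive (.pptx, but also .docx, .xlsx, plain .zip)
ZIP_MAGIC = b'PK\x03\x04'
# First bytes of an OLE2 compound file (.ppt, but also .doc, .xls)
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'


def sniff_presentation_format(path):
    """Return 'zip', 'ole2' or 'unknown' based on the file's leading bytes."""
    with open(path, 'rb') as f:
        header = f.read(8)
    if header.startswith(ZIP_MAGIC):
        return 'zip'      # container used by .pptx
    if header.startswith(OLE2_MAGIC):
        return 'ole2'     # container used by legacy .ppt
    return 'unknown'
```

A result of 'ole2' for "FileName.ppt" would explain the "Package not found" error, since python-pptx expects the ZIP-based container.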
I recently merged 20 PDFs into one PDF with Adobe, and I imported the merged PDF into Python with this code:
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file = open('/Users/cj/Desktop/PEI.pdf', 'rb')
newfile = open('rjtjj.txt', 'w')
pdf_reader = PdfFileReader(pdf_file)
pdf_writer = PdfFileWriter()
print(pdf_reader.numPages)
n = pdf_reader.getNumPages()
# range(n) covers every page; range(0, n-1) would skip the last one
for i in range(n):
    # pdf_writer.addPage(pdf_reader.getPage(i))
    gft = pdf_reader.getPage(i)
    newfile.write(gft.extractText())
pdf_file.close()
newfile.close()
I'm trying to use vaderSentiment to analyse the PDF. What I want to do is analyse each of the 20 PDFs that were merged into one individually.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
with open('rjtjj.txt', 'r') as f:
    for line in f.read().split("\n"):
        vs = analyzer.polarity_scores(line)
I know my code is wrong, because it only gives me the first line of the entire PDF. I am new to this, and I would really appreciate your help.
Thank you
Your problem really isn't about Vader sentiment analysis -- it is about correct extraction of text from a PDF.
PostScript, the Forth-like language that PDF descends from, is Turing-complete, and PDF keeps much of its page-drawing model, so the text in some PDF documents is "hard" to extract. You didn't post your PDF, so we can only guess at the issue. You might try poppler's pdftotext command-line utility instead. Ubuntu calls the package "poppler-utils"; on macOS you would use brew install poppler. Running the file through Ghostscript's pdf2ps and ps2ascii will sometimes offer different, and helpful, results.
If you continue to find it difficult to retrieve proper text from the PDF, you may want to contact whoever produced the PDF and settle on supplying the same information in a revised format.
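As a rough sketch of the pdftotext route, the code below calls the utility from Python, splits its output into pages on the form feed character that pdftotext writes at the end of each page, and regroups the pages into the original documents. It assumes you know how many pages each of the 20 source PDFs had; pages_per_document and the function names are hypothetical, and pdftotext must be installed for pdf_to_text to run.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Run poppler's pdftotext and return the whole document as one string."""
    result = subprocess.run(['pdftotext', '-layout', pdf_path, '-'],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            check=True)
    return result.stdout.decode('utf-8', errors='replace')


def split_pages(text):
    """Split pdftotext output into pages on the form feed character."""
    pages = text.split('\f')
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty piece after the last form feed
    return pages


def group_pages(pages, pages_per_document):
    """Join consecutive pages into one text per original document."""
    if sum(pages_per_document) != len(pages):
        raise ValueError('page counts do not add up to the number of pages')
    documents, start = [], 0
    for count in pages_per_document:
        documents.append('\n'.join(pages[start:start + count]))
        start += count
    return documents
```

Each string returned by group_pages could then be passed to analyzer.polarity_scores from the question, giving one score per original PDF rather than one per line.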
I'm trying to extract the title of PDF files using PyPDF2. The output is either None or a wrong title. I tried PDFMiner as well, with the same result. I tried three different PDF files. Is there a more accurate way to extract the title?
This is the code I used:
from PyPDF2 import PdfFileReader

def get_pdf_title(pdf_file_path):
    pdf_reader = PdfFileReader(open(pdf_file_path, "rb"))
    return pdf_reader.getDocumentInfo().title

title = get_pdf_title('C:/PythonPrograms/Test.pdf')
print(title)
Your code is working, at least for me on Python 3.5.2. Check in the PDF properties that the file actually has a title.
A PDF's title is part of its metadata and has to be set explicitly. It is not mandatory, and it is related neither to the document's content (other than by the will of the person writing it) nor to its filename.
If you use your snippet on a file with no title, its output will be None or an empty string.
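Because the metadata title may be missing or blank, calling code usually needs to decide what to do in that case. The sketch below is one possible policy, not something PyPDF2 provides: it keeps a non-blank metadata title and otherwise falls back to the file name without its extension. The name title_or_fallback and the fallback choice are assumptions made for illustration.

```python
from pathlib import Path


def title_or_fallback(metadata_title, pdf_path):
    """Return the metadata title if it has content, else the file's stem."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    # No usable title in the metadata: fall back to the file name
    return Path(pdf_path).stem
```

With the question's snippet, this could be called as title_or_fallback(get_pdf_title(path), path).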
I am working on a script that "reads" PDF files and then automatically renames the files it recognizes from a dictionary. However, PyPDF2 returns only empty lines for some PDFs, while working fine for others. The code for reading the files:
import PyPDF2

# File name
file = 'sample.pdf'
# Open File
with open(file, "rb") as f:
    # Read in file
    pdfReader = PyPDF2.PdfFileReader(f)
    # Check number of pages
    number_of_pages = pdfReader.numPages
    print(number_of_pages)
    # Get first page
    pageObj = pdfReader.getPage(0)
    # Extract text from page 1
    text = pageObj.extractText()
    print(text)
It does get the number of pages correctly, so it is able to open the PDF.
If I replace the print(text) by repr(text) for files it doesn't read, I get something like:
"'\\n\\n\\n\\n\\n\\n\\n\\nn\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
Weirdly enough, when I enhance (OCR) the files with Adobe, the script performs slightly worse. It recognized 140 out of 800 files before enhancing and just 110 afterwards.
The PDFs are machine readable/searchable, because I am able to copy/paste text to Notepad. I tested some files with "pdfminer" and it does show some text, but it also throws a lot of errors. If possible, I would like to keep working with PyPDF2.
Specifications of the software I am using:
Windows: 10.0.15063
Python: 3.6.1
PyPDF: 1.26.0
Adobe version: 17.009.20058
Does anyone have any suggestions? Your help is very much appreciated!
I had the same issue. I fixed it using another Python library called slate.
Fortunately, I found a fork that works in Python 3.6.5:
import slate3k as slate

# The file name must be a string
with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
print(extracted_text)
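Since PyPDF2 returns only newlines for some files, one way to combine it with a second library is to treat whitespace-only output as a failure and move on to the next extractor. The sketch below shows that pattern with stand-in functions; first_nonblank and the two example extractors are hypothetical, and in practice each extractor would wrap a real call such as the PyPDF2 or slate3k code above.

```python
def first_nonblank(extractors, pdf_path):
    """Try each (name, function) pair in order; return the first real text."""
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            # Extraction libraries raise many different errors; skip and go on
            continue
        if text and text.strip():
            return name, text
    return None, ''


# Stand-ins for real extractors, used only to show the control flow
def blank_extractor(pdf_path):
    return '\n\n\n\n'


def working_extractor(pdf_path):
    return 'Invoice 2017'


print(first_nonblank([('blank', blank_extractor),
                      ('working', working_extractor)], 'sample.pdf'))
```

Returning the extractor's name alongside the text makes it easy to count how many of the 800 files each library handled.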
I'm coming from R and RStudio. In RStudio, you can save objects to an .RData file using save():
save(object_to_save, file = "C:/path/where/RData/file/will/be/saved.RData")
You can then load() the objects:
load(file = "C:/path/where/RData/file/was/saved.RData")
I'm now using Spyder and Python3, and I was wondering if the same thing is possible.
I'm aware that everything in the global environment can be saved to a .spydata file using this:
But I'm looking for a way to save to a .spydata file from code. Basically, just the code behind the buttons.
Bonus points if the answer includes a way to save an object (or multiple objects) and not the whole env.
(Please note I'm not looking for an answer using pickle or shelve, but really something similar to R's load() and save().)
(Spyder developer here) There's no way to do what you ask for with a command in Spyder consoles.
If you'd like to see this in a future Spyder release, please open an issue in our issues tracker about it, so we don't forget to consider it.
Considering the comment here, we can:
1. Rename the file from .spydata to .tar.
2. Extract the file (using a file manager, for example). This yields a .pickle file (and possibly a .npy file).
3. Extract the objects saved from the environment:
import pickle

# path is the location of the extracted .pickle file
with open(path, 'rb') as f:
    data_temp = pickle.load(f)
That object will be a dictionary containing the saved objects.
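The rename and extract steps can also be done in code, because Python's tarfile module does not care about the file extension. The following is a minimal sketch under the assumption described above, namely that the .spydata file is a tar archive containing a .pickle member. load_spydata is a hypothetical helper, not a Spyder API, and any values stored in separate .npy members are not restored by it.

```python
import pickle
import tarfile


def load_spydata(path):
    """Return the dictionary stored in the .pickle member of a .spydata file."""
    with tarfile.open(path) as archive:
        for member in archive.getmembers():
            if member.name.endswith('.pickle'):
                # Read the member straight from the archive, no temp files
                with archive.extractfile(member) as f:
                    # Only unpickle files you trust: pickle can run code
                    return pickle.load(f)
    raise ValueError('no .pickle member found in ' + path)
```

Individual objects can then be picked out of the returned dictionary by name, which covers loading a single object without restoring the whole environment.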